USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING

MUFLIKHAH, LAILIL (2010) USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING. Masters thesis, UNIVERSITI TEKNOLOGI PETRONAS.

Abstract

Documents on a wide range of topics are easily obtained from URLs associated with their titles. However, a title may not describe a document's content and may serve only to attract readers to buy and read it. Clustering documents by category is therefore important for helping users retrieve the information they need. Document clustering is a data mining task: by measuring the similarity of document characteristics, documents can be grouped by category or topic. Representing every substantial word in the vector space model makes the document representation high-dimensional. This high dimensionality is one of the problems in document clustering, and it degrades cluster quality as measured by F-measure, entropy, and accuracy. In the categorical domain, much research has been conducted to reduce the dimension of the term-document matrix, including keyword-based approaches. However, these approaches obtain low accuracy on document collections with various class sizes. This research therefore aims to improve the quality and accuracy of document clustering by using a method from information retrieval. That method, Latent Semantic Indexing (LSI), is proposed to reduce the dimension of the term-document matrix used for document representation. In this work, LSI is used to produce patterns of terms so that documents can be mapped into a concept space. Based on the new representation, the documents are then clustered with the Fuzzy c-Means algorithm. Cosine similarity, used as the distance measure, is also embedded in this algorithm. The results are compared with existing algorithms that serve as benchmarks. They show that the proposed method produces high-quality clusters and outperforms the other fuzzy clustering algorithms for the categorical domain, namely FCCM, FSKWIC, and Fuzzy CoDoK, with an accuracy rate of over 90%.
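The pipeline the abstract describes (LSI to map documents into a concept space, then Fuzzy c-Means with a cosine-based distance) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `lsi_project` and `fuzzy_c_means_cosine`, the toy term-document matrix, the rank `k=2`, and the fuzzifier `m=2.0` are all assumptions chosen for illustration, and the cosine dissimilarity 1 - cos is one common way to embed cosine similarity in Fuzzy c-Means.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Map documents into a k-dimensional concept space via truncated SVD.

    term_doc has shape (n_terms, n_docs); each column is one document.
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Document coordinates: the first k right singular vectors,
    # scaled by their singular values. Result shape: (n_docs, k).
    return (vt[:k, :] * s[:k, None]).T


def _unit_rows(x):
    """Scale each row to unit length, guarding against zero rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def fuzzy_c_means_cosine(x, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-Means using cosine dissimilarity (1 - cosine similarity).

    Returns (memberships, centers); memberships has shape (n_docs, c)
    and each row sums to 1.
    """
    rng = np.random.default_rng(seed)
    x = _unit_rows(x)
    # Random initial memberships, normalised per document.
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        w = u ** m
        # For unit vectors, the center minimising the weighted cosine
        # dissimilarity is the normalised weighted mean.
        centers = _unit_rows(w.T @ x)
        dist = np.maximum(1.0 - x @ centers.T, 1e-12)
        # Membership update with the dissimilarity in place of the
        # squared Euclidean distance.
        inv = dist ** (-1.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(u_new - u).max() < tol
        u = u_new
        if converged:
            break
    return u, centers


# Toy term-document matrix (6 terms x 6 documents), assumed for illustration:
# documents 0-2 mostly use terms 0-2, documents 3-5 mostly use terms 3-5.
A = np.array([
    [2, 1, 3, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [3, 1, 2, 0, 0, 1],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 0, 3, 1, 2],
], dtype=float)

docs = lsi_project(A, k=2)
memberships, centers = fuzzy_c_means_cosine(docs, c=2)
labels = memberships.argmax(axis=1)
print(labels)
```

Each row of `memberships` gives a document's degree of belonging to every cluster; taking the `argmax` turns the fuzzy result into a hard assignment that can be scored with measures such as F-measure, entropy, or accuracy.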

Item Type:	Thesis (Masters)
Departments / MOR / COE:	Sciences and Information Technology
Date Deposited:	05 Jun 2012 08:30
Last Modified:	25 Jan 2017 09:43
URI:	http://utpedia.utp.edu.my/id/eprint/2901
