USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING

MUFLIKHAH, LAILIL (2010) USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING. Masters thesis, UNIVERSITI TEKNOLOGI PETRONAS.

Abstract

Documents with varied contents are easily obtained from URLs associated with their titles. However, a title may not describe the document's contents and may serve only to attract readers to buy and read it. Clustering documents by category is therefore important for helping users retrieve the information they need. Document clustering is a data mining task: using a similarity measure over document characteristics, documents can be grouped by category or topic. Representing every substantial word in the vector space model gives the document representation a high dimensionality. This is one of the problems in document clustering, and it degrades cluster quality as measured by F-measure, entropy and accuracy. In the categorical domain, much research has been conducted on reducing the dimension of the term-document matrix representation, including keyword-based approaches. However, these approaches yield low accuracy across document collections with various class sizes. This research therefore aims to improve the quality and accuracy of document clustering by using a method from information retrieval, Latent Semantic Indexing (LSI), to reduce the dimension of the term-document matrix used for document representation. In this work, LSI is used to produce patterns of terms so that documents can be mapped into a concept space. Based on the new representation, the documents are then clustered with the Fuzzy c-Means algorithm, in which cosine similarity is embedded as the distance measure. The results are compared with existing algorithms used as benchmarks. They show that the proposed method obtains high-quality clusters and is superior to other fuzzy clustering algorithms for categorical data, namely FCCM, FSKWIC and Fuzzy CoDoK, with an accuracy rate of over 90%.
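The pipeline described in the abstract (LSI to map documents into a concept space, then Fuzzy c-Means with a cosine-based distance) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the toy term-document matrix, the function names, the choice of k = 2 concepts, the fuzzifier m = 2 and the use of 1 minus cosine similarity as the dissimilarity are all assumptions made for the example.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Map documents into a k-dimensional concept space via truncated SVD."""
    # term_doc has shape (terms, documents); keep the k largest singular values.
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row i of the result holds the concept-space coordinates of document i.
    return (s[:k, None] * vt[:k]).T

def normalize_rows(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def fuzzy_c_means_cosine(x, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-Means using 1 - cosine similarity as the dissimilarity (assumed form)."""
    rng = np.random.default_rng(seed)
    x = normalize_rows(x)
    # Random initial memberships; each row sums to 1.
    memb = rng.random((x.shape[0], c))
    memb /= memb.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        weights = memb ** m
        # Cluster centers: membership-weighted sums, renormalized to unit length.
        centers = normalize_rows(weights.T @ x)
        # Cosine dissimilarity, clipped to avoid division by zero.
        dist = np.maximum(1.0 - x @ centers.T, 1e-12)
        inv = dist ** (-1.0 / (m - 1.0))
        new_memb = inv / inv.sum(axis=1, keepdims=True)
        shift = np.abs(new_memb - memb).max()
        memb = new_memb
        if shift < tol:
            break
    return memb, centers

# Hypothetical toy term-document matrix (rows: terms, columns: documents).
# Documents 0-2 mostly share the first three terms, documents 3-5 the last three.
A = np.array([
    [2, 1, 3, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [3, 1, 2, 0, 0, 1],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2, 3],
    [0, 0, 0, 3, 1, 2],
], dtype=float)

docs = lsi_project(A, k=2)                   # documents in a 2-D concept space
memb, centers = fuzzy_c_means_cosine(docs, c=2)
labels = memb.argmax(axis=1)                 # hard labels from the fuzzy memberships
print(labels)
```

Each row of `memb` gives a document's degree of membership in every cluster, so a document that mixes topics can belong partly to both. Taking the `argmax` is just one convenient way to obtain hard labels for evaluation measures such as accuracy.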

Item Type: Thesis (Masters)
Subject: UNSPECIFIED
Divisions: Sciences and Information Technology
Date Deposited: 05 Jun 2012 08:30
Last Modified: 25 Jan 2017 09:43
URI: http://utpedia.utp.edu.my/id/eprint/2901
