Ghazali, Nadirah (2006) Support Vector Machines (SVM) in Test Extraction. [Final Year Project] (Unpublished)
Abstract
Text categorization is the process of grouping documents or words into predefined
categories. Each category consists of documents or words having similar attributes.
There exist numerous algorithms to address the need of text categorization including
Naive Bayes, k-nearest-neighbor classifier, and decision trees. In this project, Support
Vector Machines (SVM) are studied and tested through the implementation of a textual
extractor. The extractor pulls the important points from a lengthy document: it
classifies each word in the document under its relevant category and constructs the
structure of the summary with reference to the categorized words. The
performance of the extractor is evaluated using a similar corpus against an existing
summarizer, which uses a different kind of approach. Summarization is part of text
categorization; it is considered an essential part of today's information-led
society and has been a growing area of research for over 40 years. This project's
objective is to create a summarizer, or extractor, based on two machine learning
algorithms, namely SVM and K-Means. Each word in a document is processed by both
algorithms. K-Means first clusters the words into categories based on part of speech
(verb, noun, adjective). SVM then determines the actual occurrence of each word in
each cluster, taking into account whether the word has a similar meaning to other
words in the subsequent cluster. The corpus
chosen to evaluate the application is the Reuters-21578 dataset, which comprises
newspaper articles. The application is evaluated against a system-generated extract
from a summarizer already on the market, to observe how many sentences the two
extracts have in common; the systems compared here are the Text Extractor and
Microsoft Word AutoSummarizer. Results show that the Text Extractor performs best at
compression rates of 10-20% and 35-45%.
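
The evaluation described in the abstract compares two extracts by counting the sentences they share at a given compression rate. The following is a minimal Python sketch of that measurement, not the project's actual implementation. It rests on assumptions the abstract does not state: the compression rate is taken to be the fraction of source sentences kept in the extract, sentences are split naively on terminal punctuation, and overlap is reported as shared sentences divided by the size of the reference extract. The names `split_sentences`, `extract_size`, and `sentence_overlap` are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_size(num_sentences, compression_rate):
    """Sentences kept at a compression rate (assumed: fraction of sentences kept)."""
    if not 0 < compression_rate <= 1:
        raise ValueError("compression_rate must be in (0, 1]")
    # Round first so floating-point noise (e.g. 10 * 0.3) does not bump the ceiling.
    return max(1, math.ceil(round(num_sentences * compression_rate, 9)))


def _normalize(sentence):
    """Lowercase and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(sentence.lower().split())


def sentence_overlap(candidate, reference):
    """Fraction of reference-extract sentences that also appear in the candidate."""
    cand = {_normalize(s) for s in candidate}
    ref = {_normalize(s) for s in reference}
    if not ref:
        return 0.0
    return len(cand & ref) / len(ref)


if __name__ == "__main__":
    # Hypothetical extracts from two summarizers over the same source text.
    source = "Oil prices rose. Traders were cautious. Stocks fell. Bonds held. Gold was flat."
    sentences = split_sentences(source)
    k = extract_size(len(sentences), 0.4)
    extract_a = sentences[:k]
    extract_b = [sentences[0], sentences[2]]
    print(k, sentence_overlap(extract_a, extract_b))
```

Dividing by the reference extract's size makes the score asymmetric. A symmetric alternative, such as dividing by the size of the union of the two extracts, would be an equally reasonable choice when neither system is treated as the reference.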
Item Type: | Final Year Project |
---|---|
Subjects: | T Technology > T Technology (General) |
Departments / MOR / COE: | Sciences and Information Technology |
Date Deposited: | 22 Oct 2013 11:49 |
Last Modified: | 25 Jan 2017 09:46 |
URI: | http://utpedia.utp.edu.my/id/eprint/9323 |