Document classification using machine learning

A Basarkar - 2017 - scholarworks.sjsu.edu
Abstract
To classify documents algorithmically, each document must be represented in a form that a machine learning classifier can process. This report discusses the types of feature vectors that can represent a document for classification. The project compares the Binary, Count, and TfIdf feature vectors and their impact on document classification. To test how well each of the three feature vectors performs, we used the 20-newsgroup dataset and converted the documents to all three representations. For each representation, we trained a Naïve Bayes classifier and then evaluated it on test documents. With stop words removed, TfIdf performed 4% better than the Count vectorizer and 6% better than the Binary vectorizer. With stop words retained, TfIdf performed 6% better than the Binary vectorizer and 11% better than the Count vectorizer. The Count vectorizer outperformed the Binary vectorizer by 2% when stop words were removed but lagged behind it by 5% when they were retained. We therefore conclude that TfIdf should be the preferred vectorizer for document representation and classification.
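The pipeline described in the abstract (vectorize, train Naïve Bayes, test) can be illustrated with a minimal standard-library sketch. This is not the report's implementation. The whitespace tokenizer, the smoothed idf formula, the Laplace smoothing constant `alpha`, the toy corpus, and all function names (`tokenize`, `fit_idf`, `vectorize`, `train_nb`, `predict_nb`, `classify`) are assumptions made for illustration. Feeding fractional TfIdf weights to a multinomial Naïve Bayes model is a common practice, but it is likewise an assumption here.

```python
import math
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, drop stop words (assumed tokenizer)."""
    return [t for t in text.lower().split() if t not in stop_words]


def fit_idf(token_docs):
    """Smoothed inverse document frequency per training term (assumed formula)."""
    n = len(token_docs)
    df = Counter(term for doc in token_docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(tokens, scheme, idf):
    """Map a token list to a sparse {term: weight} feature vector."""
    counts = Counter(t for t in tokens if t in idf)  # skip terms unseen in training
    if scheme == "binary":   # 1 if the term occurs at all
        return {t: 1.0 for t in counts}
    if scheme == "count":    # raw term frequency
        return {t: float(c) for t, c in counts.items()}
    if scheme == "tfidf":    # term frequency scaled by idf
        return {t: c * idf[t] for t, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")


def train_nb(vectors, labels, vocab_size, alpha=1.0):
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""
    class_counts = Counter(labels)
    weights = {c: Counter() for c in class_counts}
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            weights[label][term] += w
    model = {}
    for c, per_term in weights.items():
        denom = sum(per_term.values()) + alpha * vocab_size
        model[c] = (math.log(class_counts[c] / len(labels)), per_term, denom)
    return model, alpha


def predict_nb(trained, vec):
    """Return the class with the highest log posterior score."""
    model, alpha = trained

    def score(c):
        log_prior, per_term, denom = model[c]
        return log_prior + sum(
            w * math.log((per_term[t] + alpha) / denom) for t, w in vec.items()
        )

    return max(model, key=score)


def classify(train_docs, train_labels, test_docs, scheme, stop_words=frozenset()):
    """Vectorize with one scheme, train on the training set, label the test set."""
    train_tokens = [tokenize(d, stop_words) for d in train_docs]
    idf = fit_idf(train_tokens)
    train_vecs = [vectorize(t, scheme, idf) for t in train_tokens]
    trained = train_nb(train_vecs, train_labels, vocab_size=len(idf))
    return [
        predict_nb(trained, vectorize(tokenize(d, stop_words), scheme, idf))
        for d in test_docs
    ]


# Toy corpus (hypothetical; stands in for the 20-newsgroup data).
TRAIN_DOCS = [
    "the rocket launch reached orbit",
    "nasa launched the satellite into orbit",
    "the team won the hockey game",
    "the goalie saved the game for the team",
]
TRAIN_LABELS = ["space", "space", "sport", "sport"]
TEST_DOCS = ["satellite orbit", "hockey team game"]
STOP_WORDS = frozenset({"the", "for", "into"})

for scheme in ("binary", "count", "tfidf"):
    print(scheme, classify(TRAIN_DOCS, TRAIN_LABELS, TEST_DOCS, scheme, STOP_WORDS))
```

Passing an empty `stop_words` set to `classify` reproduces the "stop words not removed" condition, so the same function covers both settings compared in the abstract. A corpus this small cannot reproduce the reported accuracy gaps. It only shows how the three representations differ in the weights they hand to the classifier.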