Improving text classification accuracy using topic modeling over an additional corpus

S Banerjee - Proceedings of the 31st annual international ACM SIGIR conference on …, 2008 - dl.acm.org
The World Wide Web has many document repositories that can act as valuable sources of additional data for various machine learning tasks. In this paper, we propose a method of improving text classification accuracy by using such an additional corpus that can easily be obtained from the web. This additional corpus can be unlabeled and independent of the given classification task. The method proposed here uses topic modeling to extract a set of topics from the additional corpus. Those extracted topics then act as additional features of the data of the given classification task. An evaluation on the RCV1 dataset shows significant improvement over a baseline method.
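The pipeline in the abstract (fit topics on an unlabeled additional corpus, then append topic features to each document of the classification task) might be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say which topic model is used, so LDA fitted by collapsed Gibbs sampling is assumed here, and the per-document topic weights are a crude average rather than proper inference. The function names `fit_topics`, `topic_features`, and `augment`, the hyperparameters, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter


def fit_topics(docs, n_topics, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed choice of topic model)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # token count per topic
    doc_topic = [[0] * n_topics for _ in docs]         # topic counts per document
    assign = []                                        # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, z in zip(doc, assign[d]):
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + len(vocab) * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1

    # Smoothed word probabilities for each topic over the corpus vocabulary.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + len(vocab) * beta)
         for w in vocab}
        for k in range(n_topics)
    ]


def topic_features(doc, topics):
    """Average p(topic | word) over the words of doc that the topics know."""
    n_topics = len(topics)
    known = [w for w in doc if w in topics[0]]
    if not known:
        return [1.0 / n_topics] * n_topics  # no evidence: uniform weights
    scores = [0.0] * n_topics
    for w in known:
        total = sum(t[w] for t in topics)
        for k, t in enumerate(topics):
            scores[k] += t[w] / total / len(known)
    return scores


def augment(doc, topics):
    """Bag-of-words counts plus one extra feature per extracted topic."""
    features = dict(Counter(doc))
    for k, p in enumerate(topic_features(doc, topics)):
        features[f"topic_{k}"] = p
    return features


# Toy stand-in for the unlabeled additional corpus (invented for illustration).
extra_corpus = [
    "stocks market shares trading prices".split(),
    "market prices fell stocks investors".split(),
    "match team goal players season".split(),
    "team season coach players win".split(),
]
topics = fit_topics(extra_corpus, n_topics=2)
features = augment("stocks fell sharply".split(), topics)
```

The resulting `features` dictionary keeps the original word counts and adds `topic_0` and `topic_1`, so it can be handed to any classifier that accepts sparse feature dictionaries. The additional corpus never needs labels, which matches the abstract's point that it can be independent of the classification task.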