作者
Wareesa Sharif
发表日期
2019
机构
Universiti Tun Hussein Onn Malaysia
简介
Classification has become an important task for automatically classifying the documents to their respective categories. For text classification, feature selection techniques are normally used to identify important features and to remove irrelevant, and noisy features for minimizing the dimensionality of feature space. These techniques are expected particularly to improve efficiency, accuracy, and comprehensibility of the classification models in text labeling problems. Most of the feature selection techniques utilize document and term frequencies to rank a term. Existing feature selection techniques (e.g. RDC, NRDC) consider frequently occurring terms and ignore rarely occurring terms count in a class. However, this study proposes the Improved Relative Discriminative Criterion (IRDC) technique which considers rarely occurring terms count. It is argued that rarely occurring terms count are also meaningful and important as frequently occurring terms in a class. The proposed IRDC is compared to the most recent feature selection techniques RDC and NRDC. The results reveal significant improvement by the proposed IRDC technique for feature selection in terms of precision 27%, recall 30%, macro-average 35% and micro- average 30%. Additionally, this study also proposes a hybrid algorithm named: Ringed Seal Search-Support Vector Machine (RSS-SVM) to improve the generalization and learning capability of the SVM. The proposed RSS-SVM optimizes kernel and penalty parameter with the help of RSS algorithm. The proposed RSS-SVM is compared to the most recent techniques GA-SVM and CS-SVM. The results show significant improvement …
引用总数