Optimal Feature Subset Selection Based on Combining Document Frequency and Term Frequency for Text Classification

Thirumoorthy Karpagalingam

Department of Computer Science and Engineering, Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India
Muneeswaran Karuppaiah

Department of Computer Science and Engineering, Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India

keywords: Feature selection, text classification, document frequency, term frequency

Feature selection plays a vital role in reducing the high dimensionality of the feature space in text document classification. Reducing the dimensionality of the feature space lowers the computational cost and improves the accuracy of the text classification system. Hence, a proper subset of the significant features of the text corpus must be identified in order to classify the data in less computational time and with higher accuracy. In this research, we propose a novel feature selection method that combines document frequency and term frequency (FS-DFTF) to measure the significance of a term. The optimal feature subset selected by the proposed method is evaluated using Naive Bayes and Support Vector Machine classifiers on several popular benchmark text corpora. The experimental results confirm that the proposed method achieves better classification accuracy than other feature selection techniques.

reference: Vol. 39, 2020, No. 5, pp. 881–906

doi: 10.31577/cai_2020_5_881

Computing and Informatics

formerly Computers and Artificial Intelligence
