The feature selection method based on genetic algorithm for efficient of text clustering and text classification

Sung Sam Hong, Wanhee Lee, Myung Mook Han

Research output: Contribution to journal › Article › peer-review

52 Scopus citations

Abstract

Big Data refers to very large amounts of data and encompasses a range of methodologies for data collection, processing, storage, management, and analysis. Because Big Data text mining extracts a large number of features and a large volume of data, clustering and classification can incur high computational complexity and yield analysis results of low reliability. In particular, a TDM (Term Document Matrix) obtained through text mining represents term-document features but is a sparse matrix. This study focuses on selecting an optimized set of features from the corpus. A Genetic Algorithm (GA) is used to extract the desired terms (features) according to term importance calculated by the proposed equation. The aim of the feature selection method is to lower computational complexity and to increase analytical performance. We designed a new genetic algorithm to extract features in text mining. TF-IDF is used to reflect document-term relationships in feature extraction. Through an iterative process, features are selected until a predetermined number is reached. We conducted clustering experiments on a set of spam mail documents to verify and to improve feature selection performance. We found that the proposed FSGA algorithm showed better text clustering and classification performance than using all of the features.
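The abstract does not give the term-importance equation or the GA operators, so the following is only a minimal sketch of the general idea (TF-IDF weighting followed by GA-based selection of a fixed number of terms), not the authors' FSGA. Everything named here is an assumption made for illustration: the functions `tfidf_matrix` and `select_features`, the importance score (summed TF-IDF per term), the fitness function (summed importance of the selected terms), the union-based crossover, the swap mutation, and all parameter defaults.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[d][t] is the TF-IDF weight of term t in doc d."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [
            (counts[t] / len(tokens)) * math.log(n_docs / df[t]) if counts[t] else 0.0
            for t in vocab
        ]
        matrix.append(row)
    return vocab, matrix


def select_features(importance, k, pop_size=20, generations=50,
                    mutation_rate=0.1, seed=0):
    """Pick k feature indices with a simple GA (hypothetical operators and fitness)."""
    n = len(importance)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of features")
    rng = random.Random(seed)

    def fitness(chrom):
        # Assumed fitness: total importance of the selected terms.
        return sum(importance[i] for i in chrom)

    def crossover(a, b):
        # Child draws k distinct indices from the union of both parents.
        return rng.sample(sorted(set(a) | set(b)), k)

    def mutate(chrom):
        # Swap a selected index with an unselected one, keeping indices distinct.
        chrom = list(chrom)
        unused = [i for i in range(n) if i not in chrom]
        for pos in range(k):
            if unused and rng.random() < mutation_rate:
                j = rng.randrange(len(unused))
                chrom[pos], unused[j] = unused[j], chrom[pos]
        return chrom

    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: max(2, pop_size // 2)]  # fitter half survives unchanged
        children = [
            mutate(crossover(*rng.sample(elite, 2)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return sorted(max(population, key=fitness))


# Usage on a toy corpus (illustrative data, not from the paper).
docs = [
    "win free prize now",
    "free offer claim prize",
    "meeting agenda for monday",
    "project meeting notes",
]
vocab, tdm = tfidf_matrix(docs)
importance = [sum(row[j] for row in tdm) for j in range(len(vocab))]
selected_terms = [vocab[i] for i in select_features(importance, k=3)]
```

With a purely additive fitness like this one, the GA only rediscovers the top-k terms by importance; a search of this kind becomes worthwhile when the fitness depends on the selected subset as a whole, for example a clustering quality score computed on the reduced matrix.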

Original language: English
Pages (from-to): 22-40
Number of pages: 19
Journal: International Journal of Advances in Soft Computing and its Applications
Volume: 7
Issue number: 1
State: Published - 2015

Keywords

  • Big data
  • Feature selection
  • Genetic algorithm
  • Text clustering
  • Text mining
