Abstract
Big Data means a very large amount of data and includes a range of methodologies such as big data collection, processing, storage, management, and analysis. Since Big Data Text Mining extracts a lot of features and data, clustering and classification can result in high computational complexity and the low reliability of the analysis results. In particular, a TDM (Term Document Matrix) obtained through text mining represents term-document features but features a sparse matrix. In this paper, the study focuses on selecting a set of optimized features from the corpus. A Genetic Algorithm (GA) is used to extract terms (features) as desired according to term importance calculated by the equation found. The study revolves around feature selection method to lower computational complexity and to increase analytical performance.We designed a new genetic algorithm to extract features in text mining. TF-IDF is used to reflect document-term relationships in feature extraction. Through the repetitive process, features are selected as many as the predetermined number. We have conducted clustering experiments on a set of spammail documents to verify and to improve feature selection performance. And we found that the proposal FSGA algorithm shown better performance of Text Clustering and Classification than using all of features.
Original language | English |
---|---|
Pages (from-to) | 22-40 |
Number of pages | 19 |
Journal | International Journal of Advances in Soft Computing and its Applications |
Volume | 7 |
Issue number | 1 |
State | Published - 2015 |
Keywords
- Big data
- Feature selection
- Genetic algorithm
- Text clustering
- Text mining