Chi-Square Statistic and Principal Component Analysis Based Compressed Feature Selection Approach for Naïve Bayesian Classifier
Many machine learning algorithms assume that attributes are independent, yet they are often used in domains where that assumption does not hold. The Naïve Bayesian (NB) classifier assumes that all features are conditionally independent given the class label. In this paper, attribute dependencies were analyzed using the Chi-Square test, and Principal Component Analysis (PCA) was carried out on the whole dataset to obtain a set of features. We also applied PCA to the independent attributes of the data to obtain a more compressed feature set that may yield reliable accuracy for the NB classifier. The classifier's performance was evaluated for the combined approach and for the Chi-Square-only and PCA-only approaches, across varying dataset sizes. For the dataset used, reducing dimensionality according to the Chi-Square independence test, and likewise the combined approach, performed much better than the PCA-only approach; when time is also taken into account, the combined approach is the better of the two.
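The Chi-Square independence check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `chi_square_statistic`, the toy attributes, and the 0.05 significance level are assumptions chosen for illustration, and the PCA and NB steps are not shown. The sketch computes the Pearson Chi-Square statistic for a pair of categorical attributes and compares it with the standard critical value for one degree of freedom.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Return (statistic, degrees_of_freedom) for two categorical attributes."""
    n = len(xs)
    row_totals = Counter(xs)          # marginal counts of the first attribute
    col_totals = Counter(ys)          # marginal counts of the second attribute
    observed = Counter(zip(xs, ys))   # joint counts (contingency table cells)

    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            # Expected cell count under the independence hypothesis.
            expected = r_count * c_count / n
            stat += (observed[(r, c)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Chi-Square critical value at alpha = 0.05 for 1 degree of freedom.
CRITICAL_0_05_DF1 = 3.841

# Hypothetical toy attributes (not taken from the paper's dataset).
a = [0] * 50 + [1] * 50
b = [0] * 40 + [1] * 10 + [0] * 10 + [1] * 40   # strongly associated with a
c = [0, 1] * 50                                  # unrelated to a

stat_ab, dof_ab = chi_square_statistic(a, b)
stat_ac, dof_ac = chi_square_statistic(a, c)

# Attributes whose statistic exceeds the critical value are treated as dependent.
a_b_dependent = stat_ab > CRITICAL_0_05_DF1
a_c_dependent = stat_ac > CRITICAL_0_05_DF1
```

In a pipeline of the kind the abstract outlines, attributes judged independent by such a test would then be passed on to PCA; how the paper sets its threshold and handles continuous attributes is not stated here.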
Cite this Article
Biprodip Pal, Sadia Zaman, Md. Abu Hasan et al. Chi-Square Statistic and Principal Component Analysis Based Compressed Feature Selection Approach for Naïve Bayesian Classifier. Journal of Artificial Intelligence Research & Advances. 2015; 2(2): 16–23p.