
Efficient Classification of Noisy Text

Mita K. Dalal

Abstract


Textual content makes up a significant volume of the data generated online every day. Web-generated text often contains high levels of noise arising from a variety of factors. Developing efficient systems for automatic classification of noisy data is therefore a crucial task in text mining. This paper examines a technique for classifying noisy text that is based on multiple feature selection and supervised learning. The main aim of the paper is to examine how efficiently the classification approach performs as the word error rate increases, using both standard and web-based data sets. Empirical evaluation indicates that the approach is efficient and reliable in the presence of noise.
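
To make "increasing levels of word error rate" concrete, the following is a minimal sketch of how noise at a chosen rate could be injected into a tokenized text and then measured. The noise model (transposing two adjacent characters in a word), the function names `corrupt_words` and `word_error_rate`, and the sample sentence are all assumptions made for illustration; the abstract does not describe the noise procedure actually used in the paper.

```python
import random

def corrupt_words(tokens, error_rate, seed=0):
    """Return a copy of tokens with about error_rate of the words altered.

    Hypothetical noise model (not taken from the paper): each chosen
    word has two differing adjacent characters transposed.
    """
    rng = random.Random(seed)
    n_errors = round(error_rate * len(tokens))
    # Only words containing two differing adjacent characters can be
    # changed by a transposition, so restrict the choice to those.
    candidates = [i for i, w in enumerate(tokens)
                  if any(a != b for a, b in zip(w, w[1:]))]
    chosen = rng.sample(candidates, min(n_errors, len(candidates)))
    noisy = list(tokens)
    for i in chosen:
        w = noisy[i]
        positions = [k for k in range(len(w) - 1) if w[k] != w[k + 1]]
        k = rng.choice(positions)
        noisy[i] = w[:k] + w[k + 1] + w[k] + w[k + 2:]
    return noisy

def word_error_rate(reference, hypothesis):
    """Substitution-only word error rate for equal-length token lists."""
    if len(reference) != len(hypothesis):
        raise ValueError("this sketch only handles equal-length sequences")
    if not reference:
        return 0.0
    wrong = sum(r != h for r, h in zip(reference, hypothesis))
    return wrong / len(reference)

clean = "the match ended with a late goal from the home team".split()
noisy = corrupt_words(clean, error_rate=0.3)
print(word_error_rate(clean, noisy))
```

In an evaluation of the kind the abstract outlines, a classifier would be tested on copies of the data corrupted at progressively higher rates, and its accuracy compared across those levels. This sketch counts substitutions only; a general word error rate also accounts for inserted and deleted words.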

Keywords


text classification, lexical noise, machine learning, naïve Bayesian classification, acronym expansion, feature selection




