Feature Selection and the Preservation of Infrequent and Highly Significant Attributes in the Context of Arabic Text Mining

Saeed Raheel

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 106 - Issue 3

Published: November 2014

10.5120/18503-9572

Saeed Raheel. Feature Selection and the Preservation of Infrequent and Highly Significant Attributes in the Context of Arabic Text Mining. International Journal of Computer Applications. 106, 3 (November 2014), 31-36. DOI=10.5120/18503-9572

@article{10.5120/18503-9572,
  author    = {Saeed Raheel},
  title     = {Feature Selection and the Preservation of Infrequent and Highly Significant Attributes in the Context of Arabic Text Mining},
  journal   = {International Journal of Computer Applications},
  year      = {2014},
  volume    = {106},
  number    = {3},
  pages     = {31-36},
  doi       = {10.5120/18503-9572},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2014
%A Saeed Raheel
%T Feature Selection and the Preservation of Infrequent and Highly Significant Attributes in the Context of Arabic Text Mining
%J International Journal of Computer Applications
%V 106
%N 3
%P 31-36
%R 10.5120/18503-9572
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Effective feature selection is a key component in building an efficient automatic document classifier. Arabic literature, especially scientific literature, regularly contains infrequent non-Arabic words that are routinely eliminated during the pre-processing phase. Although infrequent, those words are highly pertinent to their documents and can thus contribute to building a more efficient classification model and reinforce the subjectivity of the decision taken by the classifier. Therefore, we propose in this paper four different feature selection solutions that preserve a maximum number of those words while achieving satisfactory classification accuracy.
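To make the problem concrete, the following minimal sketch shows how ordinary document-frequency pruning would discard a rare non-Arabic term, and how an exemption could keep it. It is an illustration only and is not one of the four solutions proposed in the paper, which the abstract does not detail. The function name `select_features`, the `min_df` threshold, the toy corpus, and the rule "Latin-script tokens are always kept" are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption for this sketch: a "non-Arabic" token is one made of Latin letters only.
LATIN_TOKEN = re.compile(r"^[A-Za-z]+$")


def select_features(docs, min_df=2):
    """Return the vocabulary kept after document-frequency pruning.

    Tokens that occur in fewer than `min_df` documents are dropped,
    except Latin-script tokens, which are preserved regardless of frequency.
    """
    df = Counter()
    for doc in docs:
        # Count each token once per document (document frequency).
        df.update(set(doc.split()))
    return {
        tok
        for tok, n in df.items()
        if n >= min_df or LATIN_TOKEN.match(tok)
    }


# Toy corpus (hypothetical): the rare Latin-script term "SVM" appears in one document only.
docs = [
    "تصنيف النصوص باستخدام SVM",
    "تصنيف الوثائق العربية",
    "معالجة النصوص العربية",
]

vocab = select_features(docs, min_df=2)
```

With a plain frequency cut-off, "SVM" would be removed together with the other single-occurrence tokens; with the exemption it survives alongside the frequent Arabic terms. A real system would need a more careful definition of which infrequent words are significant.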

Index Terms

Computer Science

Information Sciences

Keywords

Arabic Text Mining, Machine Learning, Dimensionality Reduction, Automatic Classification