Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text

Mohammed M. Abu Tair; Rebhi S. Baraka

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 75 - Issue 3

Published: August 2013

DOI: 10.5120/13090-0370

Mohammed M. Abu Tair, Rebhi S. Baraka. Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text. International Journal of Computer Applications. 75, 3 (August 2013), 13-20. DOI=10.5120/13090-0370

@article{10.5120/13090-0370,
  author    = {Mohammed M. Abu Tair and Rebhi S. Baraka},
  title     = {Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text},
  journal   = {International Journal of Computer Applications},
  year      = {2013},
  volume    = {75},
  number    = {3},
  pages     = {13-20},
  doi       = {10.5120/13090-0370},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2013
%A Mohammed M. Abu Tair
%A Rebhi S. Baraka
%T Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text
%J International Journal of Computer Applications
%V 75
%N 3
%P 13-20
%R 10.5120/13090-0370
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Text classification has become one of the most important techniques in text mining, and a number of machine learning algorithms have been introduced for automatic text classification. One of the most common is the k-NN algorithm, which is known to be among the best classifiers for several languages, including Arabic. However, k-NN is inefficient because it requires a large amount of computational power. This drawback makes it unsuitable for large volumes of high-dimensional text documents, particularly in Arabic. This paper introduces a high-performance parallel classifier for large-scale Arabic text that achieves an enhanced level of speedup, scalability, and accuracy. The parallel classifier is based on the sequential k-NN algorithm. It has been tested on the OSAC corpus, and its performance has been studied on a multicomputer cluster. The results indicate that the parallel classifier has very good speedup and scalability and is capable of handling large document collections with improved classification results.
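The abstract does not give the parallel design, so the following is only a minimal sketch of one common way to parallelize k-NN, not the authors' implementation. It assumes the training set is split into disjoint partitions, each worker finds its local k nearest neighbours by cosine similarity, and the local lists are merged before a majority vote. The function names (`cosine`, `local_top_k`, `parallel_knn`), the toy English term vectors, and the use of a thread pool as a stand-in for cluster nodes are all illustrative assumptions.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def local_top_k(partition, query, k):
    """Return the k most similar (similarity, label) pairs in one partition."""
    scored = ((cosine(vec, query), label) for vec, label in partition)
    return heapq.nlargest(k, scored)


def parallel_knn(train, query, k=3, workers=2):
    """Classify query by majority vote over the global k nearest neighbours."""
    # Split the training set into roughly equal, disjoint partitions.
    parts = [train[i::workers] for i in range(workers)]
    # Assumption: threads stand in for the nodes of a multicomputer cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda p: local_top_k(p, query, k), parts)
        candidates = [pair for top in partial for pair in top]
    # Merge step: the global top k is contained in the union of local top-k lists.
    nearest = heapq.nlargest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): term-weight vectors with category labels.
train = [
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"team": 1.0, "goal": 1.0}, "sports"),
    ({"market": 1.0, "stock": 1.0}, "economy"),
    ({"bank": 1.0, "market": 1.0}, "economy"),
]
query = {"goal": 1.0, "team": 1.0}
print(parallel_knn(train, query, k=3, workers=2))
```

Because every partition returns its own k best candidates, the merged result matches what a sequential scan over the whole training set would select, which is why this kind of decomposition can improve running time without changing the classification. Speedup on p workers is conventionally measured as the sequential running time divided by the parallel running time.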

Index Terms

Computer Science

Information Sciences

Keywords

Arabic text classification, k-NN algorithm, parallel classifier, multicomputer cluster