International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 66 - Issue 21
Published: March 2013
Authors: Shrawan Kumar Trivedi, Shubhamoy Dey
Shrawan Kumar Trivedi, Shubhamoy Dey. Effect of Various Kernels and Feature Selection Methods on SVM Performance for Detecting Email Spams. International Journal of Computer Applications. 66, 21 (March 2013), 18-23. DOI=10.5120/11240-6433
@article{10.5120/11240-6433,
  author    = {Shrawan Kumar Trivedi and Shubhamoy Dey},
  title     = {Effect of Various Kernels and Feature Selection Methods on SVM Performance for Detecting Email Spams},
  journal   = {International Journal of Computer Applications},
  year      = {2013},
  volume    = {66},
  number    = {21},
  pages     = {18-23},
  doi       = {10.5120/11240-6433},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2013
%A Shrawan Kumar Trivedi
%A Shubhamoy Dey
%T Effect of Various Kernels and Feature Selection Methods on SVM Performance for Detecting Email Spams
%J International Journal of Computer Applications
%V 66
%N 21
%P 18-23
%R 10.5120/11240-6433
%I Foundation of Computer Science (FCS), NY, USA
This research presents the effects of the interaction between various kernel functions and feature selection techniques on the learning capability of the Support Vector Machine (SVM) in detecting email spam. The interaction of four SVM kernel functions, i.e. the "Normalised Polynomial Kernel (NP)", "Polynomial Kernel (PK)", "Radial Basis Function Kernel (RBF)", and "Pearson VII Function-Based Universal Kernel (PUK)", with three feature selection techniques, i.e. "Gain Ratio (GR)", "Chi-Squared (χ²)", and "Latent Semantic Indexing (LSI)", has been tested on the "Enron Email Data Set". The results reveal some interesting facts about how the performance of the kernel functions varies with the number of features (or dimensions) in the data. NP performs the best across a wide range of dimensionality for all the feature selection techniques tested. The PUK kernel works well with low-dimensional data and is second best in performance (after NP), but performs poorly on high-dimensional data. Latent Semantic Indexing (LSI) appears to be the best among all the tested feature selection techniques. However, for high-dimensional data, all the feature selection techniques perform almost equally well.
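As a rough illustration of the kind of pipeline the abstract describes, the sketch below pairs two feature selection techniques (Chi-squared and an LSI-style truncated SVD) with two SVM kernels (RBF and polynomial) using scikit-learn. This is not the authors' implementation: the corpus, parameter values, and variable names are placeholders, and the NP and PUK kernels are not built into scikit-learn (though a custom kernel function can be passed to SVC as a callable).

```python
# Illustrative sketch only (not the authors' setup): pairing feature
# selection techniques with different SVM kernels on a tiny placeholder
# email corpus. Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD  # LSI-style dimensionality reduction
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Placeholder corpus standing in for the Enron emails (1 = spam, 0 = ham).
texts = [
    "free offer click now to win a prize",
    "cheap meds available order now",
    "meeting agenda attached for tomorrow",
    "please review the quarterly report draft",
]
labels = [1, 1, 0, 0]

def build_pipeline(selector, kernel, **svm_params):
    """Combine one feature-selection step with one SVM kernel."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("reduce", selector),
        ("svm", SVC(kernel=kernel, **svm_params)),
    ])

# Chi-squared keeps the k highest-scoring terms; TruncatedSVD plays the
# role of Latent Semantic Indexing by projecting onto latent "topics".
selectors = {
    "chi2": SelectKBest(chi2, k=4),
    "lsi": TruncatedSVD(n_components=2),
}
# NP and PUK kernels would have to be supplied as custom callables,
# e.g. SVC(kernel=my_puk_function); only built-in kernels are shown here.
kernels = {"rbf": {}, "poly": {"degree": 2}}

for sel_name, selector in selectors.items():
    for ker_name, params in kernels.items():
        clf = build_pipeline(selector, ker_name, **params)
        clf.fit(texts, labels)  # on a real corpus, use cross-validation instead
        print(sel_name, ker_name, "training accuracy:", clf.score(texts, labels))
```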