Research Article

An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models

by Shrijina Sreenivasan, B. Lakshmipathi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 63 - Issue 4
Published: February 2013
10.5120/10455-5163

Shrijina Sreenivasan, B. Lakshmipathi. An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models. International Journal of Computer Applications. 63, 4 (February 2013), 33-37. DOI=10.5120/10455-5163

Abstract

With the massive use of the internet and search engines, web spam has become a major problem. Web spam can be detected by analyzing various features of web pages and classifying each page as spam or non-spam. The proposed work uses unsupervised learning algorithms to characterize web pages by their link-based and content-based features, comparing the different sources of information in the source page and the target page. The first unsupervised technique considered is the Hidden Markov Model (HMM), which captures the different browsing patterns of users: users may reach a page not only by following direct hyperlinks but also by typing URLs or by opening multiple windows. Unsupervised techniques have no predefined classes to map outcomes to. Instead, they estimate the probabilities of all possible relations between the source and target page. This helps to attain higher efficiency in the detection of web spam even when the dataset is small. The other unsupervised methods applied to the same task are the Self-Organizing Map (SOM) and Adaptive Resonance Theory (ART). Finally, the performance of all the techniques is evaluated and presented in increasing order of their performance metric.
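
The abstract does not give the exact content features, so the following is only a minimal sketch of one way a difference between a source page and a target page could be quantified: the Kullback-Leibler divergence between smoothed unigram language models. The function names, the whitespace tokenizer, and the additive smoothing constant `alpha` are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over `vocab` (hypothetical helper)."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text):
    """Divergence between text taken from the source page and from the target page."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocab)
    q = unigram_model(target_text, vocab)
    return kl_divergence(p, q)


if __name__ == "__main__":
    anchor = "cheap flights to paris"
    related = "book cheap flights to paris and rome"
    unrelated = "online casino poker bonus"
    # A larger value suggests the link text and the target content disagree more.
    print(link_divergence(anchor, related))
    print(link_divergence(anchor, unrelated))
```

In this sketch a higher divergence would be one possible signal that a link points to unrelated content. Such a score could then serve as one input feature for an unsupervised method such as those named in the abstract.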

Index Terms
Computer Science
Information Sciences
Keywords

Link analysis, Unsupervised Learning Techniques, Web spam Detection
