An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models

Shrijina Sreenivasan; B. Lakshmipathi

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 63 - Issue 4

Published: February 2013

10.5120/10455-5163

Shrijina Sreenivasan, B. Lakshmipathi. An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models. International Journal of Computer Applications. 63, 4 (February 2013), 33-37. DOI=10.5120/10455-5163

                        @article{ 10.5120/10455-5163,
                        author  = { Shrijina Sreenivasan and B. Lakshmipathi },
                        title   = { An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models },
                        journal = { International Journal of Computer Applications },
                        year    = { 2013 },
                        volume  = { 63 },
                        number  = { 4 },
                        pages   = { 33-37 },
                        doi     = { 10.5120/10455-5163 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }

                        %0 Journal Article
                        %D 2013
                        %A Shrijina Sreenivasan
                        %A B. Lakshmipathi
                        %T An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models
                        %J International Journal of Computer Applications
                        %V 63
                        %N 4
                        %P 33-37
                        %R 10.5120/10455-5163
                        %I Foundation of Computer Science (FCS), NY, USA

Abstract

With the widespread use of the internet and search engines, web spam has become a major problem. Web spam can be detected by analyzing the features of web pages and classifying each page as spam or non-spam. The proposed work uses unsupervised learning algorithms to characterize web pages by their link-based and content-based features, comparing the differences between the various sources of information in the source page and the target page. The first unsupervised technique considered is the Hidden Markov Model, which captures the different browsing patterns of users: users may not only follow direct hyperlinks but may also jump from one page to another by typing URLs or by opening multiple windows. Unsupervised techniques have no predefined classes to map outcomes to. As a result, they estimate all possible probabilities of a relation between the source and target page, which helps to achieve higher efficiency in web spam detection even when the dataset is small. The other unsupervised methods applied to the same task are the Self-Organizing Map (SOM) and Adaptive Resonance Theory (ART). Finally, all the techniques are evaluated and presented in increasing order of their performance metric.
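
The abstract does not say how the difference between the sources of information in the source and target page is measured. As an illustration only, the sketch below assumes a Kullback-Leibler divergence between add-alpha smoothed unigram language models, a common choice in language-model-based link analysis. The function names, the smoothing constant `alpha`, and the sample texts are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Return an add-alpha smoothed unigram distribution over `vocabulary`."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in bits; both distributions must share the same support."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def source_target_divergence(source_text, target_text):
    """Divergence between the language models of two text units."""
    # Shared vocabulary so that smoothing gives every word nonzero mass.
    vocabulary = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocabulary)
    q = unigram_model(target_text, vocabulary)
    return kl_divergence(p, q)


# Hypothetical example: anchor text on a source page versus two target pages.
anchor = "cheap flights to paris"
related_page = "book cheap flights to paris and other european cities"
unrelated_page = "win money now casino poker bonus free spins"

related_score = source_target_divergence(anchor, related_page)
unrelated_score = source_target_divergence(anchor, unrelated_page)
```

Under these assumptions, a target page whose content matches the anchor text yields a lower divergence than an unrelated one. Scores of this kind could serve as content-based features for the unsupervised models named in the abstract.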

Index Terms

Computer Science

Information Sciences

Keywords

Link analysis, Unsupervised Learning Techniques, Web spam Detection