A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix

Midhun Mathew; Shine N Das; T R Lakshmi Narayanan; Pramod K Vijayaraghavan

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 19 - Issue 7

Published: April 2011

10.5120/2374-3128

Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan . A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix. International Journal of Computer Applications. 19, 7 (April 2011), 16-21. DOI=10.5120/2374-3128

                        @article{ 10.5120/2374-3128,
                        author  = { Midhun Mathew,Shine N Das,T R Lakshmi Narayanan,Pramod K Vijayaraghavan },
                        title   = { A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix },
                        journal = { International Journal of Computer Applications },
                        year    = { 2011 },
                        volume  = { 19 },
                        number  = { 7 },
                        pages   = { 16-21 },
                        doi     = { 10.5120/2374-3128 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }

                        %0 Journal Article
                        %D 2011
                        %A Midhun Mathew
                        %A Shine N Das
                        %A T R Lakshmi Narayanan
                        %A Pramod K Vijayaraghavan
                        %T A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix
                        %J International Journal of Computer Applications
                        %V 19
                        %N 7
                        %P 16-21
                        %R 10.5120/2374-3128
                        %I Foundation of Computer Science (FCS), NY, USA

Abstract

The enormous volume of web documents has weakened the performance and reliability of web search engines. The existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Web content mining faces serious problems due to duplicate and near-duplicate web pages. These pages increase either the index storage space or the serving costs, thereby irritating users. Near-duplicate detection has been recognized as an important problem in plagiarism detection, spam detection, and focused web crawling. Here we propose a novel approach for finding near-duplicates of an input web page in a huge repository. We propose a TDW (Term-Document-Weight) matrix based algorithm with three phases: rendering, filtering, and verification. The rendering phase receives an input web page and a threshold; the filtering phase applies prefix filtering and positional filtering to reduce the size of the record set; and the verification phase calculates similarity and returns an optimal set of near-duplicate web pages. The experimental results show that our algorithm performs better on two benchmark measures, precision and recall, and reduces the size of the competing record set.
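
The filtering and verification phases can be illustrated with a minimal sketch. It is not the algorithm of the paper: the abstract does not specify the TDW weighting or the similarity measure, so the sketch assumes each page is already reduced to a list of tokens, uses Jaccard similarity, and applies only the standard prefix-filtering principle for set-similarity search (positional filtering is omitted). The names `jaccard`, `prefix`, and `near_duplicates` are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens, order, threshold):
    """Return the prefix of the record under a global token order.

    Tokens are sorted rarest-first. For a Jaccard threshold t, a record
    of n distinct tokens keeps its first n - ceil(t * n) + 1 tokens.
    Two records with similarity >= t must share a token in their prefixes.
    """
    ranked = sorted(set(tokens), key=lambda tok: (order[tok], tok))
    n = len(ranked)
    # Small epsilon guards against float error such as 0.7 * 10 > 7.
    keep = n - math.ceil(threshold * n - 1e-9) + 1
    return ranked[:keep]


def near_duplicates(query, repository, threshold):
    """Filter candidates by prefix overlap, then verify by similarity."""
    # Document frequency over the repository plus the query defines
    # the global order (an assumption; any fixed total order works).
    docs = list(repository.values()) + [query]
    df = Counter(tok for doc in docs for tok in set(doc))

    query_prefix = set(prefix(query, df, threshold))
    results = []
    for doc_id, tokens in repository.items():
        # Filtering phase: skip records whose prefix shares no token.
        if not query_prefix & set(prefix(tokens, df, threshold)):
            continue
        # Verification phase: compute the exact similarity.
        sim = jaccard(query, tokens)
        if sim >= threshold:
            results.append((doc_id, sim))
    return sorted(results, key=lambda item: -item[1])


repository = {
    "a": "the quick brown fox jumps over the lazy dog".split(),
    "b": "the quick brown fox leaps over the lazy dog".split(),
    "c": "completely unrelated text about cooking pasta".split(),
}
query = "the quick brown fox jumps over a lazy dog".split()
matches = near_duplicates(query, repository, threshold=0.8)
```

In this toy repository, pages "b" and "c" are discarded during filtering without any similarity computation, which is the reduction in the competing record set that the filtering phase aims for.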

Index Terms

Computer Science

Information Sciences

Keywords

Near-Duplicate Detection, Term-Document-Weight Matrix, Prefix Filtering, Positional Filtering, Singular Value Decomposition