International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 19 - Issue 7
Published: April 2011
Authors: Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan
Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan. A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix. International Journal of Computer Applications. 19, 7 (April 2011), 16-21. DOI=10.5120/2374-3128
@article{ 10.5120/2374-3128, author = { Midhun Mathew,Shine N Das,T R Lakshmi Narayanan,Pramod K Vijayaraghavan }, title = { A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix }, journal = { International Journal of Computer Applications }, year = { 2011 }, volume = { 19 }, number = { 7 }, pages = { 16-21 }, doi = { 10.5120/2374-3128 }, publisher = { Foundation of Computer Science (FCS), NY, USA } }
%0 Journal Article %D 2011 %A Midhun Mathew %A Shine N Das %A T R Lakshmi Narayanan %A Pramod K Vijayaraghavan %T A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix %J International Journal of Computer Applications %V 19 %N 7 %P 16-21 %R 10.5120/2374-3128 %I Foundation of Computer Science (FCS), NY, USA
The voluminous amount of web documents has weakened the performance and reliability of web search engines. The existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Web content mining faces huge problems due to the existence of duplicate and near-duplicate web pages. These pages increase either the index storage space or the serving costs, thereby irritating users. Near-duplicate detection has been recognized as an important problem in plagiarism detection, spam detection and focused web crawling. Here we propose a novel idea for finding near-duplicates of an input web page from a huge repository. We propose a TDW matrix based algorithm with three phases: rendering, filtering and verification. The rendering phase receives an input web page and a threshold. The filtering phase applies prefix filtering and positional filtering to reduce the size of the candidate record set. The verification phase calculates the similarity of the remaining records and returns an optimal set of near-duplicate web pages. The experimental results show that our algorithm performs well on two benchmark measures, precision and recall, and reduces the size of the competing record set.
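The filter-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' TDW matrix algorithm: it assumes whitespace-tokenized pages, unweighted token sets, and Jaccard similarity as the verification measure, and it omits positional filtering for brevity. The names `tokenize`, `prefix`, and `near_duplicates` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: a page is reduced to a set of lowercase whitespace tokens.
    return set(text.lower().split())

def jaccard(a, b):
    # Jaccard similarity of two token sets; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def prefix(tokens, doc_freq, threshold):
    # Sort by a global order (rarest tokens first, ties broken alphabetically)
    # and keep the first |x| - ceil(t * |x|) + 1 tokens. Two sets whose Jaccard
    # similarity is at least t must share a token in these prefixes.
    ordered = sorted(tokens, key=lambda tok: (doc_freq[tok], tok))
    n = len(ordered)
    keep = n - math.ceil(threshold * n - 1e-9) + 1  # epsilon guards float error
    return set(ordered[:keep])

def near_duplicates(query, repository, threshold):
    # repository: dict mapping a page name to its text (hypothetical layout).
    docs = {name: tokenize(text) for name, text in repository.items()}
    q = tokenize(query)
    doc_freq = Counter(tok for toks in list(docs.values()) + [q] for tok in toks)
    q_prefix = prefix(q, doc_freq, threshold)
    results = {}
    for name, toks in docs.items():
        # Filtering: skip pages whose prefix shares no token with the query's.
        if not (q_prefix & prefix(toks, doc_freq, threshold)):
            continue
        # Verification: compute the exact similarity for surviving candidates.
        sim = jaccard(q, toks)
        if sim >= threshold:
            results[name] = sim
    return results

repository = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "completely unrelated text about databases",
}
result = near_duplicates("the quick brown fox jumps over the lazy dog", repository, 0.7)
print(sorted(result))
```

In this toy repository, page `c` is discarded by the prefix filter without any similarity computation, which is the kind of reduction in the candidate set that the filtering phase aims for.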