A Hybrid Algorithm for Duplicate Document Detection

Ashish Kumar; Arun Solanki

Abstract

Identifying duplicate documents within a document collection is a significant problem in information retrieval. Many methods have been proposed in recent years to detect and remove duplicate documents, but the relevance of their results remains an issue. This paper proposes a hybrid algorithm based on word position that integrates an n-gram searching technique. The approach also uses an inverted index to reduce time and space complexity and to speed up searching. The results are returned as a ranked set of documents, which helps the user avoid spending time on unwanted documents.
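
The abstract gives no implementation details, so the following is only a minimal sketch of how the ingredients it names (word positions, n-grams, an inverted index, and a ranked result) might fit together. The function names, the whitespace tokenizer, the trigram default, and the scoring rule (fraction of the query's n-grams found at consecutive positions in a document) are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper may differ.
    return text.lower().split()


def build_positional_index(docs):
    """Build an inverted index mapping word -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index


def ngram_hits(index, ngram):
    """Return ids of documents holding the n-gram's words at consecutive positions."""
    hits = set()
    for doc_id, positions in index.get(ngram[0], {}).items():
        for p in positions:
            # Every following word must sit exactly k positions after the first.
            if all(
                p + k in index.get(word, {}).get(doc_id, ())
                for k, word in enumerate(ngram[1:], start=1)
            ):
                hits.add(doc_id)
                break
    return hits


def rank_duplicates(index, query_text, n=3):
    """Rank indexed documents by the share of the query's n-grams they contain."""
    tokens = tokenize(query_text)
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not ngrams:
        return []
    counts = defaultdict(int)
    for ngram in ngrams:
        for doc_id in ngram_hits(index, ngram):
            counts[doc_id] += 1
    scored = [(doc_id, count / len(ngrams)) for doc_id, count in counts.items()]
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

In this sketch the index is consulted only for words that occur in the query, so documents sharing no n-gram with it are never scored, and a score of 1.0 means every query n-gram was found in that document.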

Cite this Article
Ashish Kumar, Arun Solanki. A Hybrid Algorithm for Duplicate Document Detection. Journal of Advanced Database Management & Systems. 2015; 2(2): 24–34p.

Keywords

Duplicate document, inverted index, N-gram, document relevance
