A Hybrid Algorithm for Duplicate Document Detection

Ashish Kumar, Arun Solanki


Identifying duplicate documents within a collection is a major problem in information retrieval. Considerable research in recent years has produced many methods for detecting and removing duplicate documents, but the relevance of their results remains an issue. This paper proposes a hybrid algorithm based on word position that integrates an n-gram searching technique. The approach also uses an inverted index to reduce time and space complexity and to speed up searching. The output is a ranked set of documents, which helps the user avoid wasting time retrieving unwanted documents.
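
As a rough illustration of how word positions, n-grams, and an inverted index can be combined, the following minimal sketch may help. It is not the algorithm proposed in this paper: the function names, the choice of word-level trigrams, and the overlap-based score are all assumptions made for illustration only.

```python
from collections import defaultdict

def word_ngrams(text, n=3):
    """Return (position, n-gram) pairs over lowercase whitespace tokens."""
    tokens = text.lower().split()
    return [(i, tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]

def build_index(docs, n=3):
    """Map each n-gram to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, gram in word_ngrams(text, n):
            index[gram][doc_id].append(pos)
    return index

def rank_duplicates(query, index, n=3):
    """Rank documents by the fraction of the query's distinct n-grams they contain.

    The score is an assumed stand-in for a relevance measure; the stored
    positions are not used here but would allow position-aware scoring.
    """
    grams = {gram for _, gram in word_ngrams(query, n)}
    if not grams:
        return []
    hits = defaultdict(int)
    for gram in grams:
        # .get avoids creating empty entries in the defaultdict
        for doc_id in index.get(gram, {}):
            hits[doc_id] += 1
    scores = {doc_id: count / len(grams) for doc_id, count in hits.items()}
    # Highest score first; ties broken by document id
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "a": "the quick brown fox jumps",
    "b": "the quick brown fox sleeps",
    "c": "totally different text here now",
}
index = build_index(docs)
ranked = rank_duplicates("the quick brown fox jumps", index)
```

In this sketch only the documents that share at least one n-gram with the query are ever touched, which is the sense in which an inverted index can avoid comparing the query against every document in the collection.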


Duplicate document, inverted index, N-gram, document relevance

