Parallel Near-Duplicate Document Detection Using General-Purpose GPU

keywords: Near-duplicate, document, Shingling, similarity, locality-sensitive hashing, MinHash, fingerprint, parallelism, GPU, CUDA
In today's data-rich world, efficiently identifying near-duplicate data is a significant challenge, especially when integrating data from multiple sources. Near-duplicate document detection applies to any kind of content and is widely used to improve search-engine efficiency, detect plagiarism and spam, and so on. Even smaller or specialized search engines can benefit from knowing which documents are near-duplicates. Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, there have been few attempts to improve the performance of MinHash, a locality-sensitive hashing technique. In this paper, we propose a parallel implementation of the MinHash algorithm for near-duplicate document detection that exploits the massive parallelism of general-purpose GPUs. Experimental results show that the GPU-based parallel solution is far more cost-effective than the CPU-based sequential and parallel solutions.
mathematics subject classification 2000: 68W10
reference: Vol. 43, 2024, No. 3, pp. 583–610
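
For readers unfamiliar with the two techniques named in the abstract, the following is a minimal sequential sketch of word-level shingling followed by MinHash, written in Python for readability. It is not the paper's GPU implementation: the function names, the shingle width k=3, the 128 hash functions, the CRC32 shingle encoding, and the affine hash family modulo a Mersenne prime are all illustrative assumptions.

```python
import random
import zlib

# Assumed modulus for the illustrative affine hash family h(x) = (a*x + b) mod p.
MERSENNE_PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_hashes=128, seed=0):
    """Compute a MinHash signature: one minimum per hash function.

    The same seed must be used for every document so that signatures
    are comparable position by position.
    """
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]
    # Map each shingle to a stable 32-bit integer (hypothetical encoding choice).
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    if not ids:
        return [MERSENNE_PRIME] * num_hashes
    # Each of the num_hashes minima is independent of the others.
    return [min((a * x + b) % MERSENNE_PRIME for x in ids) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| for comparison."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    sh_a, sh_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(sh_a), minhash_signature(sh_b)
    print("exact:", exact_jaccard(sh_a, sh_b))
    print("estimate:", estimated_jaccard(sig_a, sig_b))
```

In this sketch every signature position is computed independently of the others, and every document independently of every other document. That independence is the sort of structure a data-parallel device could exploit, although the abstract itself does not describe how the paper's CUDA implementation partitions the work.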