Parallel Near-Duplicate Document Detection Using General-Purpose GPU

Dimitar Peshevski

Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1020 Skopje, North Macedonia

Vladimir Zdraveski

Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1020 Skopje, North Macedonia

Sashko Ristov

Department of Computer Science, University of Innsbruck, Technikerstraße 21a, A-6020 Innsbruck, Austria


keywords: Near-duplicate, document, shingling, similarity, locality-sensitive hashing, MinHash, fingerprint, parallelism, GPU, CUDA

In today's data-rich world, one of the most significant challenges is efficiently identifying near-duplicate data, especially when integrating data from various sources. Near-duplicate document detection applies to any kind of content and is widely used to improve search engine efficiency and to identify plagiarism or spam. Even smaller or specialized search engines can benefit from knowledge about near-duplicate documents. Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, there have been few attempts to improve the performance of MinHash, a locality-sensitive hashing technique. In this paper, we propose a parallel implementation of the MinHash algorithm for near-duplicate document detection that exploits the massive parallelism offered by general-purpose GPUs. Experimental results show that the GPU-based parallel solution is far more cost-effective than the CPU-based sequential and parallel solutions.

mathematics subject classification 2000: 68W10

reference: Vol. 43, 2024, No. 3, pp. 583–610

doi: 10.31577/cai_2024_3_583

Computing and Informatics

formerly Computers and Artificial Intelligence
