Evaluation and Implementation of n-Gram-Based Algorithm for Fast Text Comparison

Maciej Wielgosz
AGH University of Science and Technology, Mickiewicza 30 Av., 30-059 Krakow, Poland

Paweł Szczepka
AGH University of Science and Technology, Mickiewicza 30 Av., 30-059 Krakow, Poland

Paweł Russek
AGH University of Science and Technology, Mickiewicza 30 Av., 30-059 Krakow, Poland

Ernest Jamro
AGH University of Science and Technology, Mickiewicza 30 Av., 30-059 Krakow, Poland

Kazimierz Wiatr
AGH University of Science and Technology, Mickiewicza 30 Av., 30-059 Krakow, Poland

Marcin Pietroń
ACC Cyfronet AGH, Nawojki 11, 30-950 Krakow, Poland

Dominik Żurek
ACC Cyfronet AGH, Nawojki 11, 30-950 Krakow, Poland

keywords: Text similarity analysis, n-gram-based model, GPGPU implementation, multi-CPU implementation

This paper presents a study of an n-gram-based document comparison method intended as the basis of a large-scale plagiarism detection system. The work addresses not only the quality of text similarity extraction but also the execution performance of the implemented algorithms. We examine the detection performance, storage requirements, and execution time of the proposed approach; the results show the trade-offs between detection quality and computational requirements. GPGPU and multi-CPU platforms were considered as implementation targets in order to achieve good execution speed. The method consists of two main algorithms: document feature extraction and fast text comparison. The winnowing algorithm is used to generate a compressed representation of the analyzed documents. The authors designed and implemented a dedicated test framework for the algorithm, which allowed its parameters to be tuned, evaluated, and optimized. Well-known metrics (e.g. precision and recall) were used to evaluate detection performance. Tests were conducted to determine the performance of the winnowing algorithm on obfuscated and unobfuscated texts for different window and n-gram sizes. In addition, a simplified version of the text comparison algorithm was proposed and evaluated in order to reduce the computational complexity of the comparison process. The paper also presents GPGPU and multi-CPU implementations of the algorithms for different data structures. Implementation speed was tested for different algorithm parameters and data sizes, and the scalability of the algorithm on multi-CPU platforms was verified. The authors provide a repository of the software tools and programs used in the experiments.
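To make the two stages named in the abstract concrete, the following is a minimal Python sketch of n-gram hashing followed by winnowing, and of a comparison of the resulting fingerprints. It is an illustration under stated assumptions, not the authors' implementation: the function names, the character-level n-grams, the truncated MD5 hash, the default values of `n` and `window`, and the Jaccard score used for comparison are all hypothetical choices that the abstract does not specify.

```python
import hashlib


def ngram_hashes(text, n):
    """Hash every character n-gram of the normalized text.

    Assumption: normalization keeps only lowercase alphanumerics, and the
    hash is the first 4 bytes of MD5; neither choice comes from the paper.
    """
    normalized = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = []
    for i in range(len(normalized) - n + 1):
        gram = normalized[i:i + n]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash of each sliding window (rightmost on ties).

    Returns a set of (hash, position) pairs, so a minimum shared by
    several consecutive windows is recorded only once.
    """
    selected = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        min_val = min(win)
        # Index of the rightmost occurrence of the minimum in this window.
        offset = window - 1 - win[::-1].index(min_val)
        selected.add((min_val, start + offset))
    return selected


def fingerprint(text, n=5, window=4):
    """Compressed representation of a document: the set of selected hashes."""
    return {h for h, _ in winnow(ngram_hashes(text, n), window)}


def similarity(text_a, text_b, n=5, window=4):
    """Jaccard overlap of two fingerprints (an assumed comparison measure)."""
    fa = fingerprint(text_a, n, window)
    fb = fingerprint(text_b, n, window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

In this sketch, larger values of `window` select fewer hashes, so the fingerprint takes less storage and compares faster but captures less of the text, which is the kind of trade-off between detection quality and computational requirements that the abstract refers to.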

reference: Vol. 36, 2017, No. 4, pp. 887–907

doi: 10.4149/cai_2017_4_887

Computing and Informatics

formerly Computers and Artificial Intelligence
