Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations

Aleksander Cisłak

Lodz University of Technology, Institute of Applied Computer Science, Al. Politechniki 11, 90-924 Łódź, Poland

Szymon Grabowski

Lodz University of Technology, Institute of Applied Computer Science, Al. Politechniki 11, 90-924 Łódź, Poland

keywords: Fingerprint, keyword matching, approximate matching, bitwise

We aim to speed up approximate keyword matching with the use of a lightweight, fixed-size block of data for each string, called a fingerprint. Fingerprints work in a similar way to hash values; however, they can also be used for matching with errors. They store information regarding symbol occurrences using individual bits, and they can be compared against each other with a constant number of bitwise operations. In this way, certain strings can be deduced to be at least within the distance k from each other (using Hamming or Levenshtein distance) without performing an explicit verification. We show experimentally that for a preprocessed collection of strings, fingerprints can provide substantial speedups for k = 1, namely over 2.5 times for the Hamming distance and over 30 times for the Levenshtein distance. Tests were conducted on synthetic and real-world English and URL data.
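The idea of an occurrence-bit fingerprint compared with bitwise operations can be illustrated with a minimal sketch. It is not the authors' implementation: the 26-letter alphabet, the function names `fingerprint` and `may_match`, and the threshold of 2k differing bits are assumptions made for this example. The threshold follows from the observation that a single substitution changes the occurrence status of at most two letters, and a single insertion or deletion of at most one.

```python
import string

# Assumption for this sketch: one bit per lowercase ASCII letter.
# The fingerprints described in the abstract may select symbols differently.
ALPHABET = string.ascii_lowercase


def fingerprint(word: str) -> int:
    """Return an integer whose bit i is set when ALPHABET[i] occurs in word."""
    fp = 0
    for ch in word:
        idx = ALPHABET.find(ch)
        if idx >= 0:  # symbols outside the alphabet are ignored
            fp |= 1 << idx
    return fp


def may_match(fp_a: int, fp_b: int, k: int) -> bool:
    """Filter step: False means the distance certainly exceeds k.

    Each error (substitution, insertion, or deletion) flips at most two
    occurrence bits, so more than 2 * k differing bits rules out a match.
    True only means that an explicit verification is still required.
    """
    differing_bits = bin(fp_a ^ fp_b).count("1")
    return differing_bits <= 2 * k


# Hypothetical usage: "house" and "mouse" differ in one letter, so they pass
# the filter for k = 1, while "house" and "table" are rejected outright.
print(may_match(fingerprint("house"), fingerprint("mouse"), 1))
print(may_match(fingerprint("house"), fingerprint("table"), 1))
```

In this sketch the comparison costs one XOR and one population count regardless of string length. The filter can only reject candidates; any pair that passes must still be verified with a full distance computation.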

mathematics subject classification 2000: 68W32

reference: Vol. 38, 2019, No. 2, pp. 367–389

doi: 10.31577/cai_2019_2_367

Computing and Informatics

formerly Computers and Artificial Intelligence
