Distributive Join Strategy Based on Tuple Inversion

keywords: Distributed hash table (DHT), distributed join, inverted file
In this paper, we propose a new direction for distributive join operations. We assume that there will be a scalable distributed computer system in which many computers (processors) are connected through a communication network that can be in a LAN or as part of the Internet with sufficient bandwidth. A relational database is then distributed across this network of processors. However, in our approach, the distribution of the database is very fine-grained and is based on the Distributed Hash Table (DHT) concept. A tuple of a table is assigned to a specific processor by using a fair hash function applied to its key value. For each joinable attribute, an inverted file list is further generated and distributed again based on the DHT. This pre-distribution is done when the tuple enters the system and therefore does not require any distribution of data tuples on the fly when the join is executed. When a join operation request is broadcast, each processor performs a local join and the results are sent back to a query processor which, in turn, merges the join results and returns them to the user. Note that the distribution of the DHT of the inverted file lists can be either preprocessed or distributed on the fly. If the lists are preprocessed and distributed, they have to be maintained. We evaluate our approach by comparing it empirically to two other approaches: the naive join method and the fully distributed join method. The results show a significantly higher performance of our method for a wide range of possible parameters.
reference: Vol. 24, 2005, No. 4, pp. 391–413