Nearest Neighbor Clustering over Partitioned Data
keywords: Clustering algorithm, mobile and static agents, data privacy, vertically and horizontally partitioned data
Most clustering algorithms assume that all the relevant data are available on a single node of a computer network. In the emerging distributed and networked knowledge environments, databases relevant for computations may reside on a number of nodes connected by a communication network. These data resources cannot be moved to other network sites due to privacy, security, and size considerations. The desired global computation must be decomposed into local computations to match the distribution of data across the network. The capability to decompose computations must be general enough to handle different distributions of data and different participating nodes in each instance of the global computation. In this paper, we present a methodology and algorithm for clustering distributed data in d-dimensional space, using nearest neighbor clustering, wherein each distributed data source is represented by an agent. Each such agent has the capability to decompose global computations into local parts, for itself and for agents at other sites. The global computation is then performed by the agent either exchanging some minimal summaries with other agents or traveling to all the sites and performing local tasks that can be done at each local site. The objective is to perform global tasks with a minimum of communication or travel by participating agents across the network.
reference: Vol. 30, 2011, No. 5, pp. 1011–1036