New Algorithm for Clustering Distributed Data Using mathbf k-Means

Ahmed M. Khedr

Computer Science Department, Faculty of Sciences
University of Sharjah, Sharjah, UAE
&
Mathematics Department, Faculty of Science
Zagazig University, Zagazig, Egypt
Raj K. Bhatnagar

School of Electronic and Computing Systems
University of Cincinnati, Cincinnati, 45221, USA

New Algorithm for Clustering Distributed Data Using mathbf k-Means

keywords: Data privacy, decomposable algorithm, k-means clustering, vertically and horizontally distributed data

The internet era and high speed networks have ushered in the capabilities to have ready access to large amounts of geographically distributed data. Individuals, businesses, and governments recognize the value of this available resource to those who can transform the data into information. These databases, though valuable as individual entities, become significantly more valuable when they function as parts of a federated database and their data can be aggregated for collective mining or computations. This requires new algorithms to shift their focus from working with single databases to efficiently working with federated databases. In this paper, we propose a new decomposable version of the popular k-means clustering algorithm that works in this desired manner with a set of networked databases. We show that it is possible to perform global computation in a reasonably secure manner for either horizontally or vertically distributed databases. The computation is completed by only exchanging a few local summaries among the databases. An empirical and analytical validation of our results is also presented.

mathematics subject classification 2000: 68U99

reference: Vol. 33, 2014, No. 4, pp. 943–964

Computing and Informatics

formerly Computers and Artificial Intelligence

New Algorithm for Clustering Distributed Data Using mathbf k-Means