Scalable Data Clustering with Partial Affinity Propagation on MapReduce

Overview

Data clustering is an invaluable process in modern information systems enabling a range of applications including query expansion, search result organization, and interactive searching in exploratory settings. However, major challenges associated with clustering efficiency and scalability have hindered broad adoption. Computational complexity is a particular challenge when applying clustering techniques on the web because of the web’s scale. In recent years a technique called Affinity Propagation (AP) has grown in popularity. AP improves error rates and clustering speed relative to previous solutions such as k-centers clustering, however at web-scale AP continues to suffer efficiency problems. To address the inefficiencies of existing clustering techniques a team of Drexel researchers, led by Dr. Weimao Ke, have developed a modified AP algorithm called Pruned AP. Pruned AP incorporates a technique for eliminating weak associations in the AP similarity matrix that leads to a reduction in the computational complexity of AP from O(N3) to O(N). This significantly reduces the time for data processing and network communication with little impact on clustering quality. In addition, the Drexel team has developed a distributed implementation of both AP and Pruned AP for the Hadoop MapReduce platform, resulting in even greater performance gains.