Scalable Data Clustering with Partial Affinity Propagation on MapReduce
Overview
Data clustering is an invaluable process in modern information systems enabling a range of applications including query expansion, search result organization, and interactive searching in exploratory settings. However, major challenges associated with clustering efficiency and scalability have hindered broad adoption. Computational complexity is a particular challenge when applying clustering techniques on the web because of the web’s scale. In recent years a technique called Affinity Propagation (AP) has grown in popularity. AP improves error rates and clustering speed relative to previous solutions such as k-centers clustering, however at web-scale AP continues to suffer efficiency problems. To address the inefficiencies of existing clustering techniques a team of Drexel researchers, led by Dr. Weimao Ke, have developed a modified AP algorithm called Pruned AP. Pruned AP incorporates a technique for eliminating weak associations in the AP similarity matrix that leads to a reduction in the computational complexity of AP from O(N3) to O(N). This significantly reduces the time for data processing and network communication with little impact on clustering quality. In addition, the Drexel team has developed a distributed implementation of both AP and Pruned AP for the Hadoop MapReduce platform, resulting in even greater performance gains.
Applications
- Information Retrieval
- Query expansion
- Search result organization
- Interactive search in exploratory settings
Advantages
- Faster clustering speed
- Comparable clustering quality
- Leverages additional Hadooop MapReduce performance gains
References
Ke, Weimao, Xiaoli Song, Sheik Hassan, and Xuemei Gong. "Scalable Text Clustering with Partial Affinity Propagation on MapReduce." In ACM WSDM 2015 Workshop on Scalable Data Analytics: Theory and Applications (SDATA'15), 1-9. Shanghai, China, 2015.
Commercialization Opportunities