Scalable Data Clustering with Partial Affinity Propagation on MapReduce

Overview

Data clustering is an invaluable process in modern information systems enabling a range of applications including query expansion, search result organization, and interactive searching in exploratory settings.  However, major challenges associated with clustering efficiency and scalability have hindered broad adoption. Computational complexity is a particular challenge when applying clustering techniques on the web because of the web’s scale.  In recent years a technique called Affinity Propagation (AP) has grown in popularity.  AP improves error rates and clustering speed relative to previous solutions such as k-centers clustering, however at web-scale AP continues to suffer efficiency problems.  To address the inefficiencies of existing clustering techniques a team of Drexel researchers, led by Dr. Weimao Ke, have developed a modified AP algorithm called Pruned AP.  Pruned AP incorporates a technique for eliminating weak associations in the AP similarity matrix that leads to a reduction in the computational complexity of AP from O(N3) to O(N).  This significantly reduces the time for data processing and network communication with little impact on clustering quality. In addition, the Drexel team has developed a distributed implementation of both AP and Pruned AP for the Hadoop MapReduce platform, resulting in even greater performance gains.

Applications

  • Information Retrieval
  • Query expansion
  • Search result organization
  • Interactive search in exploratory settings

Advantages

  • Faster clustering speed 
  • Comparable clustering quality
  • Leverages additional Hadooop MapReduce performance gains

References

Ke, Weimao, Xiaoli Song, Sheik Hassan, and Xuemei Gong. "Scalable Text Clustering with Partial Affinity Propagation on MapReduce." In ACM WSDM 2015 Workshop on Scalable Data Analytics: Theory and Applications (SDATA'15), 1-9. Shanghai, China, 2015.

 

 

Commercialization Opportunities

 

Contact Information

 

Robert B. McGrath, Ph.D.

Senior Associate Vice Provost

Office of Technology Commercialization

Drexel University

3180 Chestnut Street, Ste. 104

The Left Bank

Philadelphia, PA 19104

Phone: 215-895-0303

E-mail: RBM26@drexel.edu

 

 

 

For Technical Information:

 

Weimao Ke, Ph.D.

Assistant Professor

College of Computing and Informatics

3141 Chestnut Street

Philadelphia, PA 19194

Phone: (215) 895 5912

Email: wk@drexel.edu

Web: http://drexel.edu/cci/contact/Faculty/Ke-Weimao/