Clustering, a fundamental technique in data analysis and machine learning, involves grouping data points to ensure higher similarities within a group (cluster) than across clusters. Since its inception in the early 20th century, clustering has proven highly valuable in diverse fields such as biology, economics, marketing, statistics, computer science, and social network analysis. This CAREER project aims to create a unified framework along with various tools and techniques to design optimal approximation algorithms for a broad range of NP-hard clustering problems. Despite significant progress in developing efficient approximations for computationally hard clustering problems, the analysis of the large-scale and complex datasets remains challenging resulting in suboptimal accuracy and limiting practical applications in science and engineering. The algorithmic development and analysis are expected to have a direct impact on the fields of data science and bioinformatics. The educational components of the project include a 3-day summer workshop for high school students, training and mentoring of graduate students and the development of new courses. The project comprises two primary thrusts. The first thrust focuses on addressing clustering challenges in strings with special emphasis on centroid-based clustering problems such as k-median and k-center. This study will encompass various metric spaces including edit distance, Ulam, and Kendall tau. The objective is to formulate methodologies that transcend the limitations set by the triangle inequality, yielding polynomial-time algorithms with approximations arbitrarily close to one. The second thrust focuses on issues related to hierarchical clustering. Here the objective is to evaluate how well the techniques and frameworks originally devised for addressing string clustering problems can be adapted for aggregating hierarchical clusters. Additionally, the project will develop new methods for constructing hierarchical clusters specifically tailored to address challenges related to large datasets. The new tools and combinatorial methods developed in this project are expected to enhance algorithms for clustering problems and should find applicability in various other domains including communication complexity, streaming algorithms, and distributed systems. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria. NSF Award ID: 2337832 | Program: 01002829DB NSF RESEARCH & RELATED ACTIVIT,01002425DB NSF RESEARCH & RELATED ACTIVIT,01002728DB NSF RESEARCH & RELATED ACTIVIT,01002627DB NSF RESEARCH & RELATED ACTIVIT,01002526DB NSF RESEARCH & RELATED ACTIVIT | Principal Investigator: Debarati Das | Institution: Pennsylvania State Univ University Park, UNIVERSITY PARK, PA | Award Amount: $508,736 View on NSF Award Search: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2337832 View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2337832.html

CAREER: A Theoretical Exploration of Efficient and Accurate Clustering Algorithms

Description

Interested in this grant?

Grant Details

External Links

Get personalized grant matches