High-quality data are increasingly central to modern machine learning and artificial intelligence, enabling advances in scientific discovery, automated decision-making, and emerging AI technologies. Yet transparent and reliable mechanisms for crediting and compensating those who contribute data used to train AI systems are often lacking. This project will develop statistical and machine-learning methods for measuring the value of data in AI model training and data-driven decision systems. The work addresses fundamental challenges in data valuation, including robustness to strategic manipulation, computational scalability for large-scale learning systems, and principled uncertainty quantification in assigning value to data contributions. The outcomes of this project will support transparent, fair, and sustainable AI data ecosystems while improving incentives for sharing high-quality and socially beneficial data. The project will also support graduate and undergraduate training, development of educational materials, public dissemination of results, and open-source software for the broader AI and data science communities.

The research will develop statistical foundations for scalable and robust Shapley-value-based data valuation in modern machine learning through three integrated directions. First, it will develop priority-aware valuation rules that incorporate precedence relationships and priority weights, so that originality, provenance, and individual risk considerations can be handled within a unified axiomatic framework for AI data attribution. Second, it will study the statistical and computational limits of approximating Shapley values and related semi-values in high-dimensional and large-scale learning settings, with the goal of designing efficient estimation and approximation algorithms for contemporary AI models. Third, it will develop a population-level theory of data value through Shapley density, a continuous analogue of finite-sample data valuation, establish convergence guarantees, and provide methods for accelerated computation and principled uncertainty quantification. Together, these contributions are expected to advance the statistical and algorithmic foundations of AI data valuation, enable scalable and trustworthy assessment of training data contributions in machine learning systems, and support fair and robust data-sharing ecosystems for future AI applications.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2610424 | Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT | Principal Investigator: Yuan Zhang | Institution: OHIO STATE UNIVERSITY, THE, COLUMBUS, OH | Award Amount: $150,000
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2610424
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2610424.html

Collaborative Research: Statistical Foundations for Scalable and Robust Data Valuation

