Scientific findings should come with error rates that mean what they say: among findings assigned a 5 percent chance of error, about 5 in 100 should turn out to be wrong. This standard, called calibration, underlies trusted probability claims from weather forecasting to machine learning (ML), but it is not yet a routine part of the statistical tools used in many large-scale scientific studies. The issue arises whenever researchers must triage long lists of possible discoveries, anomalies, or published claims. In metascience, the question is which findings in the literature will replicate; in artificial intelligence (AI) safety, which suspicious model inputs deserve greater scrutiny. Current methods control the average error rate across an entire list of discoveries, but they rarely provide individual findings with calibrated error probabilities. This award supports research on calibrated hypothesis testing, which will develop methods that distinguish strong evidence from borderline evidence with interpretable, rigorous guarantees. The work will support more reproducible science and safer data-driven AI/ML systems, while training graduate researchers, developing new instructional materials, and releasing open-source software.

This project will develop theory and methodology for calibrated, large-scale inference. The framework draws upon probabilistic forecasting but addresses a distinct challenge: unlike forecasting, where labels are eventually observed, in multiple testing the ground truth is never revealed, so calibration must be assessed stochastically and established indirectly. The investigators will combine empirical Bayes estimation with frequentist finite-sample guarantees, extending local and boundary false discovery rates beyond settings with independent p-values. Variable selection will serve as the first setting, using knockoff and sign-symmetric statistics to construct local error assessments for selected variables. Conformal outlier detection will extend these ideas to discrete and dependent p-values produced by a shared calibration dataset. Online testing will build on both directions by treating sequential threshold choice as an online learning problem under distribution drift. Together, these three settings will demonstrate that calibrated local error rates constitute a fully functional statistical concept with broad applicability.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2610644 | Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT | Principal Investigator: William Fithian | Institution: University of California-Berkeley, BERKELEY, CA | Award Amount: $133,285
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2610644
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2610644.html
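
As an illustration of the calibration standard described in the abstract, the following is a minimal Python sketch and not the investigators' method. It assumes a textbook two-groups model whose parameters are known exactly: a fraction `null_fraction` of test statistics are nulls drawn from N(0, 1), and the rest are drawn from N(`alt_mean`, 1). All function names and parameter values here are hypothetical choices made for the example. Because the data are simulated, the true labels are available for checking, which is precisely what the abstract says is not the case in real multiple testing.

```python
import math
import random


def normal_pdf(z, mean=0.0):
    """Density of a unit-variance normal distribution at z."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)


def local_fdr(z, null_fraction, alt_mean):
    """Probability that z is a null, given its value, in the assumed two-groups model."""
    null_part = null_fraction * normal_pdf(z)
    alt_part = (1.0 - null_fraction) * normal_pdf(z, mean=alt_mean)
    return null_part / (null_part + alt_part)


def calibration_table(n_tests=200_000, null_fraction=0.8, alt_mean=3.0,
                      n_bins=5, seed=0):
    """Bin simulated findings by stated error probability and compare with the truth.

    Returns one (mean stated probability, observed null fraction, count)
    tuple per non-empty bin.
    """
    rng = random.Random(seed)
    stated_sum = [0.0] * n_bins
    null_count = [0] * n_bins
    total = [0] * n_bins
    for _ in range(n_tests):
        # Simulated ground truth: hidden in practice, visible only in this toy setup.
        is_null = rng.random() < null_fraction
        z = rng.gauss(0.0 if is_null else alt_mean, 1.0)
        stated = local_fdr(z, null_fraction, alt_mean)
        b = min(int(stated * n_bins), n_bins - 1)
        stated_sum[b] += stated
        null_count[b] += is_null
        total[b] += 1
    return [
        (stated_sum[b] / total[b], null_count[b] / total[b], total[b])
        for b in range(n_bins)
        if total[b] > 0
    ]


if __name__ == "__main__":
    for stated, observed, count in calibration_table():
        print(f"stated {stated:.3f}  observed {observed:.3f}  n={count}")
```

In each bin, the average stated error probability should sit close to the fraction of findings that really are nulls. The sketch uses oracle parameters, so it shows only what calibration means; estimating those quantities from data and handling dependent or discrete p-values are the hard parts the project addresses.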

Collaborative Research: Calibrated Hypothesis Testing

