Artificial intelligence (AI) systems increasingly assist with decisions in areas such as medicine, science, and education, but they still make simple reasoning mistakes and can therefore be unreliable in high-stakes settings. This project develops a science of AI reliability by identifying why these systems fail and by designing principled methods to make them more dependable, both on their own and when working with humans. The project's novelty is a unified framework that connects model-level robustness to human-AI collaboration, grounded in controlled mathematical environments that isolate real-world failure modes while remaining tractable for rigorous analysis. The approach bridges theory with targeted experimentation to produce insights that transfer to full-scale systems. The project's broader significance lies in providing scientific foundations for safer deployment of AI in critical applications such as medical diagnosis and in producing open-source evaluation tools for the research community. In addition to its technical objectives, the project extends its impact by expanding the Learning Theory Alliance mentorship program, promoting research that requires minimal computational resources, and developing new graduate and undergraduate courses.

At the model level, the research characterizes why transformers, the backbone of modern AI systems, learn brittle shortcuts rather than robust algorithms, and it develops principled interventions in training data, training algorithms, and inference to make reasoning reliable by design. These include representative training sets that teach robust algorithmic behavior, sample-efficient methods for multi-step reasoning, and formal safeguards against adversarial manipulation. At the interaction level, the research develops a theory of human-AI collaboration under complementary information and imperfect alignment.
The resulting protocols are auditable, grounded in verifiable conditions such as calibration, and enable humans and AI to combine information and achieve near-optimal outcomes even when the system is not perfectly aligned. Reliable collaboration requires reliable internal reasoning, and failures in collaboration reveal new requirements for model-level reliability. Across both directions, controlled test environments produce precise predictions validated on real benchmarks. This work strengthens the scientific basis for deploying AI safely where the cost of failure is high. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2543725
Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT,01002930DB NSF RESEARCH & RELATED ACTIVIT,01003031DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Surbhi Goel
Institution: University of Pennsylvania, Philadelphia, PA
Award Amount: $354,579
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2543725
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2543725.html

CAREER: A Full-Stack Science of AI Reliability: Robust Models and Tractable Collaboration
