Training modern artificial intelligence systems requires large amounts of computing time, energy, and money. Many of the optimization methods used to train neural networks are still chosen largely through trial and error because existing theory does not adequately explain why some methods work better than others on different model architectures. This project will develop a scientific foundation for making training faster, more reliable, and more resource efficient by linking optimization methods to the structure of the neural networks they are used to train. The project can reduce the cost and energy use of model training, provide more dependable guidance for practitioners, and support efficient and reliable artificial intelligence development. It will also support graduate education in optimization for deep learning, research-preparation activities for undergraduates, and hands-on artificial intelligence learning modules for local high school students, with participation in project activities open to all.

This project studies how neural network architecture influences optimization through three complementary directions: structured preconditioning, optimization methods adapted to different notions of distance, and scale invariance induced by normalization layers. The research will analyze representative model components such as multilayer perceptrons, attention modules, embedding parameters, and layers preceding normalization. It will develop theory explaining when optimization methods matched to the model architecture improve training efficiency, characterize their behavior on losses whose landscape geometry is shaped by the network architecture, and study how optimization affects the quality of learned solutions beyond training loss alone, including downstream tasks and performance when data differ from the training distribution.
These ideas will be tested through controlled experiments on representative neural network architectures and larger-scale validation on production-relevant training pipelines. The project will release open-source software, reproducible experimental configurations, benchmark results, and public documentation.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2544658
Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT, 01002930DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Zhiyuan Li
Institution: Toyota Technological Institute at Chicago, Chicago, IL
Award Amount: $349,963
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2544658
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2544658.html

CAREER: An Architecture-Aware Optimization Theory for Deep Learning: Non-Euclidean Descent, Structured Preconditioning, and Scale Invariance
