CAREER: ProTrain: Enabling Efficient Large Language Model Training via Performance, Energy, and Reliability Co-optimizations

The rapid growth of large language models (LLMs) has enabled major advances in artificial intelligence (AI), including systems that assist with writing, coding, education, and decision-making. However, training these models demands enormous computing resources, creating significant challenges across multiple dimensions, including model quality, training time, energy efficiency, and reliability. Although many optimization techniques have been proposed, most focus on only one or a few aspects of training, leaving their overall impact on total training efficiency unclear. This project addresses this gap by developing a systematic understanding of the trade-offs among existing optimization strategies and by delivering a quantitative efficiency model that enables informed, cost-aware decision-making for LLM training. In addition, the project advances optimization methods in underexplored areas, particularly energy efficiency and reliability. The anticipated outcomes will promote more sustainable computing practices, strengthen national competitiveness in AI, and support applications that advance economic growth, education, national security, and public services. The project also establishes an integrated education program to support workforce development and expand participation in advanced computing and the AI industry.

This project develops a unified framework for analyzing and optimizing LLM training efficiency across performance, energy consumption, reliability, and model quality, addressing the growing gap between the unprecedented resource demands of LLM training and the limitations of existing optimization approaches. The research comprises three primary components. First, it develops a novel efficiency model that integrates performance, energy, reliability, and quality optimizations to enable holistic decision-making for large-scale training systems. Second, it designs a mathematically grounded, checkpoint-free fault tolerance mechanism that improves error detection and correction while reducing end-to-end training costs and mitigating failure-related interruptions. Third, it develops a knowledge-driven energy optimization approach that enhances the energy efficiency of large-scale LLM training and expands the performance-energy trade-off space to meet diverse cost constraints. The resulting techniques will be integrated into leading large-scale training frameworks and evaluated using state-of-the-art workloads to demonstrate improvements in scalability, robustness, and cost efficiency.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2540555
Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT, 01002930DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Jieyang Chen
Institution: University of Oregon Eugene, EUGENE, OR
Award Amount: $308,232
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2540555
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2540555.html
