High-Performance Computing (HPC) has revolutionized various scientific fields, including climate research, wildlife health, agricultural sciences, and scientific simulations and modeling. With the emergence of HPC-accelerated deep learning (HPC-DL) systems and applications, there is a pressing need for comprehensive cross-layer training materials to educate the research workforce on these advanced technologies. The primary objective of this pilot project is to address this need by providing comprehensive cross-layer HPC-DL training to a wide range of cyberinfrastructure (CI) users. The target audience includes undergraduate and graduate students, postdocs, faculty, and research staff who can benefit from enhanced knowledge and skills in utilizing HPC-DL CI technologies and resources. By equipping them with the necessary training, the project aims to improve their research efficiency and maximize the potential of HPC-DL in their respective fields. In addition, the project has a specific focus on fostering inclusivity and expanding opportunities for underrepresented communities in the Central Valley area of California. This will contribute to the national interest by empowering individuals with the knowledge and skills necessary to excel in the HPC-DL field.

This project addresses the critical training needs of the converged HPC-DL field by developing comprehensive training materials, fostering peer consultant programs, conducting workshops, and building an inclusive learning culture. It includes an integration of scientific applications, HPC technologies, and DL in a cross-layer approach. The training program covers several important CI topics, including Remote Direct Memory Access (RDMA), GPU-based distributed computing, Slurm, MPI, and NCCL, which are critical to achieving high performance for HPC-DL workloads. The training will also dive into distributed DL training frameworks such as PyTorch, TensorFlow, and Horovod, enabling participants to effectively leverage these tools for their research. Moreover, the training incorporates practical DL application case studies, offering real-world examples and insights.

The short-term goal is to empower individuals with HPC-DL knowledge and cross-layer optimization skills to maximize the utilization of HPC-DL CI resources and improve research efficiency. This project will also examine the effectiveness of practice-central models and HPC-DL-centered workshops in promoting HPC-DL adoption in underrepresented communities. The project's long-term aim is to cultivate a robust research workforce with a deep understanding of HPC-DL CIs. By establishing a learning culture and targeting a significant number of CI users, this project addresses workforce shortages and extends its impact beyond the Central Valley. Through collaborations and the dissemination of open-source training materials, it will contribute to advancing compute- and data-intensive scientific simulations and knowledge discovery.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2623546
Program: 01002324DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Xiaoyi Lu
Institution: University of Florida, GAINESVILLE, FL
Award Amount: $176,965
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2623546
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2623546.html

CyberTraining: Pilot: Cross-Layer Training of High-Performance Deep Learning Technologies and Applications for Research Workforce Development in Central Valley




