Vision foundation models (VFMs) are artificial intelligence systems for "all-purpose" understanding of images and videos. They are currently extremely expensive to create, which restricts their development to a few highly resourced institutions and leaves independent researchers and the public unable to fully explore how these systems learn. This project seeks to democratize this research by creating a highly efficient training method inspired by how human infants learn. A human child acquires foundational visual skills in a limited number of waking hours, a small amount of experience compared with the massive amount of data used by current VFMs. By using longitudinal video and audio recorded from the viewpoint of infants, this project develops a training process that is affordable on university budgets.

Developing and understanding efficient training methods through this infant-inspired approach will make artificial intelligence research more accessible to the broader public. The project also provides unique educational opportunities for students and offers insights that can be transferred to specialized industries where data is often limited, such as medical imaging and vocational training. Expanding community involvement in building these models will ultimately promote artificial intelligence safety, enhance transparency, and build public trust.

The technical goal of this project is to formalize a developmentally plausible, data-efficient pretraining framework for VFMs through three thrusts. First, the research team will establish a core framework by curating longitudinal, egocentric audiovisual recordings of human infants and designing a suite of evaluation benchmarks strictly aligned with early cognitive milestones. Second, the project will bridge inherent sensory and temporal gaps in the recordings by employing model ensembling to simulate tactile and gustatory senses from audiovisual cues and by using a meta-learning formulation to optimally mix heterogeneous data sources. Third, the investigators will design novel model architectures and pretraining algorithms tailored to a continuous "baby learning" paradigm. To this end, the research incorporates continuous-state Hopfield networks as an expansive associative memory module that mitigates catastrophic forgetting, and introduces a monotonic neural network for non-linear uncertainty calibration that does not sacrifice accuracy on the pretext tasks. Together, these three thrusts will yield open-source baseline models, developmental benchmarks, and algorithms that enable the broader scientific community to investigate highly efficient pretraining methodologies.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2540851
Program: 01002930DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT, 01002627DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Boqing Gong
Institution: Trustees of Boston University, BOSTON, MA
Award Amount: $371,191
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2540851
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2540851.html

CAREER: Democratizing the Pretraining of Vision Foundation Models: A Developmentally Plausible Framework
