Videos are central to how people learn, communicate, and understand the world, capturing subtle differences in actions that distinguish expert performance from novice attempts. From surgical procedures and athletic movements to everyday tasks and instructional content, these fine-grained motions carry critical information about intent, skill, and outcome. However, today’s artificial intelligence systems struggle to interpret such nuanced dynamics, limiting their effectiveness in real-world applications such as education, healthcare, robotics, and assistive technologies. While existing systems can recognize broad activities, they often fail to identify precise actions, track objects over time, or explain how events unfold. They also struggle to efficiently analyze long videos or multiple videos at once, which are common in practical settings.

This project addresses the urgent need for open and efficient tools that can better understand how actions unfold over time in videos. By advancing the ability of AI systems to interpret dynamic visual information, the project will promote progress in science and engineering, support workforce development through improved training technologies, enhance accessibility through assistive video understanding systems, and broaden participation in AI by releasing openly available resources for researchers, educators, and developers.

This project develops a new class of open video-language models designed to overcome fundamental limitations in how current systems represent and reason about video. The research focuses on three tightly integrated innovations. First, it introduces trajectory-based video tokenization methods that represent videos using motion and object-centric units, reducing redundancy while preserving important temporal and spatial structure. Second, it designs flexible encoder architectures that can process multiple videos at varying resolutions, allowing models to dynamically allocate computational resources to both long-term temporal patterns and fine-grained visual details, and enabling few-shot reasoning across multiple examples. Third, it develops multimodal decoding methods that produce both textual and spatial outputs, such as object tracks, segmentations, and event descriptions, enabling grounded reasoning about dynamic scenes. Together, these components form a unified framework for understanding complex real-world video data. The project will develop new algorithms and architectures, train models on large-scale datasets, and evaluate them on benchmarks for fine-grained action recognition, temporal reasoning, and grounded video understanding. This work is expected to establish a new foundation for video-based AI systems capable of supporting applications in robotics, scientific analysis, education, and beyond.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2541049
Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT, 01002930DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Ranjay Krishna
Institution: University of Washington, SEATTLE, WA
Award Amount: $348,442
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2541049
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2541049.html

CAREER: Open & Grounded Video-Language Models



