open | BALTIMORE, MD

CAREER: From Redundancy to Efficiency -- Scalable, Observable Multicast for AI Datacenters

National Science Foundation

Description

Artificial intelligence systems rely on large groups of computers working together, but communication between these machines is becoming a major bottleneck. Today, many systems send the same data repeatedly between machines, which wastes time, energy, and network capacity. This project develops new ways for computers to share information more efficiently by using multicast, a method that allows one machine to send data to many others at once, much like a radio broadcast. The goal is to make communication faster, more reliable, and easier to manage in large-scale computing systems that support modern artificial intelligence applications.

This project addresses the scalability and observability challenges that currently prevent the deployment of multicast in production environments. The technical work is organized into three integrated thrusts. The first thrust develops a scalable data plane using topology-aware algorithms to construct efficient transmission trees and introduces a new way to compress network forwarding state to fit within the limited memory of standard hardware. The second thrust creates an introspectable control plane and monitoring system that uses advanced probes and machine learning to detect and localize hidden network failures in real time. The third thrust integrates these research findings into the university curriculum through the creation of hands-on laboratory modules and a structured mentoring pipeline for students.

This project improves the efficiency and reliability of artificial intelligence infrastructure that supports applications such as healthcare, science, and large-scale data analysis. By reducing communication overhead, it lowers energy consumption and operational costs in datacenters. The project also contributes to workforce development by integrating research into hands-on educational modules, mentoring students, and providing practical experience with real-world systems. Open-source tools, datasets, and experimental platforms produced by the project will enable broader adoption in both academia and industry, strengthening the overall computing ecosystem.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2543556
Program: 01002930DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT, 01002627DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Soudeh Ghorbani
Institution: Johns Hopkins University, BALTIMORE, MD
Award Amount: $389,092
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2543556
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2543556.html

Grant Details

Funding Range

$389,092 - $389,092

Deadline

June 30, 2031

Geographic Scope

BALTIMORE, MD

Status
open
