Knowledge acquisition, the ability of artificial intelligence (AI) systems to extract actionable insights from vast amounts of unstructured text, is critical for advancements in healthcare, education, and scientific discovery. While Large Language Models (LLMs) have shown impressive capabilities, their reliability depends heavily on massive, perfectly curated datasets, which are expensive and often unavailable in specialized domains. This CAREER project addresses this bottleneck by developing a new paradigm called “structure-aware weak supervision.” Instead of relying on perfect human annotations, the project enables AI systems to learn autonomously from incomplete, noisy, and ambiguous data by discovering and utilizing underlying semantic structures, such as concept hierarchies and retrieval pathways. By reducing the dependency on expensive labeled data, this research democratizes the development of highly accurate, domain-specific AI tools for resource-constrained environments, such as public health agencies and community organizations. The project also integrates these research outcomes into new undergraduate and graduate curricula, open-source educational toolkits, and targeted K-12 outreach programs designed to broaden participation in computing and teach the next generation how to build reliable, human-centered AI systems.

This project proposes a unified framework for learning under weak supervision by bridging unstructured language data with structured, interpretable knowledge representations. The research is organized into three synergistic thrusts. Thrust 1 tackles incomplete supervision by inducing latent ontologies from unlabeled corpora via a novel Spherical Hierarchical Expectation-Maximization (SHEM) algorithm, enabling scalable information extraction and classification without predefined schemas. Thrust 2 addresses noisy supervision by designing a Denoising Retrieval-Augmented Generation (DeRAG) framework. It integrates symbolic reasoning over the induced ontologies with Structure-Aware Contrastive Retrieval (SACRet) to actively filter distractors and reliably ground language model outputs. Thrust 3 tackles ambiguous supervision by modeling complex, multi-faceted human preferences. It introduces a Tree of Reward Models (TreeRM) and Hierarchical Dirichlet Thompson Sampling (HDTS) to capture both shared foundational values (e.g., safety, factuality) and personalized user preferences (e.g., tone), ensuring robust AI alignment. Together, these contributions advance the theoretical foundations and practical methodologies of knowledge-centric AI, creating systems that autonomously construct knowledge, dynamically adapt to supervision gaps, and reliably align with hierarchical human values.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2541536
Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT, 01002930DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Yu Meng
Institution: University of Virginia Main Campus, CHARLOTTESVILLE, VA
Award Amount: $538,798
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2541536
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2541536.html

CAREER: Structure-Aware Learning from Weak Supervision for Knowledge Acquisition
