Multimodal datasets, which combine sources such as medical imaging, clinical records, and genetic information, have the potential to significantly advance our understanding of complex systems and improve health outcomes. However, their heterogeneity and high dimensionality, together with a lack of reliable statistical tools, often lead to unstable analyses or misleading conclusions. These issues, along with the limited ability to rigorously quantify uncertainty or disentangle relationships among data sources, pose a major barrier to the adoption of data-driven methods in high-stakes settings where the cost of error can be substantial (e.g., clinical decision-making, disease monitoring, and health policy). This research project will develop robust, scalable, and statistically principled methods for integrating and analyzing multimodal data, with a particular emphasis on uncertainty quantification.

The project also integrates research and education through: (a) the involvement of undergraduate students, graduate students, and postdoctoral researchers in both research and dissemination, along with mentoring to support their continued professional development; (b) the integration of research findings in UCLA courses and openly accessible online materials; and (c) workshops and outreach activities designed to broaden participation in data science.

In more detail, the research focuses on the challenge of nonparametric estimation and uncertainty quantification for multimodal data, in which multiple high-dimensional and heterogeneous data sources must be integrated to enable reliable inference. The initial goal is to develop robust and scalable methods for estimating the effects of individual modalities, using deep learning to model auxiliary structures and kernel-based techniques to provide uncertainty quantification. Building on these methods, the follow-up goal is to construct machine learning-powered estimators that identify and quantify the pathways through which modalities influence outcomes, combining multi-level Monte Carlo techniques with deep learning predictions and observational data to uncover complex mediator effects. A third goal is to extend these approaches to longitudinal settings, enabling inference in the presence of temporal dynamics and high-dimensional confounding. The project will build upon techniques and tools from deep learning, empirical process theory, Monte Carlo methods, and reproducing kernel Hilbert spaces to ensure both statistical rigor and computational efficiency.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2515903
Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Xiaowu Dai
Institution: University of California-Los Angeles, LOS ANGELES, CA
Award Amount: $139,995
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2515903
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2515903.html

Non-parametric estimation for multimodal data: From statistical theory to efficient algorithms




