This project studies how to create synthetic datasets that retain useful patterns from sensitive data while protecting the privacy of individuals. Many hospitals, companies, public agencies, and researchers need data to improve services and test ideas, but they often cannot share original records because those records contain private information. The project addresses this gap by making data sharing safer and more useful. Its novel contributions are a general way to break synthetic data generation into two connected steps, new methods that combine classical statistical ideas with modern learning tools, and systematic ways to use public data and existing models without weakening privacy protection. Its broader significance is that it expands safe access to data for research and education, strengthens privacy practice in data-driven fields, and creates training and research opportunities for students.

Specifically, the research develops a framework that separates synthetic data generation into two steps: extracting information from sensitive data under formal privacy protection based on differential privacy, and reconstructing synthetic data from the extracted information. Within this framework, the project has three research thrusts. First, for tabular data, it examines why statistical methods often outperform neural network methods and designs hybrid methods that combine the strengths of both approaches. Second, for image and multimodal data, it studies adaptive high-order projections, including Fourier representations, to capture broad structure and preserve relationships across data types. Third, it develops a double-cone framework for selecting, expanding, and adapting public data sources and for using existing models, so that public information can improve synthetic data quality in a systematic way. The project also brings these ideas into courses, student research, open-source tools, and public demonstrations.
The expected results are stronger foundations and more practical methods for privacy-protected synthetic data generation across application areas.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

NSF Award ID: 2543284
Program: 01002930DB NSF RESEARCH & RELATED ACTIVIT, 01003031DB NSF RESEARCH & RELATED ACTIVIT, 01002627DB NSF RESEARCH & RELATED ACTIVIT
Principal Investigator: Tianhao Wang
Institution: University of Virginia Main Campus, CHARLOTTESVILLE, VA
Award Amount: $395,277
View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2543284
View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2543284.html

CAREER: Advancing Differentially Private Data Synthesis: A Holistic Approach
