The objective of this project is to improve understanding of how synthetic data generators, as a privacy-enhancing technology, work with large real-world data (RWD) (e.g., datasets with over 30 billion rows of data) in a secure supercomputing environment. Synthetic data are developed through statistical or machine learning techniques that produce a dataset of new records whose aggregate statistical properties are similar to those of the original dataset while protecting privacy and confidentiality. Synthetic data also provide a tiered access approach to restricted data that may otherwise be more difficult to access. This work will inform the development of a synthetic data generator toolkit that will include, but not be limited to, methods to assess privacy risk and data utility and open-source AI methods to generate synthetic data.
Synthetic Data Generation with Large, Real-World Data
Project leads
Lisa Mirel (NSF/NCSES), May Aydin (NSF/NCSES), Ken Gersing (NIH/NCATS), Sam Michaels (NIH/NCATS)
Advancing AI Research
This project will implement open-source synthetic data generation that relies on machine learning models and will ensure fidelity to the truth source.
Advancing NAIRR Infrastructure
This project will use a secure supercomputing platform to generate synthetic data and to assess data quality against the truth source.
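A data quality assessment against the truth source can be illustrated with a minimal sketch. It is not the project's method: the generator below is a deliberately simple stand-in (independent Gaussians fitted per column), and the function names `generate_synthetic` and `mean_gap` and the toy data are hypothetical. It ignores correlations between columns and offers no formal privacy guarantee; it only shows the idea of producing new records and comparing one aggregate statistic with the original.

```python
import random
import statistics


def generate_synthetic(truth, n_rows, seed=0):
    """Sample new records from per-column Gaussians fitted to `truth`.

    Assumption: numeric columns treated as independent (a toy model,
    not a production generator).
    """
    rng = random.Random(seed)
    columns = list(zip(*truth))  # row-major -> column-major
    params = [(statistics.fmean(c), statistics.stdev(c)) for c in columns]
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in params)
        for _ in range(n_rows)
    ]


def mean_gap(truth, synthetic):
    """Absolute difference in column means: one crude utility measure."""
    return [
        abs(statistics.fmean(t) - statistics.fmean(s))
        for t, s in zip(zip(*truth), zip(*synthetic))
    ]


# Toy "truth source": two numeric columns (hypothetical values).
_rng = random.Random(42)
truth = [(_rng.gauss(50, 10), _rng.gauss(120, 15)) for _ in range(2000)]
synthetic = generate_synthetic(truth, n_rows=2000, seed=1)
gaps = mean_gap(truth, synthetic)
```

A small gap in every column suggests the synthetic data preserve that aggregate property. A fuller assessment would also compare variances, correlations, and downstream model results, and would measure privacy risk separately.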
Broader Impacts
The resulting lessons learned will inform the National Secure Data Service (NSDS) Demonstration project by providing the research and statistical communities with access to AI capabilities, a model for increased secure access opportunities, and methods for assessing the risk and utility of using data for evidence building.
Innovative Partnerships
This is a joint project between the NAIRR pilot and the NSDS Demonstration, which are independent initiatives with expected synergies, as reflected in the CHIPS and Science Act requirement that the NSDS Demonstration consult with the NAIRR Task Force in the NSDS development. In addition, NIH data sets, DOE computing resources, and NCSES’s America’s DataHub Consortium (ADC) will be used to ensure the success of the project.