Traditionally, real-world clinical data has been difficult to leverage for AI research because of challenges in amassing large volumes of medical reports and limitations in sharing these reports due to patient privacy concerns. These challenges make it difficult to obtain sufficient numbers of samples from underrepresented classes for effective learning, to benchmark trained models effectively, and to deploy these models in operational settings. This project leverages the existing collaboration between the Department of Energy and the National Cancer Institute, along with an existing certified secure data enclave and secure computing environment at Oak Ridge National Laboratory, to use AI to generate a high-fidelity synthetic dataset from unstructured clinical text. We will verify that the synthetic data sufficiently captures key features of the real-world cancer data and will characterize the “fit” of the synthetic data to develop metrics to assess the quality of the synthetic dataset and the accuracy of the synthetic data for downstream classifier performance. A distinct AI security research team will analyze the synthetic data in a secure environment to attempt to uncover any personal information that was leaked from the real-world data and will develop metrics for the security of the synthetic dataset. The final deliverable will be a dataset resembling real-world cancer data from six states in the United States, with metrics for Quality, Accuracy, Trustworthiness, and Security.
NAIRR Secure: Democratizing AI for cancer with privacy-preserving synthetic data generation for cancer case identification
Project leads, key team members
Heidi Hanson, John Gounley, Patrycja Krawczuk, Adam Spannaus, Christopher Stanley, Jiayi Wang, Edmon Begoli
Key elements
- Democratizing AI for cancer research
- Enabling AI research at population scale
- Improving methods for anonymization and privacy preservation
- Innovative Partnerships: National Cancer Institute
- Advancing computational health sciences
More information
Learn more at https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/mossaic