R Submission Pilot 7

Pilot 7 - Synthetic Clinical Trial Data

Goal: Develop and curate modern, realistic, and reusable synthetic clinical trial datasets that can serve as a high-quality public benchmark for evaluating R tools, AI-assisted analysis workflows, and end-to-end regulatory submissions.

Key evaluation aspects:

  • For the Development Team
    • Evaluate publicly available synthetic electronic data capture (EDC) data (e.g., from OpenClinica) as a foundation for realistic submission datasets
    • Leverage AI tools to convert and transform raw synthetic data (e.g., XML) into tidy, analysis-ready datasets aligned with CDISC standards
    • Develop skills and workflows tailored to clinical data simulation, and release them as open source
  • For the Broader Community
    • Provide modern, publicly available synthetic clinical trial data that addresses limitations of the existing CDISC Pilot 1 dataset (outdated standards, limited scale, insufficient complexity)
    • Support tool demonstration, method evaluation, and community education across the pharmaceutical ecosystem

Data and analysis scope:

  • Targets Phase 3 clinical trial scale and complexity
  • Initial dataset sourced from OpenClinica (8 MB of synthetic EDC data in XML format, with no identifying information)
  • Data will be organized into separate directories by type within the repository
  • Designed to benchmark R tools, AI-assisted analysis, automation, and end-to-end submission workflows
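
As a rough illustration of the XML-to-tidy step mentioned above, the sketch below flattens nested EDC-style XML into one row per subject, visit, form, and item. It is only a minimal sketch under stated assumptions: the sample XML is invented for this example and assumes a simplified, namespace-free ODM-style nesting (SubjectData > StudyEventData > FormData > ItemGroupData > ItemData). The actual OpenClinica export may differ, and the function name flatten_odm and all OIDs and values are hypothetical. Python is used purely for illustration; the same reshaping could equally be done in R.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ODM-style snippet invented for this sketch.
# A real export will likely carry XML namespaces and more attributes.
SAMPLE_XML = """
<ODM>
  <ClinicalData StudyOID="S_DEMO">
    <SubjectData SubjectKey="SUBJ-001">
      <StudyEventData StudyEventOID="SE_SCREENING">
        <FormData FormOID="F_VITALS">
          <ItemGroupData ItemGroupOID="IG_VS">
            <ItemData ItemOID="I_SYSBP" Value="120"/>
            <ItemData ItemOID="I_DIABP" Value="80"/>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
    <SubjectData SubjectKey="SUBJ-002">
      <StudyEventData StudyEventOID="SE_SCREENING">
        <FormData FormOID="F_VITALS">
          <ItemGroupData ItemGroupOID="IG_VS">
            <ItemData ItemOID="I_SYSBP" Value="135"/>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>
"""


def flatten_odm(xml_text):
    """Return a list of dicts, one per collected item (long/tidy layout)."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for subject in root.iter("SubjectData"):
        for event in subject.iter("StudyEventData"):
            for form in event.iter("FormData"):
                for group in form.iter("ItemGroupData"):
                    for item in group.iter("ItemData"):
                        rows.append({
                            "subject": subject.get("SubjectKey"),
                            "event": event.get("StudyEventOID"),
                            "form": form.get("FormOID"),
                            "item_group": group.get("ItemGroupOID"),
                            "item": item.get("ItemOID"),
                            "value": item.get("Value"),
                        })
    return rows


rows = flatten_odm(SAMPLE_XML)
for row in rows:
    print(row)
```

The long layout keeps every value as a string and leaves typing, unit handling, and any mapping to CDISC domains to later steps, which are outside the scope of this sketch.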

Links:

R Submission Pilot 7 Development Repository

Key team members:

Developer team:

  • Brandon Theodorou (Keiji AI)
  • Eric Nantz (Eli Lilly)
  • J Sun (Keiji AI)
  • Ning Leng (AbbVie)
  • Robert Devine
  • Yilong Zhang
  • Zifeng Wang