
Synthea All Diseases

A Nextflow pipeline that generates Synthea disease modules and synthetic patient data for any disease you name.

Overview

This pipeline leverages Nextflow to orchestrate the generation of disease modules and synthetic patient data using Synthea. It supports:

  1. Automatic generation of disease modules using Claude AI
  2. Synthetic patient generation with configurable parameters using the actual Synthea engine
  3. Analysis of generated patient data

Requirements

  • Docker
  • Docker Compose
  • Nextflow (version 20.10.0 or higher)
  • Java (required by Nextflow)
  • Python 3.6+ (if running scripts directly)

Quick Start

The easiest way to get started is to use our convenience scripts:

# Set up the environment (builds Docker containers and prepares directories)
./scripts/prepare_environment.sh

# Run the pipeline for a specific disease
./scripts/run_pipeline.sh --disease "Parkinson's Disease" --patients --population 50

Manual Setup

  1. Clone this repository:

    git clone https://github.com/yourusername/synthea-alldiseases.git
    cd synthea-alldiseases
    
  2. Create a .env file by copying .env.example, then add your API keys:

    cp .env.example .env
    # Edit .env with your preferred text editor
    
  3. Build and start the Docker containers:

    docker-compose build
    docker-compose up -d synthea
    

Usage

Basic Command

nextflow run main.nf --disease_name "Disease Name" [options]

Examples

Generate a module for Hypertension and create 100 patients:

nextflow run main.nf --disease_name "Hypertension" --generate_patients true --population 100 --gender 0.6

Generate a module for Parkinson's Disease, create 50 patients, and analyze the data:

nextflow run main.nf --disease_name "Parkinson's Disease" --generate_patients true --population 50 --analyze_patient_data true

Parameters

Parameter                Description                                Default
--disease_name           Name of the disease to model               (required)
--modules_dir            Directory for modules                      modules
--output_dir             Directory for output files                 output
--generate_patients      Generate patient data                      false
--population             Number of patients to generate             100
--gender                 Fraction of female patients (0 to 1)       0.5
--min_age                Minimum patient age                        0
--max_age                Maximum patient age                        90
--seed                   Random seed for reproducibility            (random)
--analyze_patient_data   Analyze generated data                     false
--report_format          Format for analysis report                 html
--force_generate         Force regeneration of modules              false
--publish_dir            Directory for published output             published_output
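
If you launch runs from a script, the parameters above can be assembled and sanity-checked before Nextflow starts. The following is a minimal sketch, not part of the repository: `build_command` and the `DEFAULTS` dictionary are hypothetical names, and the range checks (gender between 0 and 1, minimum age not above maximum age) are assumptions drawn from the parameter descriptions rather than from main.nf.

```python
import shlex

# Defaults copied from the parameter list in this README.
DEFAULTS = {
    "modules_dir": "modules",
    "output_dir": "output",
    "generate_patients": False,
    "population": 100,
    "gender": 0.5,
    "min_age": 0,
    "max_age": 90,
    "analyze_patient_data": False,
    "report_format": "html",
    "force_generate": False,
    "publish_dir": "published_output",
}


def build_command(disease_name, **overrides):
    """Return the argument list for one pipeline run (hypothetical helper)."""
    if not disease_name.strip():
        raise ValueError("--disease_name is required")
    unknown = set(overrides) - set(DEFAULTS) - {"seed"}
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))

    # Validate against the merged view so defaults take part in the checks.
    params = {**DEFAULTS, **overrides}
    if not 0 <= params["gender"] <= 1:
        raise ValueError("--gender must be between 0 and 1")
    if params["min_age"] > params["max_age"]:
        raise ValueError("--min_age must not exceed --max_age")

    cmd = ["nextflow", "run", "main.nf", "--disease_name", disease_name]
    for name, value in overrides.items():
        # Booleans are written as true/false, as in the examples above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += ["--" + name, text]
    return cmd


if __name__ == "__main__":
    cmd = build_command("Parkinson's Disease", generate_patients=True, population=50)
    # shlex.quote keeps the apostrophe and space in the disease name shell-safe.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Only the explicitly overridden parameters are passed on, so the pipeline's own defaults stay authoritative if they ever differ from the list here.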

Understanding the Data Flow

  1. Module Generation: The pipeline first looks for an existing module for the specified disease. If not found, it generates one using the module_generator.
  2. Patient Generation: If requested, the pipeline uses the actual Synthea engine to generate synthetic patient data based on the disease module.
  3. Analysis: If requested, the pipeline analyzes the generated patient data and produces reports.
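
The three steps above amount to a small decision procedure, which can be sketched as follows. This is an illustration only, not the logic in main.nf: the `plan_run` and `module_filename` helpers are hypothetical, the file naming rule (lowercase, underscores) is an assumption, and so is the choice to run analysis only when patients were generated in the same run.

```python
import re
from pathlib import Path


def module_filename(disease_name):
    """Assumed naming rule: lowercase, runs of other characters become '_'."""
    slug = re.sub(r"[^a-z0-9]+", "_", disease_name.lower()).strip("_")
    return slug + ".json"


def plan_run(disease_name, modules_dir="modules", generate_patients=False,
             analyze_patient_data=False, force_generate=False):
    """Return the ordered list of stages a run would go through."""
    module_path = Path(modules_dir) / module_filename(disease_name)
    stages = []

    # Step 1: reuse an existing module unless regeneration is forced.
    if force_generate or not module_path.exists():
        stages.append("generate_module")
    else:
        stages.append("reuse_module")

    # Step 2: patient generation is opt-in.
    if generate_patients:
        stages.append("generate_patients")
        # Step 3: analysis needs patient data to work on (assumption).
        if analyze_patient_data:
            stages.append("analyze_patient_data")
    return stages


if __name__ == "__main__":
    print(plan_run("Hypertension", generate_patients=True))
```

The sketch also shows where --force_generate fits in: it bypasses the lookup in step 1 and has no effect on the later steps.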

Directory Structure

  • modules/: Contains generated disease modules
  • module_generator/: Contains the AI-powered module generation scripts
  • scripts/: Utility scripts for the pipeline
  • output/: Generated patient data (temporary)
  • published_output/: Final output data that persists between runs
    • published_output/modules/: Contains the generated modules
    • published_output/{disease_name}/: Contains patient data for each disease

Convenience Scripts

  • scripts/prepare_environment.sh: Sets up the environment and starts containers
  • scripts/run_pipeline.sh: Simplified interface for running the pipeline
  • scripts/analyze_patient_data.py: Analyzes generated patient data
  • scripts/check_condition_structure.py: Validates module JSON structure
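
To give a sense of what validating a module's JSON structure can involve, here is a minimal sketch. It is not the contents of scripts/check_condition_structure.py; `check_module`, `check_file`, and `transition_targets` are hypothetical names. It relies on Synthea's Generic Module Framework conventions: a module has a name and a states object, the entry state is named "Initial" with type "Initial", and transitions refer to states by name. Only four transition kinds are covered; table-based transitions are left out.

```python
import json


def transition_targets(state):
    """Yield the state names that a state's transition points to."""
    if "direct_transition" in state:
        yield state["direct_transition"]
    # Both kinds are lists of options that each carry a "transition".
    for key in ("distributed_transition", "conditional_transition"):
        for option in state.get(key, []):
            yield option.get("transition")
    # A complex branch has either a "transition" or nested "distributions".
    for branch in state.get("complex_transition", []):
        if "transition" in branch:
            yield branch["transition"]
        for option in branch.get("distributions", []):
            yield option.get("transition")


def check_module(module):
    """Return a list of structural problems found in a parsed module."""
    problems = []
    if not isinstance(module.get("name"), str):
        problems.append("missing or non-string 'name'")
    states = module.get("states")
    if not isinstance(states, dict) or not states:
        return problems + ["missing or empty 'states' object"]
    if states.get("Initial", {}).get("type") != "Initial":
        problems.append("no 'Initial' state of type 'Initial'")
    for name, state in states.items():
        if "type" not in state:
            problems.append("state '%s' has no 'type'" % name)
        for target in transition_targets(state):
            if target not in states:
                problems.append(
                    "state '%s' transitions to unknown state '%s'" % (name, target))
    return problems


def check_file(path):
    """Load a module file and check it."""
    with open(path, encoding="utf-8") as handle:
        return check_module(json.load(handle))


if __name__ == "__main__":
    example = {
        "name": "Example",
        "states": {
            "Initial": {"type": "Initial", "direct_transition": "Done"},
            "Done": {"type": "Terminal"},
        },
    }
    print(check_module(example))  # prints []
```

Returning a list of problems instead of raising on the first one lets a single pass report everything wrong with a generated module.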

Troubleshooting

If you encounter issues:

  1. Check that Docker containers are running:

    docker ps | grep synthea
    
  2. Ensure your modules directory has the required modules:

    ls -la modules/
    
  3. Check logs for detailed error messages:

    tail -f .nextflow.log
    
  4. Try rebuilding the Docker containers:

    docker-compose down
    docker-compose build
    docker-compose up -d synthea
    
  5. If module generation fails, check that your API keys are correctly set in the .env file
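
For step 5, a quick way to spot an unset key is to parse the .env file and list the entries that are missing or empty. The sketch below assumes simple KEY=VALUE lines; `read_env` and `missing_keys` are hypothetical helpers, and the key name ANTHROPIC_API_KEY is an assumption, so use the names that appear in .env.example.

```python
def read_env(path=".env"):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding quotes so KEY="abc" and KEY=abc read the same.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    # ANTHROPIC_API_KEY is an assumed name; check .env.example for the real one.
    missing = missing_keys(read_env(), ["ANTHROPIC_API_KEY"])
    if missing:
        print("Missing or empty in .env:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

This only checks that a value is present. Whether the key is valid shows up when the module generator actually calls the API.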

License

This project uses the same license as Synthea (Apache License 2.0).
