Synthea Disease Module Generator Guide

This guide explains how to use our Nextflow pipeline to generate Synthea disease modules and synthetic patient data.

Overview

The pipeline provides three main functions:

  1. Disease Module Generation: Creates Synthea disease modules using Claude AI
  2. Synthetic Patient Generation: Uses the generated modules to create synthetic patient data with configurable demographic characteristics
  3. Patient Data Analysis: Generates statistics and reports from the synthetic patient data

Prerequisites

  • Nextflow installed
  • Synthea installed, with a compatible Java runtime (Java 8 or 11)
  • Anthropic API key (for Claude)
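
Before a first run it can help to confirm that these prerequisites are in place. The following is a minimal sketch and not part of the pipeline: the helper names are hypothetical, and it assumes the API key is supplied through the ANTHROPIC_API_KEY environment variable, which this guide does not confirm.

```shell
# Hypothetical preflight helpers; not part of the pipeline itself.

# Succeeds if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Prints one line per missing prerequisite; prints nothing if all are present.
# Assumption: the API key is read from ANTHROPIC_API_KEY.
missing_prereqs() {
  for cmd in nextflow java; do
    have_cmd "$cmd" || echo "missing command: $cmd"
  done
  [ -n "${ANTHROPIC_API_KEY:-}" ] || echo "missing variable: ANTHROPIC_API_KEY"
  return 0
}
```

Calling missing_prereqs with no output suggests the basics are available; it does not verify that the Synthea installation itself works.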

Basic Usage

1. Generating Disease Modules

To generate disease modules for specific diseases:

# Generate a module for a single disease
nextflow run synthea_module_generator.nf --disease_name "Seasonal Allergies"

# Generate modules for multiple diseases
nextflow run synthea_module_generator.nf --disease_name "Asthma,Diabetes,Hypertension"
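
For longer lists it may be easier to keep one disease name per line in a file and build the comma-separated value from it. This is a sketch under that assumption; the function name and the diseases.txt file are hypothetical.

```shell
# Join the non-empty lines read from stdin into one comma-separated string.
join_disease_names() {
  awk 'NF { printf "%s%s", sep, $0; sep = "," } END { if (sep != "") print "" }'
}

# Usage (diseases.txt is a hypothetical file with one name per line):
#   nextflow run synthea_module_generator.nf \
#     --disease_name "$(join_disease_names < diseases.txt)"
```

Names containing spaces, such as "Seasonal Allergies", are preserved because each line is taken whole.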

2. Generating Synthetic Patients

To generate synthetic patients with the specified diseases:

# Generate 100 patients with Asthma (default parameters)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true

# Generate 1000 patients with specific parameters
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --gender 0.6 \
  --min_age 40 \
  --max_age 80 \
  --seed 12345 \
  --location "Massachusetts"

3. Analyzing Patient Data

To generate patients and analyze the resulting data:

# Generate patients and produce an HTML analysis report
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --analyze_output true \
  --report_format html

Available Parameters

Module Generation Parameters

  • --disease_name: Name of the disease(s) to generate (comma-separated for multiple)
  • --modules_dir: Directory for modules (default: src/main/resources/modules)
  • --batch_size: Number of modules to generate per batch (default: 1)
  • --max_cost: Maximum cost for API calls (default: 5.0 USD)
  • --timeout: Maximum time per batch in seconds (default: 300)

Patient Generation Parameters

Basic Patient Parameters

  • --generate_patients: Set to true to generate patients (default: false)
  • --population: Number of patients to generate (default: 100)
  • --gender: Gender distribution: M (all male), F (all female), or a decimal giving the fraction of female patients (e.g., 0.7 = 70% female)
  • --min_age: Minimum patient age (default: 0)
  • --max_age: Maximum patient age (default: 100)
  • --seed: Random seed for reproducibility (default: 12345)
  • --location: Location for patients (default: random)
  • --output_dir: Output directory for patients (default: output/synthetic_patients)

Enhanced Demographic Parameters

  • --race_ethnicity: Comma-separated list of race/ethnicity groups with their population fractions (e.g., white=0.7,hispanic=0.15,black=0.15)
  • --socioeconomic: Socioeconomic status distribution (e.g., high=0.2,middle=0.5,low=0.3)
  • --zip_codes: Comma-separated list of ZIP codes to distribute patients across (e.g., 02138,02139,02140)
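
ZIP codes such as 02138 start with a zero, so they need to be passed as quoted strings rather than numbers. The sketch below checks that a --zip_codes value is a comma-separated list of 5-digit codes; the function name is hypothetical, and whether the pipeline accepts other forms (for example ZIP+4) is not stated in this guide.

```shell
# Succeeds when every comma-separated entry is exactly five digits.
valid_zip_list() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{5}(,[0-9]{5})*$'
}

# Usage:
#   valid_zip_list "02138,02139,02140" || echo "check --zip_codes" >&2
```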

Disease Prevalence Parameters

  • --prevalence: Fraction of the population with the disease, between 0.0 and 1.0 (e.g., 0.05 for 5%)
  • --comorbidities: Whether to include common comorbidities (set to true or false, default: false)

Analysis Parameters

  • --analyze_output: Whether to run analysis on the output (default: false)
  • --report_format: Format for the analysis report (html, csv, json) (default: html)

Detailed Configuration Guide

Controlling Demographics

The following options control the demographic distribution of the patient population:

Gender Distribution

# Generate all male patients
nextflow run synthea_module_generator.nf --disease_name "Prostate Cancer" --generate_patients true --gender M

# Generate all female patients
nextflow run synthea_module_generator.nf --disease_name "Ovarian Cancer" --generate_patients true --gender F

# Generate 60% female, 40% male
nextflow run synthea_module_generator.nf --disease_name "Diabetes" --generate_patients true --gender 0.6
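
Because --gender accepts either a letter or a decimal, a quick check before launching a long run can catch typos. This is a minimal sketch with a hypothetical function name; it accepts only uppercase M and F, as written in this guide, and the pipeline's own validation may differ.

```shell
# Succeeds if the value is M, F, or a decimal between 0 and 1 inclusive.
valid_gender() {
  case "$1" in
    M|F) return 0 ;;
  esac
  # Accept plain decimals such as 0, 0.6, .5 or 1.0, then range-check with awk.
  printf '%s\n' "$1" | grep -Eq '^([0-9]+\.?[0-9]*|\.[0-9]+)$' || return 1
  awk -v v="$1" 'BEGIN { exit !(v + 0 >= 0 && v + 0 <= 1) }'
}

# Usage:
#   valid_gender "0.6" || echo "invalid --gender value" >&2
```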

Age Distribution

# Generate pediatric patients (0-18 years)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true --min_age 0 --max_age 18

# Generate elderly patients (65-90 years)
nextflow run synthea_module_generator.nf --disease_name "Parkinsons" --generate_patients true --min_age 65 --max_age 90

Race and Ethnicity

# Generate patients with specific racial distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --race_ethnicity "white=0.6,black=0.2,hispanic=0.15,asian=0.05"

Socioeconomic Status

# Generate patients with specific socioeconomic distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --socioeconomic "high=0.2,middle=0.5,low=0.3"
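
The distribution strings for --race_ethnicity and --socioeconomic in this guide all add up to 1.0. Whether the pipeline normalizes or rejects other totals is not stated here, so it may be worth checking the sum yourself. The helper names below are hypothetical.

```shell
# Print the sum of the shares in a "name=share,name=share" string, to 2 decimals.
distribution_sum() {
  printf '%s\n' "$1" | awk -F',' '{
    total = 0
    for (i = 1; i <= NF; i++) {
      split($i, pair, "=")   # pair[1] is the group name, pair[2] its share
      total += pair[2]
    }
    printf "%.2f\n", total
  }'
}

# Succeeds when the shares add up to 1.00 after rounding to 2 decimals.
distribution_is_complete() {
  [ "$(distribution_sum "$1")" = "1.00" ]
}

# Usage:
#   distribution_is_complete "high=0.2,middle=0.5,low=0.3" || echo "shares do not sum to 1" >&2
```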

Disease Prevalence Simulation

You can control the prevalence of diseases in your synthetic population:

# Generate 1000 patients with 8% diabetes prevalence (an approximate U.S. figure)
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08

# Generate patients with comorbidities
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --prevalence 0.3 \
  --comorbidities true
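
As a rough sanity check, the expected number of affected patients is the population multiplied by the prevalence. The sketch below does that arithmetic with a hypothetical helper; the count the pipeline actually produces depends on how it applies --prevalence and on the random seed, so treat the result as an approximation.

```shell
# Expected number of affected patients: population * prevalence,
# rounded to the nearest whole patient.
expected_cases() {
  awk -v n="$1" -v p="$2" 'BEGIN { printf "%d\n", n * p + 0.5 }'
}

# Usage:
#   expected_cases 1000 0.08
```

With --population 1000 and --prevalence 0.08 this works out to about 80 patients with the disease.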

Analysis Reports

You can generate analysis reports in various formats:

# Generate HTML report (default)
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --analyze_output true

# Generate CSV reports
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --analyze_output true \
  --report_format csv

Example Scenarios

Realistic Diabetes Population

Generate a diabetes population whose demographics approximate the U.S. population:

nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08 \
  --race_ethnicity "white=0.6,black=0.13,hispanic=0.18,asian=0.06,native=0.03" \
  --min_age 18 \
  --max_age 90 \
  --analyze_output true

Pediatric Asthma Study Cohort

Generate a pediatric asthma cohort for a simulated clinical study:

nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --min_age 5 \
  --max_age 17 \
  --gender 0.5 \
  --prevalence 0.08 \
  --analyze_output true \
  --report_format html

Multi-Disease Elderly Population

Generate an elderly population with multiple chronic conditions:

nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension,Arthritis,COPD" \
  --generate_patients true \
  --population 1000 \
  --min_age 65 \
  --max_age 90 \
  --comorbidities true \
  --analyze_output true

Analysis Report Details

The analysis report includes:

  1. Patient Demographics
    • Gender distribution
    • Age distribution (by age groups)
    • Race/ethnicity distribution

  2. Disease Statistics
    • Top 10 conditions in the patient population
    • Top 10 medications prescribed

  3. Summary Statistics
    • Total number of patients
    • Age ranges (min, max, average)
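
To spot-check figures such as the gender or race distribution independently of the report, you can count values directly in the generated data. This sketch assumes Synthea's CSV exporter is enabled and that the output contains a patients.csv file with GENDER and RACE columns; the function name and the file path in the usage comment are hypothetical.

```shell
# Count rows per value of a named column in a CSV read from stdin.
# Assumes a header row and no quoted commas before the chosen column.
count_by_column() {
  awk -F',' -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx     { counts[$idx]++ }
    END     { for (k in counts) print k, counts[k] }
  ' | sort
}

# Usage (the path is an assumption; adjust it to your --output_dir):
#   count_by_column GENDER < output/synthetic_patients/csv/patients.csv
```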

Troubleshooting

Compatibility Issues

If you encounter Java compatibility issues, make sure you are using Java 8 or 11, the versions most compatible with Synthea:

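# Point JAVA_HOME at a Java 8 installation before running
export JAVA_HOME=/path/to/java8
# Put that installation's binaries first on PATH so `java` resolves to it
export PATH="$JAVA_HOME/bin:$PATH"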
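
To see which Java version is actually picked up, you can parse the output of `java -version`. The sketch below uses hypothetical helper names and handles both the legacy "1.8.0_292" style and the newer "11.0.2" style of version string.

```shell
# Extract the major version from a Java version string.
# The legacy scheme "1.8.0_292" yields 8; "11.0.2" yields 11.
java_major_version() {
  printf '%s\n' "$1" | awk -F'.' '{ if ($1 == 1) print $2; else print $1 + 0 }'
}

# Read the quoted version string reported by the java binary on PATH.
current_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Usage:
#   java_major_version "$(current_java_version)"
```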

Debugging Module Generation

If module generation fails:

  1. Check the .error file in the modules directory
  2. Verify your API key is set correctly
  3. Try generating a simpler disease first
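
The first step above can be scripted. This sketch lists error files under the modules directory, defaulting to the --modules_dir default from this guide; it assumes failed generations leave files whose names end in ".error", and the function name is hypothetical.

```shell
# List .error files under a modules directory, sorted by path.
# Assumption: error output is written to files ending in ".error".
list_module_errors() {
  dir="${1:-src/main/resources/modules}"
  find "$dir" -type f -name '*.error' 2>/dev/null | sort
}

# Usage: print each error file followed by its contents.
#   list_module_errors | while IFS= read -r f; do echo "== $f"; cat "$f"; done
```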

Patient Generation Issues

If patient generation fails:

  1. Check that Synthea is properly installed
  2. Verify the modules exist in the modules directory
  3. Check that parameter values are within valid ranges
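
For the last step, the ranges documented in this guide can be checked before launching a run. This is a minimal sketch with hypothetical helper names; it only checks that the ages are non-negative integers with min_age not exceeding max_age, and that prevalence lies between 0.0 and 1.0. The pipeline may enforce additional limits.

```shell
# Succeeds if the argument is a non-negative integer.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# valid_ranges MIN_AGE MAX_AGE PREVALENCE
# Succeeds when MIN_AGE <= MAX_AGE and PREVALENCE is a decimal in [0.0, 1.0].
valid_ranges() {
  is_uint "$1" && is_uint "$2" || return 1
  [ "$1" -le "$2" ] || return 1
  awk -v p="$3" 'BEGIN { exit !(p ~ /^[0-9]*\.?[0-9]+$/ && p + 0 <= 1) }'
}

# Usage:
#   valid_ranges 40 80 0.08 || echo "parameter out of range" >&2
```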