Synthea Disease Module Generator Guide
This guide explains how to use our Nextflow pipeline to generate Synthea disease modules and synthetic patient data.
Overview
Our pipeline provides three main functions:
- Disease Module Generation: Creates Synthea disease modules using Claude AI
- Synthetic Patient Generation: Uses the generated modules to create synthetic patient data with configurable demographic characteristics
- Patient Data Analysis: Generates statistics and reports from the synthetic patient data
Prerequisites
- Nextflow installed
- Synthea installed (with Java 8 or 11 compatibility)
- Anthropic API key (for Claude)
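Before the first run, it can help to confirm that these prerequisites are visible from your shell. The sketch below is a minimal check and not part of the pipeline: check_prereqs is a hypothetical helper, and ANTHROPIC_API_KEY is an assumed variable name (this guide does not state how the key is supplied), so adjust it to match your configuration. It does not check the Synthea installation, whose location depends on your setup.

```shell
# Hypothetical helper: report prerequisites that are not visible from this shell.
# ANTHROPIC_API_KEY is an assumed variable name; adjust it to your setup.
check_prereqs() {
  local missing=0
  local tool
  for tool in nextflow java; do
    # command -v succeeds only if the tool is on PATH
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "missing: ANTHROPIC_API_KEY"
    missing=1
  fi
  return "$missing"
}
```

Calling check_prereqs prints one line per missing item and returns a non-zero status if anything is missing.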
Basic Usage
1. Generating Disease Modules
To generate disease modules for specific diseases:
# Generate a module for a single disease
nextflow run synthea_module_generator.nf --disease_name "Seasonal Allergies"
# Generate modules for multiple diseases
nextflow run synthea_module_generator.nf --disease_name "Asthma,Diabetes,Hypertension"
2. Generating Synthetic Patients
To generate synthetic patients with the specified diseases:
# Generate 100 patients with Asthma (default parameters)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true
# Generate 1000 patients with specific parameters
nextflow run synthea_module_generator.nf \
--disease_name "Diabetes" \
--generate_patients true \
--population 1000 \
--gender 0.6 \
--min_age 40 \
--max_age 80 \
--seed 12345 \
--location "Massachusetts"
3. Analyzing Patient Data
To generate patients and analyze the resulting data:
# Generate patients and produce an HTML analysis report
nextflow run synthea_module_generator.nf \
--disease_name "Asthma" \
--generate_patients true \
--population 500 \
--analyze_output true \
--report_format html
Available Parameters
Module Generation Parameters
- --disease_name: Name of the disease(s) to generate (comma-separated for multiple)
- --modules_dir: Directory for modules (default: src/main/resources/modules)
- --batch_size: Number of modules to generate per batch (default: 1)
- --max_cost: Maximum cost for API calls (default: 5.0 USD)
- --timeout: Maximum time per batch in seconds (default: 300)
Patient Generation Parameters
Basic Patient Parameters
- --generate_patients: Set to true to generate patients (default: false)
- --population: Number of patients to generate (default: 100)
- --gender: Gender distribution: M, F, or a decimal giving the fraction of female patients (e.g., 0.7 = 70% female)
- --min_age: Minimum patient age (default: 0)
- --max_age: Maximum patient age (default: 100)
- --seed: Random seed for reproducibility (default: 12345)
- --location: Location for patients (default: random)
- --output_dir: Output directory for patients (default: output/synthetic_patients)
Enhanced Demographic Parameters
- --race_ethnicity: Comma-separated list of races with percentages (e.g., white=0.7,hispanic=0.15,black=0.15)
- --socioeconomic: Socioeconomic status distribution (e.g., high=0.2,middle=0.5,low=0.3)
- --zip_codes: Comma-separated list of ZIP codes to distribute patients (e.g., 02138,02139,02140)
Disease Prevalence Parameters
- --prevalence: Percentage of population with the disease, between 0.0 and 1.0 (e.g., 0.05 for 5%)
- --comorbidities: Whether to include common comorbidities (set to true or false, default: false)
Analysis Parameters
- --analyze_output: Whether to run analysis on the output (default: false)
- --report_format: Format for the analysis report (html, csv, json) (default: html)
Detailed Configuration Guide
Controlling Demographics
You can precisely control the demographic distribution of your patient population:
Gender Distribution
# Generate all male patients
nextflow run synthea_module_generator.nf --disease_name "Prostate Cancer" --generate_patients true --gender M
# Generate all female patients
nextflow run synthea_module_generator.nf --disease_name "Ovarian Cancer" --generate_patients true --gender F
# Generate 60% female, 40% male
nextflow run synthea_module_generator.nf --disease_name "Diabetes" --generate_patients true --gender 0.6
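To make the three accepted forms of --gender concrete, the following sketch maps a value to the split it describes. describe_gender is a hypothetical helper for illustration only; how the pipeline itself parses the flag is not shown in this guide.

```shell
# Hypothetical helper: describe the gender split implied by a --gender value.
# Accepts M, F, or a decimal between 0 and 1 such as 0.6 (fraction female).
describe_gender() {
  case "$1" in
    M) echo "100% male" ;;
    F) echo "100% female" ;;
    *) awk -v g="$1" 'BEGIN {
         # Reject anything that is not 0, 1, or a decimal such as 0.6
         if (g !~ /^(0(\.[0-9]+)?|1(\.0+)?)$/) {
           print "invalid gender value: " g
           exit 1
         }
         printf "%.0f%% female, %.0f%% male\n", g * 100, (1 - g) * 100
       }' ;;
  esac
}

describe_gender 0.6   # prints: 60% female, 40% male
```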
Age Distribution
# Generate pediatric patients (0-18 years)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true --min_age 0 --max_age 18
# Generate elderly patients (65+ years)
nextflow run synthea_module_generator.nf --disease_name "Parkinsons" --generate_patients true --min_age 65 --max_age 90
Race and Ethnicity
# Generate patients with specific racial distribution
nextflow run synthea_module_generator.nf \
--disease_name "Hypertension" \
--generate_patients true \
--race_ethnicity "white=0.6,black=0.2,hispanic=0.15,asian=0.05"
Socioeconomic Status
# Generate patients with specific socioeconomic distribution
nextflow run synthea_module_generator.nf \
--disease_name "Diabetes" \
--generate_patients true \
--socioeconomic "high=0.2,middle=0.5,low=0.3"
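The distribution examples in this guide all use fractions that add up to 1.0. Assuming the pipeline expects that (the guide does not say what happens otherwise), a quick pre-flight check can catch typos before a long run. check_distribution is a hypothetical helper, not a pipeline command.

```shell
# Hypothetical helper: check that a "key=value,key=value" distribution sums to 1.0.
# Assumption: the pipeline expects the fractions to total 1.0.
check_distribution() {
  echo "$1" | awk -F',' '{
    total = 0
    for (i = 1; i <= NF; i++) {
      # Each entry must look like name=fraction
      n = split($i, kv, "=")
      if (n != 2) { print "malformed entry: " $i; exit 2 }
      total += kv[2]
    }
    # Allow a small tolerance for floating-point rounding
    diff = total - 1.0
    if (diff < 0) diff = -diff
    if (diff > 0.000001) { printf "sum is %.2f, expected 1.00\n", total; exit 1 }
    print "ok"
  }'
}

check_distribution "white=0.6,black=0.2,hispanic=0.15,asian=0.05"   # prints: ok
check_distribution "high=0.2,middle=0.5,low=0.3"                    # prints: ok
```

The function returns a non-zero status when the check fails, so it can gate the nextflow command with `&&`.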
Disease Prevalence Simulation
You can control the prevalence of diseases in your synthetic population:
# Generate 1000 patients with 8% diabetes prevalence (realistic for US population)
nextflow run synthea_module_generator.nf \
--disease_name "Diabetes" \
--generate_patients true \
--population 1000 \
--prevalence 0.08
# Generate patients with comorbidities
nextflow run synthea_module_generator.nf \
--disease_name "Hypertension" \
--generate_patients true \
--prevalence 0.3 \
--comorbidities true
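As a worked example of the arithmetic, if --prevalence is read as the fraction of the generated population that has the disease, then 1000 patients at 0.08 corresponds to about 80 affected patients. The helper below is a hypothetical sketch of that calculation; the count produced by an actual run may differ, since this guide does not describe how the pipeline applies the value.

```shell
# Hypothetical helper: expected number of affected patients,
# assuming prevalence is a simple fraction of the population.
expected_cases() {
  awk -v pop="$1" -v prev="$2" 'BEGIN {
    if (prev < 0 || prev > 1) {
      print "prevalence must be between 0.0 and 1.0"
      exit 1
    }
    # Round to the nearest whole patient
    printf "%d\n", int(pop * prev + 0.5)
  }'
}

expected_cases 1000 0.08   # prints: 80
```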
Analysis Reports
You can generate analysis reports in various formats:
# Generate HTML report (default)
nextflow run synthea_module_generator.nf \
--disease_name "Asthma" \
--generate_patients true \
--analyze_output true
# Generate CSV reports
nextflow run synthea_module_generator.nf \
--disease_name "Diabetes" \
--generate_patients true \
--analyze_output true \
--report_format csv
Example Scenarios
Realistic Diabetes Population
Generate a realistic U.S. diabetes population with proper demographics:
nextflow run synthea_module_generator.nf \
--disease_name "Diabetes" \
--generate_patients true \
--population 1000 \
--prevalence 0.08 \
--race_ethnicity "white=0.6,black=0.13,hispanic=0.18,asian=0.06,native=0.03" \
--min_age 18 \
--max_age 90 \
--analyze_output true
Pediatric Asthma Study Cohort
Generate a pediatric asthma cohort for a simulated clinical study:
nextflow run synthea_module_generator.nf \
--disease_name "Asthma" \
--generate_patients true \
--population 500 \
--min_age 5 \
--max_age 17 \
--gender 0.5 \
--prevalence 0.08 \
--analyze_output true \
--report_format html
Multi-Disease Elderly Population
Generate an elderly population with multiple chronic conditions:
nextflow run synthea_module_generator.nf \
--disease_name "Hypertension,Arthritis,COPD" \
--generate_patients true \
--population 1000 \
--min_age 65 \
--max_age 90 \
--comorbidities true \
--analyze_output true
Analysis Report Details
The analysis report includes:
- Patient Demographics
  - Gender distribution
  - Age distribution (by age groups)
  - Race/ethnicity distribution
- Disease Statistics
  - Top 10 conditions in the patient population
  - Top 10 medications prescribed
- Summary Statistics
  - Total number of patients
  - Age ranges (min, max, average)
Troubleshooting
Compatibility Issues
If you encounter Java compatibility issues, make sure you are using Java 8 or 11, which are the versions most compatible with Synthea:
# Set JAVA_HOME to Java 8 before running
export JAVA_HOME=/path/to/java8
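To see which Java version the shell will actually use, you can inspect the output of java -version. The sketch below assumes the usual output format, in which the version appears in double quotes on a line containing the word version; installed_java_version and java_major are hypothetical helper names.

```shell
# Print the quoted version string from `java -version` (which writes to stderr).
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Reduce a version string to its major number:
# legacy strings such as 1.8.0_292 denote Java 8, while 11.0.2 denotes Java 11.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# Example: java_major "$(installed_java_version)"
java_major 1.8.0_292   # prints: 8
java_major 11.0.2      # prints: 11
```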
Debugging Module Generation
If module generation fails:
- Check the .error file in the modules directory
- Verify your API key is set correctly
- Try generating a simpler disease first
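A quick way to locate error files is to search the modules directory. This sketch assumes the files end in .error and that the default --modules_dir is in use unless another path is passed; list_error_files is a hypothetical helper.

```shell
# Hypothetical helper: list files ending in .error under the modules directory.
# Falls back to the default --modules_dir when no path is given.
list_error_files() {
  find "${1:-src/main/resources/modules}" -type f -name '*.error'
}

# Example: list_error_files            (default directory)
#          list_error_files my_modules (custom directory)
```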
Patient Generation Issues
If patient generation fails:
- Check that Synthea is properly installed
- Verify the modules exist in the modules directory
- Check that parameter values are within valid ranges
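For the last point, a small pre-flight check can catch one common mistake: a minimum age above the maximum. The guide does not list the exact valid ranges, so this sketch only checks that both ages are whole numbers and correctly ordered; check_age_range is a hypothetical helper.

```shell
# Hypothetical helper: sanity-check --min_age and --max_age before a run.
check_age_range() {
  awk -v lo="$1" -v hi="$2" 'BEGIN {
    if (lo !~ /^[0-9]+$/ || hi !~ /^[0-9]+$/) {
      print "ages must be whole numbers"
      exit 1
    }
    # Force numeric comparison
    if (lo + 0 > hi + 0) {
      print "min_age exceeds max_age"
      exit 1
    }
    print "ok"
  }'
}

check_age_range 40 80   # prints: ok
```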