# Synthea Disease Module Generator Guide

This guide explains how to use our Nextflow pipeline to generate Synthea disease modules and synthetic patient data.

## Overview

Our pipeline provides three main functionalities:

1. **Disease Module Generation**: Creates Synthea disease modules using Claude AI
2. **Synthetic Patient Generation**: Uses the generated modules to create synthetic patient data with configurable demographic characteristics
3. **Patient Data Analysis**: Generates statistics and reports from the synthetic patient data

## Prerequisites

- Nextflow installed
- Synthea installed (with Java 8 or 11 compatibility)
- Anthropic API key (for Claude)

## Basic Usage

### 1. Generating Disease Modules

To generate disease modules for specific diseases:

```bash
# Generate a module for a single disease
nextflow run synthea_module_generator.nf --disease_name "Seasonal Allergies"

# Generate modules for multiple diseases
nextflow run synthea_module_generator.nf --disease_name "Asthma,Diabetes,Hypertension"
```

### 2. Generating Synthetic Patients

To generate synthetic patients with the specified diseases:

```bash
# Generate 100 patients with Asthma (default parameters)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true

# Generate 1000 patients with specific parameters
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --gender 0.6 \
  --min_age 40 \
  --max_age 80 \
  --seed 12345 \
  --location "Massachusetts"
```

### 3. Analyzing Patient Data

To generate patients and analyze the resulting data:

```bash
# Generate patients and produce an HTML analysis report
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --analyze_output true \
  --report_format html
```

## Available Parameters

### Module Generation Parameters

- `--disease_name`: Name of the disease(s) to generate (comma-separated for multiple)
- `--modules_dir`: Directory for modules (default: `src/main/resources/modules`)
- `--batch_size`: Number of modules to generate per batch (default: 1)
- `--max_cost`: Maximum cost for API calls (default: 5.0 USD)
- `--timeout`: Maximum time per batch in seconds (default: 300)

### Patient Generation Parameters

#### Basic Patient Parameters

- `--generate_patients`: Set to `true` to generate patients (default: `false`)
- `--population`: Number of patients to generate (default: 100)
- `--gender`: Gender distribution: `M`, `F`, or a decimal giving the fraction of female patients (e.g., `0.7` = 70% female)
- `--min_age`: Minimum patient age (default: 0)
- `--max_age`: Maximum patient age (default: 100)
- `--seed`: Random seed for reproducibility (default: 12345)
- `--location`: Location for patients (default: random)
- `--output_dir`: Output directory for patients (default: `output/synthetic_patients`)

#### Enhanced Demographic Parameters

- `--race_ethnicity`: Comma-separated list of races with their fractions of the population (e.g., `white=0.7,hispanic=0.15,black=0.15`)
- `--socioeconomic`: Socioeconomic status distribution (e.g., `high=0.2,middle=0.5,low=0.3`)
- `--zip_codes`: Comma-separated list of ZIP codes across which to distribute patients (e.g., `02138,02139,02140`)

#### Disease Prevalence Parameters

- `--prevalence`: Fraction of the population with the disease, between 0.0 and 1.0 (e.g., `0.05` for 5%)
- `--comorbidities`: Whether to include common comorbidities (`true` or `false`, default: `false`)

### Analysis Parameters

- `--analyze_output`: Whether to run analysis on the output (default: `false`)
- `--report_format`: Format for the analysis report: `html`, `csv`, or `json` (default: `html`)

## Detailed Configuration Guide

### Controlling Demographics

You can precisely control the demographic distribution of your patient population:

#### Gender Distribution

```bash
# Generate all male patients
nextflow run synthea_module_generator.nf --disease_name "Prostate Cancer" --generate_patients true --gender M

# Generate all female patients
nextflow run synthea_module_generator.nf --disease_name "Ovarian Cancer" --generate_patients true --gender F

# Generate 60% female, 40% male
nextflow run synthea_module_generator.nf --disease_name "Diabetes" --generate_patients true --gender 0.6
```

#### Age Distribution

```bash
# Generate pediatric patients (0-18 years)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true --min_age 0 --max_age 18

# Generate elderly patients (65-90 years)
nextflow run synthea_module_generator.nf --disease_name "Parkinsons" --generate_patients true --min_age 65 --max_age 90
```

#### Race and Ethnicity

```bash
# Generate patients with specific racial distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --race_ethnicity "white=0.6,black=0.2,hispanic=0.15,asian=0.05"
```

#### Socioeconomic Status

```bash
# Generate patients with specific socioeconomic distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --socioeconomic "high=0.2,middle=0.5,low=0.3"
```

### Disease Prevalence Simulation

You can control the prevalence of diseases in your synthetic population:

```bash
# Generate 1000 patients with 8% diabetes prevalence (realistic for the US population)
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08

# Generate patients with comorbidities
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --prevalence 0.3 \
  --comorbidities true
```

### Analysis Reports

You can generate analysis reports in various formats:

```bash
# Generate HTML report (default)
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --analyze_output true

# Generate CSV reports
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --analyze_output true \
  --report_format csv
```

## Example Scenarios

### Realistic Diabetes Population

Generate a realistic U.S. diabetes population with proper demographics:

```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08 \
  --race_ethnicity "white=0.6,black=0.13,hispanic=0.18,asian=0.06,native=0.03" \
  --min_age 18 \
  --max_age 90 \
  --analyze_output true
```

### Pediatric Asthma Study Cohort

Generate a pediatric asthma cohort for a simulated clinical study:

```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --min_age 5 \
  --max_age 17 \
  --gender 0.5 \
  --prevalence 0.08 \
  --analyze_output true \
  --report_format html
```

### Multi-Disease Elderly Population

Generate an elderly population with multiple chronic conditions:

```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension,Arthritis,COPD" \
  --generate_patients true \
  --population 1000 \
  --min_age 65 \
  --max_age 90 \
  --comorbidities true \
  --analyze_output true
```

## Analysis Report Details

The analysis report includes:

1. **Patient Demographics**
   - Gender distribution
   - Age distribution (by age groups)
   - Race/ethnicity distribution
2. **Disease Statistics**
   - Top 10 conditions in the patient population
   - Top 10 medications prescribed
3. **Summary Statistics**
   - Total number of patients
   - Age ranges (min, max, average)

## Troubleshooting

### Compatibility Issues

If you encounter Java compatibility issues, ensure you're using Java 8 or 11, which are the most compatible with Synthea:

```bash
# Set JAVA_HOME to Java 8 before running
export JAVA_HOME=/path/to/java8
```

### Debugging Module Generation

If module generation fails:

1. Check the `.error` file in the modules directory
2. Verify your API key is set correctly
3. Try generating a simpler disease first

### Patient Generation Issues

If patient generation fails:

1. Check that Synthea is properly installed
2. Verify the modules exist in the modules directory
3. Check that parameter values are within valid ranges
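
The second and third checks in the list above can be partly scripted before launching a run. The following is a minimal sketch, not part of the pipeline: the helper names (`is_fraction`, `valid_gender`, `valid_age_range`, `has_modules`, `preflight`) are hypothetical, and it assumes that generated modules are stored as `.json` files directly inside the modules directory and that ages are whole numbers. It does not verify the Synthea installation itself.

```bash
#!/usr/bin/env sh
# Hypothetical pre-flight helpers for the patient generation parameters.
# These are illustrative only; the pipeline may validate its inputs differently.

# Succeeds if $1 is a decimal between 0 and 1 inclusive (e.g. 0, 0.08, 1.0).
is_fraction() {
  printf '%s\n' "$1" | grep -Eq '^(0(\.[0-9]+)?|1(\.0+)?)$'
}

# Succeeds if $1 is a valid --gender value: M, F, or a fraction female.
valid_gender() {
  case "$1" in
    M|F) return 0 ;;
  esac
  is_fraction "$1"
}

# Succeeds if $1 (min age) and $2 (max age) are whole numbers with min <= max.
valid_age_range() {
  case "$1" in ''|*[!0-9]*) return 1 ;; esac
  case "$2" in ''|*[!0-9]*) return 1 ;; esac
  [ "$1" -le "$2" ]
}

# Succeeds if directory $1 exists and contains at least one .json file
# (assumption: modules are written there as .json).
has_modules() {
  [ -d "$1" ] || return 1
  for module_file in "$1"/*.json; do
    [ -e "$module_file" ] && return 0
  done
  return 1
}

# Runs every check and reports each problem on stderr.
# Arguments: modules_dir gender min_age max_age prevalence
preflight() {
  preflight_status=0
  has_modules "$1" || { echo "no .json modules found in: $1" >&2; preflight_status=1; }
  valid_gender "$2" || { echo "--gender must be M, F, or a decimal in [0, 1]" >&2; preflight_status=1; }
  valid_age_range "$3" "$4" || { echo "--min_age must not exceed --max_age" >&2; preflight_status=1; }
  is_fraction "$5" || { echo "--prevalence must be between 0.0 and 1.0" >&2; preflight_status=1; }
  return "$preflight_status"
}
```

After sourcing these functions, a call such as `preflight src/main/resources/modules 0.5 5 17 0.08` returns a non-zero status and prints one message per failed check, so it can gate the `nextflow run` command with `&&`.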