# Synthea Disease Module Generator Guide

This guide explains how to use our Nextflow pipeline to generate Synthea disease modules and synthetic patient data.

## Overview

Our pipeline provides three main functionalities:

1. **Disease Module Generation**: Creates Synthea disease modules using Claude AI
2. **Synthetic Patient Generation**: Uses the generated modules to create synthetic patient data with configurable demographic characteristics
3. **Patient Data Analysis**: Generates statistics and reports from the synthetic patient data

## Prerequisites

- Nextflow installed
- Synthea installed (with Java 8 or 11 compatibility)
- Anthropic API key (for Claude)
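
The prerequisites can be checked from the shell before the first run. The sketch below is one possible check, not part of the pipeline: `missing_prereqs` is a hypothetical helper, and `ANTHROPIC_API_KEY` is an assumed variable name for the API key, so adjust it to match how your setup provides the key.

```bash
# List prerequisites that appear to be missing, one per line.
# No output means everything was found.
# ANTHROPIC_API_KEY is an assumed variable name; adjust it to your setup.
missing_prereqs() {
  command -v nextflow >/dev/null 2>&1 || echo "nextflow"
  command -v java >/dev/null 2>&1 || echo "java"
  [ -n "${ANTHROPIC_API_KEY:-}" ] || echo "ANTHROPIC_API_KEY"
  return 0
}
```

A wrapper script could stop early with `[ -z "$(missing_prereqs)" ] || exit 1`. The check only confirms that the commands and the variable exist; it does not verify that Synthea itself is installed or that the key is valid.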

## Basic Usage

### 1. Generating Disease Modules

To generate disease modules for specific diseases:

```bash
# Generate a module for a single disease
nextflow run synthea_module_generator.nf --disease_name "Seasonal Allergies"

# Generate modules for multiple diseases
nextflow run synthea_module_generator.nf --disease_name "Asthma,Diabetes,Hypertension"
```
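
When the disease list lives in a file, the comma-separated value for `--disease_name` can be assembled with standard tools. This is a minimal sketch: `join_diseases` is a hypothetical helper, not something the pipeline ships.

```bash
# Join lines from standard input into one comma-separated string.
# join_diseases is a hypothetical helper, not part of the pipeline.
join_diseases() {
  paste -sd, -
}

# Example: three names, one per line, become a single argument value
printf '%s\n' "Asthma" "Diabetes" "Hypertension" | join_diseases
# prints: Asthma,Diabetes,Hypertension
```

With a hypothetical `diseases.txt` holding one name per line, the result can be passed as `--disease_name "$(join_diseases < diseases.txt)"`. Blank lines in the file would produce empty entries, so remove them first.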

### 2. Generating Synthetic Patients

To generate synthetic patients with the specified diseases:

```bash
# Generate 100 patients with Asthma (default parameters)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true

# Generate 1000 patients with specific parameters
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --gender 0.6 \
  --min_age 40 \
  --max_age 80 \
  --seed 12345 \
  --location "Massachusetts"
```

### 3. Analyzing Patient Data

To generate patients and analyze the resulting data:

```bash
# Generate patients and produce an HTML analysis report
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --analyze_output true \
  --report_format html
```

## Available Parameters

### Module Generation Parameters

- `--disease_name`: Name of the disease(s) to generate (comma-separated for multiple)
- `--modules_dir`: Directory for modules (default: `src/main/resources/modules`)
- `--batch_size`: Number of modules to generate per batch (default: 1)
- `--max_cost`: Maximum cost for API calls (default: 5.0 USD)
- `--timeout`: Maximum time per batch in seconds (default: 300)

### Patient Generation Parameters

#### Basic Patient Parameters
- `--generate_patients`: Set to `true` to generate patients (default: `false`)
- `--population`: Number of patients to generate (default: 100)
- `--gender`: Gender distribution: `M`, `F`, or a decimal giving the fraction of female patients (e.g., `0.7` = 70% female)
- `--min_age`: Minimum patient age (default: 0)
- `--max_age`: Maximum patient age (default: 100)
- `--seed`: Random seed for reproducibility (default: 12345)
- `--location`: Location for patients (default: random)
- `--output_dir`: Output directory for patients (default: `output/synthetic_patients`)
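
Because `--gender` accepts either a letter or a decimal, a quick pre-flight check can catch typos before a long run starts. The sketch below encodes the rule as described in this list; `valid_gender` is a hypothetical helper, and whether the pipeline also accepts other spellings (for example lowercase letters) is not covered here.

```bash
# Return 0 if the value is M, F, or a decimal between 0 and 1 inclusive.
# valid_gender is a hypothetical helper, not part of the pipeline.
valid_gender() {
  case "$1" in
    M|F) return 0 ;;
  esac
  # awk handles the numeric comparison; POSIX shell has no floating-point math
  printf '%s\n' "$1" |
    awk '/^[0-9]*\.?[0-9]+$/ && $1 >= 0 && $1 <= 1 { ok = 1 } END { exit !ok }'
}

valid_gender "0.6" && echo "0.6 accepted"   # prints: 0.6 accepted
valid_gender "1.5" || echo "1.5 rejected"   # prints: 1.5 rejected
```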

#### Enhanced Demographic Parameters
- `--race_ethnicity`: Comma-separated list of race/ethnicity categories with their proportions (e.g., `white=0.7,hispanic=0.15,black=0.15`)
- `--socioeconomic`: Socioeconomic status distribution (e.g., `high=0.2,middle=0.5,low=0.3`)
- `--zip_codes`: Comma-separated list of ZIP codes to distribute patients across (e.g., `02138,02139,02140`)
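
The distribution examples in this guide all sum to 1.0. Whether the pipeline enforces that is not stated, so treating it as a requirement is an assumption. The hypothetical `distribution_total` helper below sums the weights of a `name=weight` list so you can check a value before passing it.

```bash
# Sum the weights in a "name=weight,name=weight" string.
# distribution_total is a hypothetical helper, not part of the pipeline.
distribution_total() {
  printf '%s\n' "$1" | awk -F, '{
    total = 0
    for (i = 1; i <= NF; i++) {
      split($i, pair, "=")   # pair[1] = category, pair[2] = weight
      total += pair[2]
    }
    printf "%.2f\n", total
  }'
}

distribution_total "white=0.7,hispanic=0.15,black=0.15"   # prints: 1.00
distribution_total "high=0.2,middle=0.5"                  # prints: 0.70
```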

#### Disease Prevalence Parameters
- `--prevalence`: Fraction of the population with the disease, between 0.0 and 1.0 (e.g., `0.05` for 5%)
- `--comorbidities`: Whether to include common comorbidities (`true` or `false`, default: `false`)

### Analysis Parameters
- `--analyze_output`: Whether to run analysis on the output (default: `false`)
- `--report_format`: Format for the analysis report: `html`, `csv`, or `json` (default: `html`)

## Detailed Configuration Guide

### Controlling Demographics

You can precisely control the demographic distribution of your patient population:

#### Gender Distribution

```bash
# Generate all male patients
nextflow run synthea_module_generator.nf --disease_name "Prostate Cancer" --generate_patients true --gender M

# Generate all female patients
nextflow run synthea_module_generator.nf --disease_name "Ovarian Cancer" --generate_patients true --gender F

# Generate 60% female, 40% male
nextflow run synthea_module_generator.nf --disease_name "Diabetes" --generate_patients true --gender 0.6
```

#### Age Distribution

```bash
# Generate pediatric patients (0-18 years)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true --min_age 0 --max_age 18

# Generate elderly patients (65-90 years)
nextflow run synthea_module_generator.nf --disease_name "Parkinsons" --generate_patients true --min_age 65 --max_age 90
```

#### Race and Ethnicity

```bash
# Generate patients with specific racial distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --race_ethnicity "white=0.6,black=0.2,hispanic=0.15,asian=0.05"
```

#### Socioeconomic Status

```bash
# Generate patients with specific socioeconomic distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --socioeconomic "high=0.2,middle=0.5,low=0.3"
```

### Disease Prevalence Simulation

You can control the prevalence of diseases in your synthetic population:

```bash
# Generate 1000 patients with 8% diabetes prevalence (realistic for US population)
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08

# Generate patients with comorbidities
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --prevalence 0.3 \
  --comorbidities true
```
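
As a rough sanity check on the output, the expected number of affected patients is the population multiplied by the prevalence. The sketch below assumes prevalence is applied as a simple fraction of the population, which this guide does not confirm. Actual counts from a randomized simulation will likely vary around this figure, and `expected_cases` is a hypothetical helper.

```bash
# Expected number of patients with the disease: population x prevalence,
# rounded to the nearest whole patient.
# expected_cases is a hypothetical helper, not part of the pipeline.
expected_cases() {
  awk -v n="$1" -v p="$2" 'BEGIN { printf "%d\n", n * p + 0.5 }'
}

expected_cases 1000 0.08   # prints: 80
expected_cases 1000 0.3    # prints: 300
```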

### Analysis Reports

You can generate analysis reports in various formats:

```bash
# Generate HTML report (default)
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --analyze_output true

# Generate CSV reports
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --analyze_output true \
  --report_format csv
```

## Example Scenarios

### Realistic Diabetes Population

Generate a realistic U.S. diabetes population with proper demographics:

```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08 \
  --race_ethnicity "white=0.6,black=0.13,hispanic=0.18,asian=0.06,native=0.03" \
  --min_age 18 \
  --max_age 90 \
  --analyze_output true
```

### Pediatric Asthma Study Cohort

Generate a pediatric asthma cohort for a simulated clinical study:

```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --min_age 5 \
  --max_age 17 \
  --gender 0.5 \
  --prevalence 0.08 \
  --analyze_output true \
  --report_format html
```

### Multi-Disease Elderly Population

Generate an elderly population with multiple chronic conditions:

```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension,Arthritis,COPD" \
  --generate_patients true \
  --population 1000 \
  --min_age 65 \
  --max_age 90 \
  --comorbidities true \
  --analyze_output true
```

## Analysis Report Details

The analysis report includes:

1. **Patient Demographics**
   - Gender distribution
   - Age distribution (by age groups)
   - Race/ethnicity distribution

2. **Disease Statistics**
   - Top 10 conditions in the patient population
   - Top 10 medications prescribed

3. **Summary Statistics**
   - Total number of patients
   - Age ranges (min, max, average)

## Troubleshooting

### Compatibility Issues

If you encounter Java compatibility issues, make sure you are using Java 8 or 11, which are the versions most compatible with Synthea:

```bash
# Set JAVA_HOME to Java 8 before running
export JAVA_HOME=/path/to/java8
```
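
To confirm which Java version is active, you can parse the output of `java -version`, which prints the version string in double quotes on standard error. The helpers below are a hypothetical sketch: `java_major` reduces a version string to its major number, handling both the legacy `1.8.x` scheme and the newer `11.x` scheme.

```bash
# Print the version string reported by the java on your PATH, e.g. 11.0.2
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Reduce a version string to its major version number.
# java_major is a hypothetical helper, not part of the pipeline.
java_major() {
  case "$1" in
    1.*) v=${1#1.}; echo "${v%%.*}" ;;   # legacy scheme: 1.8.0_292 is Java 8
    *)   echo "${1%%[.+-]*}" ;;          # newer scheme: 11.0.2 is Java 11
  esac
}

java_major "1.8.0_292"   # prints: 8
java_major "11.0.2"      # prints: 11
```

For example, `java_major "$(installed_java_version)"` gives the major version of whichever `java` comes first on your `PATH`.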

### Debugging Module Generation

If module generation fails:
1. Check the `.error` file in the modules directory
2. Verify your API key is set correctly
3. Try generating a simpler disease first
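
The first step can be scripted. The sketch below searches a modules directory for error files; the `*.error` pattern is an assumption about how the pipeline names them (it also matches a file named exactly `.error`), and `list_error_files` is a hypothetical helper. The default path is the documented `--modules_dir` default.

```bash
# List error files under a modules directory, sorted by path.
# The "*.error" pattern is an assumption about the pipeline's file naming.
list_error_files() {
  find "${1:-src/main/resources/modules}" -type f -name '*.error' 2>/dev/null | sort
}
```

Running `list_error_files` with no argument searches the default directory; pass a path if you set `--modules_dir` to something else.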

### Patient Generation Issues

If patient generation fails:
1. Check that Synthea is properly installed
2. Verify the modules exist in the modules directory
3. Check that parameter values are within valid ranges
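
The last two checks are easy to script. Synthea modules are JSON files, so counting `*.json` files in the modules directory shows whether anything was generated. The age check encodes an assumed rule (non-negative integers with `min_age` no greater than `max_age`), since this guide does not list the exact ranges the pipeline enforces. Both helpers are hypothetical.

```bash
# Count JSON module files under a modules directory.
count_modules() {
  find "${1:-src/main/resources/modules}" -type f -name '*.json' 2>/dev/null |
    wc -l | tr -d ' '
}

# Return 0 when both ages are non-negative integers and min <= max.
# The rule itself is an assumption about what the pipeline accepts.
valid_age_range() {
  case "$1$2" in
    ''|*[!0-9]*) return 1 ;;   # reject empty input and non-digit characters
  esac
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" -le "$2" ]
}

valid_age_range 5 17 && echo "5-17 accepted"     # prints: 5-17 accepted
valid_age_range 65 40 || echo "65-40 rejected"   # prints: 65-40 rejected
```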