Trying to fix basic functionality again.

2025-03-23 11:53:47 -07:00
parent ebda48190a
commit 2141e81f42
406 changed files with 173963 additions and 69 deletions

SYNTHEA_GUIDE.md Normal file

@@ -0,0 +1,280 @@
# Synthea Disease Module Generator Guide
This guide explains how to use our Nextflow pipeline to generate Synthea disease modules and synthetic patient data.
## Overview
The pipeline provides three main functions:
1. **Disease Module Generation**: Creates Synthea disease modules using Claude AI
2. **Synthetic Patient Generation**: Uses the generated modules to create synthetic patient data with configurable demographic characteristics
3. **Patient Data Analysis**: Generates statistics and reports from the synthetic patient data
## Prerequisites
- Nextflow installed
- Synthea installed, together with a compatible Java runtime (Java 8 or 11)
- Anthropic API key (for Claude)
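Before the first run, it can help to confirm that the required command-line tools are on your `PATH`. The sketch below is a hypothetical helper (`missing_tools` is not part of the pipeline) that prints every command it cannot find. It assumes Nextflow and Java are invoked as `nextflow` and `java`.

```bash
# Hypothetical preflight helper, not part of the pipeline.
# Prints the name of every command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Example: report which of the expected tools are absent.
# missing_tools nextflow java
```

An empty result means every listed tool was found.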
## Basic Usage
### 1. Generating Disease Modules
To generate disease modules for specific diseases:
```bash
# Generate a module for a single disease
nextflow run synthea_module_generator.nf --disease_name "Seasonal Allergies"
# Generate modules for multiple diseases
nextflow run synthea_module_generator.nf --disease_name "Asthma,Diabetes,Hypertension"
```
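For longer disease lists, it may be easier to keep one name per line in a text file and build the comma-separated value from it. The helper below is a sketch; `join_diseases` and the file name `diseases.txt` are hypothetical and not part of the pipeline.

```bash
# Hypothetical helper: turn a file with one disease name per line
# into the comma-separated value that --disease_name expects.
join_diseases() {
  # Drop blank lines, then join the remaining lines with commas.
  grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Example (diseases.txt is an assumed file name):
# nextflow run synthea_module_generator.nf --disease_name "$(join_diseases diseases.txt)"
```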
### 2. Generating Synthetic Patients
To generate synthetic patients with the specified diseases:
```bash
# Generate 100 patients with Asthma (default parameters)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true
# Generate 1000 patients with specific parameters
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --gender 0.6 \
  --min_age 40 \
  --max_age 80 \
  --seed 12345 \
  --location "Massachusetts"
```
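Because `--seed` fixes the random seed, replicate cohorts can be produced by varying only that value. The sketch below prints, without running, one command per seed. `print_seed_runs` is a hypothetical helper, and the per-seed subdirectory under `output/synthetic_patients` is an assumed layout, not something the pipeline creates on its own.

```bash
# Hypothetical helper: print (without running) one pipeline command per seed,
# so each replicate cohort lands in its own output directory.
print_seed_runs() {
  disease="$1"
  shift
  for seed in "$@"; do
    printf 'nextflow run synthea_module_generator.nf --disease_name "%s" --generate_patients true --seed %s --output_dir output/synthetic_patients/seed_%s\n' \
      "$disease" "$seed" "$seed"
  done
}

# Example: review the printed commands before executing them.
# print_seed_runs "Diabetes" 1 2 3
```

Printing the commands first keeps the step reviewable; whether runs with the same seed are bit-for-bit identical depends on the pipeline and Synthea versions.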
### 3. Analyzing Patient Data
To generate patients and analyze the resulting data:
```bash
# Generate patients and produce an HTML analysis report
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --analyze_output true \
  --report_format html
```
## Available Parameters
### Module Generation Parameters
- `--disease_name`: Name of the disease(s) to generate (comma-separated for multiple)
- `--modules_dir`: Directory for modules (default: `src/main/resources/modules`)
- `--batch_size`: Number of modules to generate per batch (default: 1)
- `--max_cost`: Maximum cost for API calls (default: 5.0 USD)
- `--timeout`: Maximum time per batch in seconds (default: 300)
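Together, `--batch_size` and `--timeout` bound how long module generation may take. The sketch below works through that arithmetic under the assumption that batches run one after another (the guide only states that the timeout applies per batch). The function names are hypothetical.

```bash
# Number of batches needed for N diseases: ceiling division.
batch_count() {
  n="$1"; size="$2"
  echo $(( (n + size - 1) / size ))
}

# Upper bound on generation time in seconds, assuming batches run
# sequentially and each may use the full --timeout.
worst_case_seconds() {
  echo $(( $(batch_count "$1" "$2") * $3 ))
}

batch_count 3 1              # prints 3
worst_case_seconds 3 1 300   # prints 900
```

With the defaults, three diseases would therefore need at most 900 seconds under these assumptions.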
### Patient Generation Parameters
#### Basic Patient Parameters
- `--generate_patients`: Set to `true` to generate patients (default: `false`)
- `--population`: Number of patients to generate (default: 100)
- `--gender`: Gender distribution: `M` (all male), `F` (all female), or a decimal giving the fraction of female patients (e.g., `0.7` = 70% female)
- `--min_age`: Minimum patient age (default: 0)
- `--max_age`: Maximum patient age (default: 100)
- `--seed`: Random seed for reproducibility (default: 12345)
- `--location`: Location for patients (default: random)
- `--output_dir`: Output directory for patients (default: `output/synthetic_patients`)
#### Enhanced Demographic Parameters
- `--race_ethnicity`: Comma-separated list of race/ethnicity groups with their proportions as fractions (e.g., `white=0.7,hispanic=0.15,black=0.15`)
- `--socioeconomic`: Socioeconomic status distribution as fractions (e.g., `high=0.2,middle=0.5,low=0.3`)
- `--zip_codes`: Comma-separated list of ZIP codes across which patients are distributed (e.g., `02138,02139,02140`)
#### Disease Prevalence Parameters
- `--prevalence`: Fraction of the population with the disease, between 0.0 and 1.0 (e.g., `0.05` for 5%)
- `--comorbidities`: Whether to include common comorbidities (set to `true` or `false`, default: `false`)
### Analysis Parameters
- `--analyze_output`: Whether to run analysis on the output (default: `false`)
- `--report_format`: Format for the analysis report (`html`, `csv`, `json`) (default: `html`)
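Long command lines can be replaced by a parameters file, since Nextflow's `-params-file` option reads pipeline parameters from a YAML or JSON file. The sketch below writes such a file. `write_params_file` and the file name `params.yaml` are hypothetical, the keys mirror the options listed above, and it assumes the pipeline treats values from the file the same way as command-line values.

```bash
# Hypothetical helper: write the documented options to a YAML params file.
write_params_file() {
  cat > "$1" <<'EOF'
disease_name: "Diabetes"
generate_patients: true
population: 1000
gender: 0.6
min_age: 40
max_age: 80
seed: 12345
location: "Massachusetts"
EOF
}

# Example:
# write_params_file params.yaml
# nextflow run synthea_module_generator.nf -params-file params.yaml
```

Keeping the file under version control records exactly which settings produced a given cohort.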
## Detailed Configuration Guide
### Controlling Demographics
You can precisely control the demographic distribution of your patient population:
#### Gender Distribution
```bash
# Generate all male patients
nextflow run synthea_module_generator.nf --disease_name "Prostate Cancer" --generate_patients true --gender M
# Generate all female patients
nextflow run synthea_module_generator.nf --disease_name "Ovarian Cancer" --generate_patients true --gender F
# Generate 60% female, 40% male
nextflow run synthea_module_generator.nf --disease_name "Diabetes" --generate_patients true --gender 0.6
```
#### Age Distribution
```bash
# Generate pediatric patients (0-18 years)
nextflow run synthea_module_generator.nf --disease_name "Asthma" --generate_patients true --min_age 0 --max_age 18
# Generate elderly patients (65-90 years)
nextflow run synthea_module_generator.nf --disease_name "Parkinsons" --generate_patients true --min_age 65 --max_age 90
```
#### Race and Ethnicity
```bash
# Generate patients with specific racial distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --race_ethnicity "white=0.6,black=0.2,hispanic=0.15,asian=0.05"
```
#### Socioeconomic Status
```bash
# Generate patients with specific socioeconomic distribution
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --socioeconomic "high=0.2,middle=0.5,low=0.3"
```
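The distribution strings in the examples above use fractions that add up to 1.0. The guide does not say whether the pipeline enforces this, so the check below is only a sketch of a pre-run sanity test. `check_distribution` is a hypothetical helper, and the tolerance of `1e-6` is an arbitrary choice.

```bash
# Hypothetical check for strings such as "high=0.2,middle=0.5,low=0.3".
# Exit status: 0 if the fractions sum to 1.0 (within 1e-6),
#              1 if they do not, 2 if an entry is malformed.
check_distribution() {
  awk -v spec="$1" 'BEGIN {
    n = split(spec, parts, ",")
    total = 0
    for (i = 1; i <= n; i++) {
      # Each entry must look like name=number.
      if (split(parts[i], kv, "=") != 2 || kv[2] !~ /^[0-9]*\.?[0-9]+$/) exit 2
      total += kv[2]
    }
    diff = total - 1.0
    if (diff < 0) diff = -diff
    exit (diff < 1e-6 ? 0 : 1)
  }'
}

check_distribution "high=0.2,middle=0.5,low=0.3" && echo "distribution sums to 1.0"
```

The same function works for `--race_ethnicity` and `--socioeconomic` values because both use the `name=fraction` format.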
### Disease Prevalence Simulation
You can control the prevalence of diseases in your synthetic population:
```bash
# Generate 1000 patients with 8% diabetes prevalence (an approximate figure for the U.S. population)
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08
# Generate patients with comorbidities
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension" \
  --generate_patients true \
  --prevalence 0.3 \
  --comorbidities true
```
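As a rough guide to what a prevalence setting implies, the expected number of affected patients is the population size multiplied by the prevalence. The sketch below assumes `--prevalence` scales `--population` directly. A stochastic simulation will usually land near, not exactly on, this figure, and `expected_cases` is a hypothetical helper.

```bash
# Expected number of affected patients: population x prevalence,
# rounded to the nearest whole patient.
expected_cases() {
  awk -v pop="$1" -v prev="$2" 'BEGIN { printf "%.0f\n", pop * prev }'
}

expected_cases 1000 0.08   # prints 80
expected_cases 1000 0.3    # prints 300
```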
### Analysis Reports
You can generate analysis reports in various formats:
```bash
# Generate HTML report (default)
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --analyze_output true
# Generate CSV reports
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --analyze_output true \
  --report_format csv
```
## Example Scenarios
### Realistic Diabetes Population
Generate a U.S.-like diabetes population with an explicit race/ethnicity mix and an adult age range:
```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Diabetes" \
  --generate_patients true \
  --population 1000 \
  --prevalence 0.08 \
  --race_ethnicity "white=0.6,black=0.13,hispanic=0.18,asian=0.06,native=0.03" \
  --min_age 18 \
  --max_age 90 \
  --analyze_output true
```
### Pediatric Asthma Study Cohort
Generate a pediatric asthma cohort for a simulated clinical study:
```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Asthma" \
  --generate_patients true \
  --population 500 \
  --min_age 5 \
  --max_age 17 \
  --gender 0.5 \
  --prevalence 0.08 \
  --analyze_output true \
  --report_format html
```
### Multi-Disease Elderly Population
Generate an elderly population with multiple chronic conditions:
```bash
nextflow run synthea_module_generator.nf \
  --disease_name "Hypertension,Arthritis,COPD" \
  --generate_patients true \
  --population 1000 \
  --min_age 65 \
  --max_age 90 \
  --comorbidities true \
  --analyze_output true
```
## Analysis Report Details
The analysis report includes:
1. **Patient Demographics**
   - Gender distribution
   - Age distribution (by age groups)
   - Race/ethnicity distribution
2. **Disease Statistics**
   - Top 10 conditions in the patient population
   - Top 10 medications prescribed
3. **Summary Statistics**
   - Total number of patients
   - Age ranges (min, max, average)
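To make the summary statistics concrete, the sketch below shows how a patient count and the minimum, maximum, and average age could be derived from a list of ages. It is not the pipeline's report code: `age_summary` is a hypothetical helper, and the one-age-per-line input format is an assumption.

```bash
# Read one age per line on stdin and print count, min, max and mean.
age_summary() {
  awk '
    {
      age = $1 + 0                          # force numeric comparison
      if (NR == 1 || age < min) min = age
      if (NR == 1 || age > max) max = age
      sum += age
    }
    END {
      if (NR == 0) exit 1                   # no input: nothing to summarise
      printf "count=%d min=%d max=%d avg=%.1f\n", NR, min, max, sum / NR
    }'
}

printf '5\n11\n17\n' | age_summary   # prints count=3 min=5 max=17 avg=11.0
```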
## Troubleshooting
### Compatibility Issues
If you encounter Java compatibility issues, ensure you're using Java 8 or 11, which are the most compatible with Synthea:
```bash
# Set JAVA_HOME to Java 8 before running
export JAVA_HOME=/path/to/java8
```
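To see which Java version is actually active, the version string can be reduced to its major number. Java 8 reports itself with the legacy `1.8.x` scheme, while later releases report `11.x`, `17.x`, and so on. The helpers below are a sketch with hypothetical names, and `current_java_version` assumes `java` is on your `PATH`.

```bash
# Extract the quoted version string from `java -version`, which prints
# to stderr a line such as: openjdk version "11.0.20" 2023-07-18
current_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Reduce a version string to its major number:
# "1.8.0_292" -> 8 (legacy 1.x scheme), "11.0.20" -> 11.
java_major_version() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;
  esac
  printf '%s\n' "${v%%[._-]*}"
}

# Example: java_major_version "$(current_java_version)"
```

A result other than 8 or 11 suggests that `JAVA_HOME` or `PATH` points at a different installation.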
### Debugging Module Generation
If module generation fails:
1. Check the `.error` file in the modules directory
2. Verify your API key is set correctly
3. Try generating a simpler disease first
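The first step above can be scripted. The sketch below assumes that failures are recorded in files whose names end in `.error` (the exact naming is not specified here) and that the default `--modules_dir` is in use. `show_module_errors` is a hypothetical helper.

```bash
# Hypothetical helper: print each error file under a modules directory,
# followed by its contents. Assumes error files end in ".error".
show_module_errors() {
  dir="${1:-src/main/resources/modules}"
  find "$dir" -type f -name '*.error' | sort | while IFS= read -r f; do
    printf '== %s ==\n' "$f"
    cat "$f"
  done
}

# Example: show_module_errors                # default modules directory
#          show_module_errors path/to/modules
```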
### Patient Generation Issues
If patient generation fails:
1. Check that Synthea is properly installed
2. Verify the modules exist in the modules directory
3. Check that parameter values are within valid ranges
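The range check in the last step can be done before launching a run. The sketch below encodes limits inferred from the parameter descriptions in this guide: a positive population, `min_age` no greater than `max_age`, `--gender` as `M`, `F`, or a fraction, and `--prevalence` between 0.0 and 1.0. The pipeline itself may enforce different rules, and `validate_params` is a hypothetical helper.

```bash
# Hypothetical pre-run check.
# Usage: validate_params POPULATION MIN_AGE MAX_AGE GENDER PREVALENCE
# Prints one message per problem; returns non-zero if any was found.
validate_params() {
  awk -v pop="$1" -v min="$2" -v max="$3" -v gender="$4" -v prev="$5" '
    function is_int(x)  { return x ~ /^[0-9]+$/ }
    function is_frac(x) { return x ~ /^[0-9]*\.?[0-9]+$/ && x + 0 <= 1 }
    BEGIN {
      bad = 0
      if (!is_int(pop) || pop + 0 < 1) {
        print "population must be a positive integer"; bad = 1
      }
      if (!is_int(min) || !is_int(max) || min + 0 > max + 0) {
        print "need integer ages with min_age <= max_age"; bad = 1
      }
      if (gender != "M" && gender != "F" && !is_frac(gender)) {
        print "gender must be M, F, or a fraction in [0, 1]"; bad = 1
      }
      if (!is_frac(prev)) {
        print "prevalence must be a fraction in [0, 1]"; bad = 1
      }
      exit bad
    }'
}

validate_params 1000 40 80 0.6 0.08 && echo "parameters look valid"
```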