# Synthea All Diseases
A comprehensive pipeline for generating Synthea modules and synthetic patient data for any disease.
## Overview
This pipeline leverages Nextflow to orchestrate the generation of disease modules and synthetic patient data using Synthea. It supports:
1. Automatic generation of disease modules using Claude AI
2. Synthetic patient generation with configurable parameters using the actual Synthea engine
3. Analysis of generated patient data
## Requirements
- Docker
- Docker Compose
- Nextflow (version 20.10.0 or higher)
- Java (required by Nextflow)
- Python 3.6+ (if running scripts directly)
## Quick Start
The easiest way to get started is to use our convenience scripts:
```bash
# Set up the environment (builds Docker containers and prepares directories)
./scripts/prepare_environment.sh
# Run the pipeline for a specific disease
./scripts/run_pipeline.sh --disease "Parkinson's Disease" --patients --population 50
```
## Manual Setup
1. Clone this repository:
```bash
git clone https://github.com/yourusername/synthea-alldiseases.git
cd synthea-alldiseases
```
2. Create a `.env` file with your API keys (or copy from `.env.example`):
```bash
cp .env.example .env
# Edit .env with your preferred text editor
```
3. Build and start the Docker containers:
```bash
docker-compose build
docker-compose up -d synthea
```
## Usage
### Basic Command
```bash
nextflow run main.nf --disease_name "Disease Name" [options]
```
### Examples
Generate a module for Hypertension and create 100 patients, 60% of them female:
```bash
nextflow run main.nf --disease_name "Hypertension" --generate_patients true --population 100 --gender 0.6
```
Generate a module for Parkinson's Disease, create 50 patients, and analyze the data:
```bash
nextflow run main.nf --disease_name "Parkinson's Disease" --generate_patients true --population 50 --analyze_patient_data true
```
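
Disease names that contain apostrophes or spaces, such as "Parkinson's Disease", are easy to mis-quote in a shell. As a minimal sketch, the second command above can be assembled as an argument list in Python, which avoids shell quoting altogether. The helper `build_nextflow_command` is hypothetical and not part of this repository; it uses only flags documented in this README.

```python
import shlex


def build_nextflow_command(disease_name, generate_patients=False,
                           population=None, analyze=False):
    """Build the argument list for `nextflow run main.nf` (illustrative helper)."""
    cmd = ["nextflow", "run", "main.nf", "--disease_name", disease_name]
    if generate_patients:
        cmd += ["--generate_patients", "true"]
    if population is not None:
        cmd += ["--population", str(population)]
    if analyze:
        cmd += ["--analyze_patient_data", "true"]
    return cmd


cmd = build_nextflow_command("Parkinson's Disease", generate_patients=True,
                             population=50, analyze=True)

# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually launch the run, pass the list directly (no shell involved):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` keeps the disease name as a single argument regardless of the characters it contains.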
### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--disease_name` | Name of the disease to model | (required) |
| `--modules_dir` | Directory for modules | `modules` |
| `--output_dir` | Directory for output files | `output` |
| `--generate_patients` | Generate patient data | `false` |
| `--population` | Number of patients to generate | `100` |
| `--gender` | Fraction of generated patients that are female (`0` to `1`) | `0.5` |
| `--min_age` | Minimum patient age | `0` |
| `--max_age` | Maximum patient age | `90` |
| `--seed` | Random seed for reproducibility | (random) |
| `--analyze_patient_data` | Analyze generated data | `false` |
| `--report_format` | Format for analysis report | `html` |
| `--force_generate` | Force regeneration of modules | `false` |
| `--publish_dir` | Directory for published output | `published_output` |
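
How the pipeline itself reacts to out-of-range values is not documented here. The following sketch shows one way to sanity-check the documented parameters before launching a run. `validate_params` is an illustrative name, and the specific rules (for example, requiring at least one patient) are assumptions rather than the pipeline's known behavior.

```python
def validate_params(disease_name, population=100, gender=0.5,
                    min_age=0, max_age=90):
    """Return a list of problems; an empty list means the values look sane.

    Defaults mirror the parameter table. The rules are assumptions made for
    this sketch, not checks the pipeline is known to perform.
    """
    problems = []
    if not disease_name or not disease_name.strip():
        problems.append("--disease_name is required")
    if population < 1:
        problems.append("--population must be at least 1")
    if not 0.0 <= gender <= 1.0:
        problems.append("--gender must be between 0 and 1")
    if min_age < 0:
        problems.append("--min_age must not be negative")
    if min_age > max_age:
        problems.append("--min_age must not exceed --max_age")
    return problems


if __name__ == "__main__":
    # A gender value given as a percentage-like number is flagged.
    for problem in validate_params("Hypertension", population=100, gender=1.6):
        print(problem)
```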
## Understanding the Data Flow
1. **Module Generation**: The pipeline first looks for an existing module for the specified disease. If none is found, or if `--force_generate` is set, it generates one using the scripts in `module_generator/`.
2. **Patient Generation**: If requested, the pipeline uses the actual Synthea engine to generate synthetic patient data based on the disease module.
3. **Analysis**: If requested, the pipeline analyzes the generated patient data and produces reports.
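
The decision logic described above can be sketched roughly as follows. This is an illustration, not the pipeline's actual Nextflow code: the function names and the module file naming rule (lowercase with underscores and a `.json` extension) are assumptions.

```python
import re
from pathlib import Path


def module_filename(disease_name):
    """Assumed naming rule: lowercase, runs of non-alphanumerics become '_'."""
    slug = re.sub(r"[^a-z0-9]+", "_", disease_name.lower()).strip("_")
    return slug + ".json"


def module_exists(disease_name, modules_dir="modules"):
    """Check whether a module file for this disease is already present."""
    return (Path(modules_dir) / module_filename(disease_name)).is_file()


def plan_steps(has_module, generate_patients=False,
               analyze_patient_data=False, force_generate=False):
    """Return the ordered list of stages a run would execute."""
    if force_generate or not has_module:
        steps = ["generate_module"]
    else:
        steps = ["reuse_module"]
    if generate_patients:
        steps.append("generate_patients")
    if analyze_patient_data:
        steps.append("analyze_patient_data")
    return steps


if __name__ == "__main__":
    found = module_exists("Parkinson's Disease")
    print(plan_steps(found, generate_patients=True, analyze_patient_data=True))
```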
## Directory Structure
- `modules/`: Contains generated disease modules
- `module_generator/`: Contains the AI-powered module generation scripts
- `scripts/`: Utility scripts for the pipeline
- `output/`: Generated patient data (temporary)
- `published_output/`: Final output data that persists between runs
  - `published_output/modules/`: Contains the generated modules
  - `published_output/{disease_name}/`: Contains patient data for each disease
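
As a small sketch of how this layout might be navigated from Python, the helpers below resolve the published locations for one disease and list whatever files are present. Both function names are hypothetical, and the sketch assumes the `{disease_name}` directory uses the disease name exactly as passed on the command line.

```python
from pathlib import Path


def published_paths(disease_name, publish_dir="published_output"):
    """Return the documented output locations for one disease.

    Assumption: the {disease_name} directory uses the name exactly as given.
    """
    root = Path(publish_dir)
    return {"modules": root / "modules", "patients": root / disease_name}


def list_files(directory):
    """Return sorted file names in a directory, or [] if it does not exist."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    for label, path in published_paths("Hypertension").items():
        print(label, path, list_files(path))
```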
## Convenience Scripts
- `scripts/prepare_environment.sh`: Sets up the environment and starts containers
- `scripts/run_pipeline.sh`: Simplified interface for running the pipeline
- `scripts/analyze_patient_data.py`: Analyzes generated patient data
- `scripts/check_condition_structure.py`: Validates module JSON structure
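
For a sense of what a structural check on a module might involve, here is a minimal sketch. It is not the code in `scripts/check_condition_structure.py`, whose checks may differ. It tests only a few properties of Synthea's generic module format: a top-level `name`, a `states` object containing an `Initial` state, a `type` on every state, and `direct_transition` targets that exist.

```python
import json


def check_module(module):
    """Return a list of structural problems found in a parsed module dict."""
    problems = []
    if not isinstance(module.get("name"), str) or not module["name"]:
        problems.append("missing top-level 'name'")
    states = module.get("states")
    if not isinstance(states, dict) or not states:
        problems.append("missing or empty 'states' object")
        return problems
    if "Initial" not in states:
        problems.append("no 'Initial' state")
    for state_name, state in states.items():
        if not isinstance(state, dict) or "type" not in state:
            problems.append(f"state '{state_name}' has no 'type'")
            continue
        # Only the simplest transition form is checked in this sketch.
        target = state.get("direct_transition")
        if target is not None and target not in states:
            problems.append(
                f"state '{state_name}' transitions to unknown state '{target}'"
            )
    return problems


def check_module_file(path):
    """Load a module JSON file and run the structural checks on it."""
    with open(path, encoding="utf-8") as handle:
        return check_module(json.load(handle))
```

Other transition types (distributed, conditional, complex) are not inspected here, so an empty result does not guarantee that Synthea will accept the module.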
## Troubleshooting
If you encounter issues:
1. Check that Docker containers are running:
```bash
docker ps | grep synthea
```
2. Ensure your modules directory has the required modules:
```bash
ls -la modules/
```
3. Check logs for detailed error messages:
```bash
tail -f .nextflow.log
```
4. Try rebuilding the Docker containers:
```bash
docker-compose down
docker-compose build
docker-compose up -d synthea
```
5. If module generation fails, check that your API keys are correctly set in the `.env` file.
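
The last check can be partly automated. The sketch below parses a `.env` file and reports required keys that are absent or empty. The variable name `ANTHROPIC_API_KEY` is an assumption based on the pipeline's use of Claude; consult `.env.example` for the names this project actually expects.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_text, required):
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env(env_text)
    return [key for key in required if not values.get(key)]


# Assumed variable name; check .env.example for the real one.
REQUIRED = ["ANTHROPIC_API_KEY"]

if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text() if env_file.exists() else ""
    for key in missing_keys(text, REQUIRED):
        print(f"missing or empty: {key}")
```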
## License
This project uses the same license as Synthea.