# Synthea All Diseases

A comprehensive pipeline for generating Synthea modules and synthetic patient data for any disease.

## Overview

This pipeline uses Nextflow to orchestrate the generation of disease modules and synthetic patient data with Synthea. It supports:

1. Automatic generation of disease modules using Claude AI
2. Synthetic patient generation with configurable parameters using the actual Synthea engine
3. Analysis of generated patient data

## Requirements

- Docker
- Docker Compose
- Nextflow (version 20.10.0 or higher)
- Java (required by Nextflow)
- Python 3.6+ (if running scripts directly)

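A quick way to confirm that the tools listed above are installed is to look each one up on the `PATH`. The helper below is a minimal sketch and not part of this repository: `missing_tools` is a hypothetical name, and the command names (`docker`, `docker-compose`, `nextflow`, `java`, `python3`) are assumed to match how the tools are installed on your system. It does not check versions.

```bash
# Hypothetical helper (not part of this repository): print every
# command from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    # `command -v` exits non-zero when the command is unavailable.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which prerequisites still need to be installed.
# No output means every command was found.
missing_tools docker docker-compose nextflow java python3
```
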
## Quick Start

The easiest way to get started is to use the convenience scripts:

```bash
# Set up the environment (builds Docker containers and prepares directories)
./scripts/prepare_environment.sh

# Run the pipeline for a specific disease
./scripts/run_pipeline.sh --disease "Parkinson's Disease" --patients --population 50
```

## Manual Setup

1. Clone this repository:
```bash
git clone https://github.com/yourusername/synthea-alldiseases.git
cd synthea-alldiseases
```

2. Create a `.env` file with your API keys (or copy from `.env.example`):
```bash
cp .env.example .env
# Edit .env with your preferred text editor
```

3. Build and start the Docker containers:
```bash
docker-compose build
docker-compose up -d synthea
```

## Usage

### Basic Command

```bash
nextflow run main.nf --disease_name "Disease Name" [options]
```

### Examples

Generate a module for Hypertension and create 100 patients, 60% of them female:
```bash
nextflow run main.nf --disease_name "Hypertension" --generate_patients true --population 100 --gender 0.6
```

Generate a module for Parkinson's Disease, create 50 patients, and analyze the data:
```bash
nextflow run main.nf --disease_name "Parkinson's Disease" --generate_patients true --population 50 --analyze_patient_data true
```

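The age and seed options can be combined with the flags above in the same way. The sketch below wraps the call in a small shell function so the same cohort can be requested again later. `run_cohort` and the `DRY_RUN` variable are hypothetical names introduced here, the age range is an arbitrary illustration, and it assumes that a fixed `--seed` makes the output repeatable, as the option's description suggests.

```bash
# Hypothetical wrapper (not part of this repository).
# Usage: run_cohort "<disease name>" <population> <seed>
# Set DRY_RUN=1 to print the command instead of executing it.
run_cohort() {
  disease="$1"; population="$2"; seed="$3"
  # Rebuild the positional parameters as the full nextflow command line.
  set -- nextflow run main.nf \
    --disease_name "$disease" \
    --generate_patients true \
    --population "$population" \
    --min_age 40 --max_age 90 \
    --seed "$seed"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # show the command without running it
  else
    "$@"
  fi
}

# Preview the command for 200 patients aged 40 to 90 with seed 12345.
DRY_RUN=1 run_cohort "Hypertension" 200 12345
```

Dropping `DRY_RUN=1` runs the pipeline for real, which requires the containers from the setup steps to be up.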
### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--disease_name` | Name of the disease to model | (required) |
| `--modules_dir` | Directory for modules | `modules` |
| `--output_dir` | Directory for output files | `output` |
| `--generate_patients` | Generate patient data | `false` |
| `--population` | Number of patients to generate | `100` |
| `--gender` | Fraction of female patients (0 to 1) | `0.5` |
| `--min_age` | Minimum patient age | `0` |
| `--max_age` | Maximum patient age | `90` |
| `--seed` | Random seed for reproducibility | (random) |
| `--analyze_patient_data` | Analyze generated data | `false` |
| `--report_format` | Format for analysis report | `html` |
| `--force_generate` | Force regeneration of modules | `false` |
| `--publish_dir` | Directory for published output | `published_output` |

## Understanding the Data Flow

1. **Module Generation**: The pipeline first looks for an existing module for the specified disease. If none is found, it generates one using the scripts in `module_generator/`.
2. **Patient Generation**: If requested (`--generate_patients true`), the pipeline uses the actual Synthea engine to generate synthetic patient data based on the disease module.
3. **Analysis**: If requested (`--analyze_patient_data true`), the pipeline analyzes the generated patient data and produces reports.

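The lookup in the first step can be approximated by hand to predict whether a run will trigger module generation. The sketch below is an illustration only: `module_slug` and `module_exists` are hypothetical helpers, and it assumes modules are stored as `modules/<slug>.json`, where the slug is the lowercased disease name with apostrophes removed and spaces replaced by underscores. The naming scheme the pipeline actually uses may differ.

```bash
# Hypothetical helper: turn a disease name into a file-name-friendly slug.
# ASSUMPTION: modules are saved as modules/<slug>.json; the real naming
# scheme used by the pipeline may differ.
module_slug() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -d "'" \
    | tr -s ' ' '_'
}

# Succeed if a module file exists for the disease.
# Usage: module_exists "<disease name>" [modules directory]
module_exists() {
  [ -f "${2:-modules}/$(module_slug "$1").json" ]
}

if module_exists "Parkinson's Disease"; then
  echo "module found: generation would be skipped"
else
  echo "no module found: the module_generator would be invoked"
fi
```
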
## Directory Structure

- `modules/`: Contains generated disease modules
- `module_generator/`: Contains the AI-powered module generation scripts
- `scripts/`: Utility scripts for the pipeline
- `output/`: Generated patient data (temporary)
- `published_output/`: Final output data that persists between runs
  - `published_output/modules/`: Contains the generated modules
  - `published_output/{disease_name}/`: Contains patient data for each disease

## Convenience Scripts

- `scripts/prepare_environment.sh`: Sets up the environment and starts containers
- `scripts/run_pipeline.sh`: Simplified interface for running the pipeline
- `scripts/analyze_patient_data.py`: Analyzes generated patient data
- `scripts/check_condition_structure.py`: Validates module JSON structure

## Troubleshooting

If you encounter issues:

1. Check that Docker containers are running:
```bash
docker ps | grep synthea
```

2. Ensure your modules directory has the required modules:
```bash
ls -la modules/
```

3. Check logs for detailed error messages:
```bash
tail -f .nextflow.log
```

4. Try rebuilding the Docker containers:
```bash
docker-compose down
docker-compose build
docker-compose up -d synthea
```

5. If module generation fails, check that your API keys are correctly set in the `.env` file.

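The last check can be done without printing the key itself. The sketch below assumes the `.env` file uses plain `NAME=value` lines. `env_has_value` is a hypothetical helper, and `ANTHROPIC_API_KEY` is an assumed variable name, so substitute whatever `.env.example` actually lists.

```bash
# Hypothetical helper: succeed if FILE has a NAME=value line with a value.
# Usage: env_has_value <file> <NAME>
env_has_value() {
  # Match NAME= followed by at least one non-whitespace character.
  # Limitation: an empty quoted value such as NAME="" still counts as set.
  grep -Eq "^[[:space:]]*$2=[^[:space:]]" "$1" 2>/dev/null
}

# ANTHROPIC_API_KEY is an assumed name; use the one from .env.example.
if env_has_value .env ANTHROPIC_API_KEY; then
  echo "API key is set"
else
  echo "API key is missing or empty in .env"
fi
```
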
## License

This project uses the same license as Synthea.