# Synthea All Diseases

A comprehensive pipeline for generating Synthea modules and synthetic patient data for any disease.

## Overview

This pipeline uses Nextflow to orchestrate disease module generation and synthetic patient generation with Synthea. It supports:

1. Automatic generation of disease modules using Claude AI
2. Synthetic patient generation with configurable parameters using the actual Synthea engine
3. Analysis of generated patient data

## Requirements

- Docker
- Docker Compose
- Nextflow (version 20.10.0 or higher)
- Java (required by Nextflow)
- Python 3.6+ (if running scripts directly)

## Quick Start

The easiest way to get started is to use our convenience scripts:

```bash
# Set up the environment (builds Docker containers and prepares directories)
./scripts/prepare_environment.sh

# Run the pipeline for a specific disease
./scripts/run_pipeline.sh --disease "Parkinson's Disease" --patients --population 50
```

## Manual Setup

1. Clone this repository:
   ```bash
   git clone https://github.com/yourusername/synthea-alldiseases.git
   cd synthea-alldiseases
   ```

2. Create a `.env` file for your API keys by copying `.env.example` and editing the copy:
   ```bash
   cp .env.example .env
   # Edit .env with your preferred text editor
   ```

3. Build and start the Docker containers:
   ```bash
   docker-compose build
   docker-compose up -d synthea
   ```

## Usage

### Basic Command

```bash
nextflow run main.nf --disease_name "Disease Name" [options]
```

### Examples

Generate a module for Hypertension and create 100 patients, about 60% of them female:
```bash
nextflow run main.nf --disease_name "Hypertension" --generate_patients true --population 100 --gender 0.6
```

Generate a module for Parkinson's Disease, create 50 patients, and analyze the data:
```bash
nextflow run main.nf --disease_name "Parkinson's Disease" --generate_patients true --population 50 --analyze_patient_data true
```

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--disease_name` | Name of the disease to model | (required) |
| `--modules_dir` | Directory for modules | `modules` |
| `--output_dir` | Directory for output files | `output` |
| `--generate_patients` | Generate patient data | `false` |
| `--population` | Number of patients to generate | `100` |
| `--gender` | Fraction of patients who are female (0 to 1) | `0.5` |
| `--min_age` | Minimum patient age | `0` |
| `--max_age` | Maximum patient age | `90` |
| `--seed` | Random seed for reproducibility | (random) |
| `--analyze_patient_data` | Analyze generated data | `false` |
| `--report_format` | Format for analysis report | `html` |
| `--force_generate` | Force regeneration of modules | `false` |
| `--publish_dir` | Directory for published output | `published_output` |
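
The parameters above can also be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_nextflow_command` is a hypothetical helper, and its range checks (gender between 0 and 1, `min_age` not above `max_age`, a positive population) are assumptions about sensible input rather than rules the pipeline is known to enforce. It only uses flags listed in the table.

```python
import shlex


def build_nextflow_command(disease_name, **params):
    """Return the argv list for `nextflow run main.nf` (hypothetical helper)."""
    if not disease_name or not disease_name.strip():
        raise ValueError("--disease_name is required")

    # Assumed sanity checks; defaults mirror the parameter table.
    if not 0.0 <= params.get("gender", 0.5) <= 1.0:
        raise ValueError("--gender must be between 0 and 1")
    if params.get("min_age", 0) > params.get("max_age", 90):
        raise ValueError("--min_age must not exceed --max_age")
    if params.get("population", 100) < 1:
        raise ValueError("--population must be at least 1")

    argv = ["nextflow", "run", "main.nf", "--disease_name", disease_name]
    for name, value in params.items():
        # The README examples pass booleans as the literal strings true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += ["--" + name, str(value)]
    return argv


if __name__ == "__main__":
    cmd = build_nextflow_command(
        "Parkinson's Disease", generate_patients=True, population=50
    )
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning an argument list instead of a single string avoids quoting problems with disease names that contain spaces or apostrophes; the list can be passed directly to `subprocess.run`.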

## Understanding the Data Flow

1. **Module Generation**: The pipeline first looks for an existing module for the specified disease. If none is found, or if `--force_generate true` is set, it generates one with the scripts in `module_generator/`.
2. **Patient Generation**: If requested, the pipeline uses the actual Synthea engine to generate synthetic patient data based on the disease module.
3. **Analysis**: If requested, the pipeline analyzes the generated patient data and produces reports.
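
The three stages above can be summarized as a small decision function. This is an illustrative sketch only: `plan_pipeline` and the stage names are hypothetical, the real orchestration lives in `main.nf`, and the location of a module file is passed in by the caller because this README does not specify how module files are named.

```python
from pathlib import Path


def plan_pipeline(module_path, generate_patients=False,
                  analyze_patient_data=False, force_generate=False):
    """Return the ordered list of stages a run would go through (sketch)."""
    stages = []

    # Stage 1: reuse an existing module unless regeneration is forced.
    if force_generate or not Path(module_path).is_file():
        stages.append("generate_module")
    else:
        stages.append("reuse_module")

    # Stage 2: patient generation is opt-in (--generate_patients).
    if generate_patients:
        stages.append("generate_patients")

    # Stage 3: analysis is opt-in (--analyze_patient_data).
    if analyze_patient_data:
        stages.append("analyze_patient_data")

    return stages
```

For a module file that does not exist yet, `plan_pipeline(path, generate_patients=True)` returns `["generate_module", "generate_patients"]`.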

## Directory Structure

- `modules/`: Contains generated disease modules
- `module_generator/`: Contains the AI-powered module generation scripts
- `scripts/`: Utility scripts for the pipeline
- `output/`: Generated patient data (temporary)
- `published_output/`: Final output data that persists between runs
  - `published_output/modules/`: Contains the generated modules
  - `published_output/{disease_name}/`: Contains patient data for each disease

## Convenience Scripts

- `scripts/prepare_environment.sh`: Sets up the environment and starts containers
- `scripts/run_pipeline.sh`: Simplified interface for running the pipeline
- `scripts/analyze_patient_data.py`: Analyzes generated patient data
- `scripts/check_condition_structure.py`: Validates module JSON structure
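
To give a sense of what validating module JSON structure can involve, here is a minimal sketch. It is not the code in `scripts/check_condition_structure.py`; `check_module` is a hypothetical function that tests a few conventions of Synthea's Generic Module Framework (a top-level `name`, a `states` object, an `Initial` state of type `Initial`, and `direct_transition` targets that exist). Other transition types are not checked.

```python
import json


def check_module(module):
    """Return a list of structural problems found in a parsed module (sketch)."""
    problems = []
    if not isinstance(module.get("name"), str):
        problems.append("missing top-level 'name'")

    states = module.get("states")
    if not isinstance(states, dict) or not states:
        return problems + ["missing or empty 'states' object"]

    initial = states.get("Initial")
    if not isinstance(initial, dict) or initial.get("type") != "Initial":
        problems.append("no 'Initial' state of type 'Initial'")

    for name, state in states.items():
        if not isinstance(state, dict):
            problems.append(f"state '{name}' is not an object")
            continue
        if "type" not in state:
            problems.append(f"state '{name}' has no 'type'")
        # Only direct transitions are followed in this sketch.
        target = state.get("direct_transition")
        if target is not None and target not in states:
            problems.append(
                f"state '{name}' transitions to unknown state '{target}'"
            )
    return problems


if __name__ == "__main__":
    example = json.loads("""
    {
      "name": "Example",
      "states": {
        "Initial": {"type": "Initial", "direct_transition": "Done"},
        "Done": {"type": "Terminal"}
      }
    }
    """)
    print(check_module(example))  # prints []
```

Collecting every problem in a list, instead of raising on the first one, makes it easier to review an AI-generated module in a single pass.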

## Troubleshooting

If you encounter issues:

1. Check that Docker containers are running:
   ```bash
   docker ps | grep synthea
   ```

2. Ensure your modules directory has the required modules:
   ```bash
   ls -la modules/
   ```

3. Check logs for detailed error messages:
   ```bash
   tail -f .nextflow.log
   ```

4. Try rebuilding the Docker containers:
   ```bash
   docker-compose down
   docker-compose build
   docker-compose up -d synthea
   ```

5. If module generation fails, check that your API keys are correctly set in the `.env` file.
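
The API key check in step 5 can be scripted. The sketch below assumes a plain `KEY=VALUE` format for `.env`, and the key name `ANTHROPIC_API_KEY` is a guess used for illustration; consult `.env.example` for the names this project actually expects. Both `read_env` and `missing_keys` are hypothetical helpers.

```python
from pathlib import Path


def read_env(text):
    """Parse KEY=VALUE lines from .env-style text, skipping comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text, required):
    """Return the required keys that are absent or empty."""
    values = read_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text() if env_file.is_file() else ""
    # ANTHROPIC_API_KEY is an assumed name, not confirmed by this README.
    print(missing_keys(text, ["ANTHROPIC_API_KEY"]))
```

An empty list means every required key has a non-empty value; it does not confirm that the key itself is valid.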

## License

This project uses the same license as Synthea (Apache License 2.0).