
# Synthea Module Generator

This tool automates the creation of Synthea disease modules from the entries in `disease_list.json`. For each disease it asks Claude 3.7 Sonnet to generate a JSON module, using the most closely related existing module as a template.

## Prerequisites

1. Python 3.6+
2. Required Python packages:
   ```bash
   pip install anthropic tqdm
   ```
3. Anthropic API key:
   ```bash
   export ANTHROPIC_API_KEY=your_api_key
   ```

## Usage

```bash
# Generate 10 modules (default limit)
python module_generator.py

# Generate modules only for specific ICD-10 code categories
python module_generator.py --diseases I20,I21,I22

# Generate up to 50 modules
python module_generator.py --limit 50

# Prioritize high-prevalence diseases (recommended)
python module_generator.py --prioritize

# Combine options: up to 100 prioritized modules from code prefixes I, J, and K
python module_generator.py --diseases I,J,K --limit 100 --prioritize
```
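
The flags above could be parsed along the following lines. This is a minimal sketch, not the script's actual code: the function name `build_parser` and the help strings are assumptions, while the flag names and the default limit of 10 come from the examples above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Generate Synthea disease modules")
    parser.add_argument(
        "--diseases",
        # Split "I20,I21,I22" into ["I20", "I21", "I22"], dropping empty entries.
        type=lambda s: [code.strip() for code in s.split(",") if code.strip()],
        default=None,
        help="Comma-separated ICD-10 codes or code prefixes to include",
    )
    parser.add_argument("--limit", type=int, default=10,
                        help="Maximum number of modules to generate")
    parser.add_argument("--prioritize", action="store_true",
                        help="Process high-prevalence diseases first")
    return parser


# Equivalent to: python module_generator.py --diseases I,J,K --limit 100 --prioritize
args = build_parser().parse_args(["--diseases", "I,J,K", "--limit", "100", "--prioritize"])
print(args.diseases, args.limit, args.prioritize)  # ['I', 'J', 'K'] 100 True
```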

## How It Works

1. The script loads the complete disease list from `disease_list.json`
2. It filters out diseases that already have modules
3. If `--prioritize` is enabled, it:
   - Estimates the prevalence of each disease using a heuristic scoring system
   - Prioritizes diseases based on common conditions, ICD-10 chapter, and name specificity
   - Selects the highest-scoring diseases first
4. For each selected disease:
   - Finds the most relevant existing module as a template (based on ICD-10 code)
   - Sends a prompt to Claude with the disease details and template
   - Validates the generated JSON
   - Saves the new module to the appropriate location
   - Updates the progress tracking file
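
The selection, template-matching, and validation steps above can be sketched as follows. Everything here is illustrative rather than the script's real implementation: the `name` and `icd10` keys for entries of `disease_list.json`, the term and chapter weights, and all function names are assumptions. The validation only checks for the top-level `name` and `states` keys that Synthea modules carry.

```python
import json

# Hypothetical inputs to the heuristic; the real script's weights may differ.
COMMON_TERMS = {"hypertension", "diabetes", "asthma", "pneumonia"}
CHAPTER_WEIGHTS = {"I": 3, "J": 3, "E": 2, "K": 2}  # keyed by first code letter


def prevalence_score(disease: dict) -> int:
    """Score a disease by chapter weight, common-condition match, and specificity."""
    name = disease["name"].lower()
    score = CHAPTER_WEIGHTS.get(disease["icd10"][:1], 1)
    if any(term in name for term in COMMON_TERMS):
        score += 5
    if "unspecified" in name:  # vague names rank lower
        score -= 2
    return score


def select_diseases(diseases, existing_codes, limit, prioritize):
    """Drop diseases that already have modules, then take up to `limit`."""
    pending = [d for d in diseases if d["icd10"] not in existing_codes]
    if prioritize:
        pending.sort(key=prevalence_score, reverse=True)
    return pending[:limit]


def find_template(icd10, existing_modules):
    """Return the module whose ICD-10 code shares the longest prefix with `icd10`."""
    def shared_prefix(code):
        length = 0
        for a, b in zip(code, icd10):
            if a != b:
                break
            length += 1
        return length

    best = max(existing_modules, key=shared_prefix, default=None)
    if best is None or shared_prefix(best) == 0:
        return None
    return existing_modules[best]


def validate_module(text):
    """Parse generated text; return the module dict, or None if it is unusable."""
    try:
        module = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(module, dict) or "name" not in module or "states" not in module:
        return None
    return module
```

Keeping each step as a separate function makes it possible to test the heuristic and the validation without making any API calls.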

## Configuration

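- `CLAUDE_MODEL`: The Claude model to use (default: `claude-3-7-sonnet-20240229`). Anthropic's published identifier for Claude 3.7 Sonnet is `claude-3-7-sonnet-20250219`, so confirm that the value in the script names a model your API key can access.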
- `SYNTHEA_ROOT`: Path to the Synthea root directory (auto-detected)

## Cost Estimation

The script uses Claude 3.7 Sonnet, which costs approximately:
- Input: $3 per million tokens
- Output: $15 per million tokens

A typical generation will use:
- ~10K input tokens (template + prompt)
- ~5K output tokens (generated module)

At this rate, generating 1,000 modules would cost approximately:
- Input: 10M tokens = $30
- Output: 5M tokens = $75
- Total: ~$105
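
The same arithmetic can be reproduced for other batch sizes with a small helper. This is a sketch using the approximate prices and token counts quoted above; the function and constant names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (approximate)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (approximate)


def estimate_cost(n_modules, input_tokens=10_000, output_tokens=5_000):
    """Estimate the total USD cost of generating `n_modules` modules."""
    input_cost = n_modules * input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = n_modules * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


print(estimate_cost(1000))  # 105.0
```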

## Logging

The script logs all activity to both the console and to `module_generation.log` in the current directory.
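
A setup that writes to both destinations might look like the following sketch. The logger name, message format, and `configure_logging` helper are assumptions; only the log file name comes from this README.

```python
import logging
import sys


def configure_logging(log_path="module_generation.log"):
    """Attach a console handler and a file handler to one logger."""
    logger = logging.getLogger("module_generator")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(sys.stdout),
                        logging.FileHandler(log_path)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

With this setup, a call such as `configure_logging().info("Generated module")` prints the message and appends it to the log file in the current directory.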

## Notes

- The script includes a 1-second delay between API calls to avoid rate limits
- Generated modules should be manually reviewed for quality and accuracy
- Consider running the script incrementally (e.g., one disease category at a time) so that you can review each batch before generating more
- The script optimizes API usage by:
  - Checking if a module already exists before generating (by filename or ICD-10 code)
  - Only using Claude when a new module genuinely needs to be created
  - Prioritizing high-prevalence diseases when using the `--prioritize` flag
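
The skip-if-exists check and the delay between calls can be sketched as a simple loop. This is illustrative only: the `slug` key, the `generate` callable (a stand-in for the Claude request), and the function name are assumptions, and the sketch checks by filename only, not by ICD-10 code.

```python
import time
from pathlib import Path


def generate_pending(diseases, modules_dir, generate,
                     delay_seconds=1.0, sleep=time.sleep):
    """Write a module for each disease that does not already have a file."""
    modules_dir = Path(modules_dir)
    created = []
    for disease in diseases:
        target = modules_dir / f"{disease['slug']}.json"
        if target.exists():
            continue  # module already present, so no API call is made
        target.write_text(generate(disease), encoding="utf-8")
        created.append(target)
        sleep(delay_seconds)  # pause between API calls to stay under rate limits
    return created
```

Passing `sleep` as a parameter lets tests replace the delay with a no-op.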