83 lines
2.7 KiB
Markdown
83 lines
2.7 KiB
Markdown
# Synthea Module Generator
|
|
|
|
This tool automates the creation of disease modules for Synthea based on `disease_list.json`. It uses Claude 3.7 to generate appropriate JSON structures for each disease, leveraging existing modules as templates.
|
|
|
|
## Prerequisites
|
|
|
|
1. Python 3.6+
|
|
2. Required Python packages:
|
|
```
|
|
pip install anthropic tqdm
|
|
```
|
|
3. Anthropic API key:
|
|
```
|
|
export ANTHROPIC_API_KEY=your_api_key
|
|
```
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
# Generate 10 modules (default limit)
|
|
python module_generator.py
|
|
|
|
# Generate modules only for specific ICD-10 code categories
|
|
python module_generator.py --diseases I20,I21,I22
|
|
|
|
# Generate up to 50 modules
|
|
python module_generator.py --limit 50
|
|
|
|
# Prioritize high-prevalence diseases (recommended)
|
|
python module_generator.py --prioritize
|
|
|
|
# Combine options for best results
|
|
python module_generator.py --diseases I,J,K --limit 100 --prioritize
|
|
```
|
|
|
|
## How It Works
|
|
|
|
1. The script loads the complete disease list from `disease_list.json`
|
|
2. It filters out diseases that already have modules
|
|
3. If `--prioritize` is enabled, it:
|
|
- Estimates the prevalence of each disease using a heuristic scoring system
|
|
- Prioritizes diseases based on common conditions, ICD-10 chapter, and name specificity
|
|
- Selects the highest-scoring diseases first
|
|
4. For each selected disease:
|
|
- Finds the most relevant existing module as a template (based on ICD-10 code)
|
|
- Sends a prompt to Claude with the disease details and template
|
|
- Validates the generated JSON
|
|
- Saves the new module to the appropriate location
|
|
- Updates the progress tracking file
|
|
|
|
## Configuration
|
|
|
|
- `CLAUDE_MODEL`: Set the Claude model to use (default: `claude-3-7-sonnet-20240229`)
|
|
- `SYNTHEA_ROOT`: Path to the Synthea root directory (auto-detected)
|
|
|
|
## Cost Estimation
|
|
|
|
The script uses Claude 3.7 Sonnet, which costs approximately:
|
|
- Input: $3 per million tokens
|
|
- Output: $15 per million tokens
|
|
|
|
A typical generation will use:
|
|
- ~10K input tokens (template + prompt)
|
|
- ~5K output tokens (generated module)
|
|
|
|
At this rate, generating 1,000 modules would cost approximately:
|
|
- Input: 10M tokens = $30
|
|
- Output: 5M tokens = $75
|
|
- Total: ~$105
|
|
|
|
## Logging
|
|
|
|
The script logs all activity to both the console and to `module_generation.log` in the current directory.
|
|
|
|
## Notes
|
|
|
|
- The script includes a 1-second delay between API calls to avoid rate limits
|
|
- Generated modules should be manually reviewed for quality and accuracy
|
|
- You may want to run the script incrementally (e.g., by disease category) to review results
|
|
- The script optimizes API usage by:
|
|
- Checking if a module already exists before generating (by filename or ICD-10 code)
|
|
- Only using Claude when a new module genuinely needs to be created
|
|
- Prioritizing high-prevalence diseases when using the `--prioritize` flag |