Trying to fix basic functionality again.

2025-03-23 11:53:47 -07:00
parent ebda48190a
commit 2141e81f42
406 changed files with 173963 additions and 69 deletions

@@ -0,0 +1,83 @@
# Synthea Module Generator
This tool automates the creation of Synthea disease modules for the diseases listed in `disease_list.json`. For each disease, it asks Claude 3.7 Sonnet to generate the module JSON, using an existing module as a template.
## Prerequisites
1. Python 3.8+ (the `anthropic` SDK does not run on Python 3.6; check the SDK's current minimum version)
2. Required Python packages:
   ```bash
   pip install anthropic tqdm
   ```
3. Anthropic API key:
   ```bash
   export ANTHROPIC_API_KEY=your_api_key
   ```
## Usage
```bash
# Generate 10 modules (default limit)
python module_generator.py
# Generate modules only for specific ICD-10 code categories
python module_generator.py --diseases I20,I21,I22
# Generate up to 50 modules
python module_generator.py --limit 50
# Prioritize high-prevalence diseases (recommended)
python module_generator.py --prioritize
# Combine options for best results
python module_generator.py --diseases I,J,K --limit 100 --prioritize
```
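
The flags above could be parsed along the following lines. This is a minimal sketch, not the code in `module_generator.py`: the helper names `build_parser`, `parse_prefixes`, and `matches` are hypothetical, and treating `--diseases` values as case-insensitive ICD-10 prefixes is an assumption based on the `I,J,K` example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Generate Synthea disease modules")
    parser.add_argument("--diseases", default=None,
                        help="comma-separated ICD-10 codes or prefixes, e.g. I20,I21 or I,J,K")
    parser.add_argument("--limit", type=int, default=10,
                        help="maximum number of modules to generate")
    parser.add_argument("--prioritize", action="store_true",
                        help="generate high-prevalence diseases first")
    return parser


def parse_prefixes(raw):
    """Split the --diseases value into a list of upper-case prefixes."""
    if not raw:
        return []
    return [part.strip().upper() for part in raw.split(",") if part.strip()]


def matches(icd10_code, prefixes):
    """Return True if the code starts with any prefix; no prefixes means no filter."""
    return not prefixes or icd10_code.upper().startswith(tuple(prefixes))
```

With this sketch, `--diseases I,J,K` would select every code beginning with `I`, `J`, or `K`, while `--diseases I20,I21,I22` would select only those three categories and their subcodes.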
## How It Works
1. The script loads the complete disease list from `disease_list.json`
2. It filters out diseases that already have modules
3. If `--prioritize` is enabled, it:
   - Estimates the prevalence of each disease using a heuristic scoring system
   - Scores each disease on whether it matches a common condition, which ICD-10 chapter it belongs to, and how specific its name is
   - Selects the highest-scoring diseases first
4. For each selected disease, it:
   - Finds the most relevant existing module to use as a template (based on ICD-10 code)
   - Sends a prompt to Claude with the disease details and the template
   - Validates the generated JSON
   - Saves the new module to the appropriate location
   - Updates the progress tracking file
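
To make the prioritization step concrete, here is a minimal sketch of a heuristic score built from the three signals listed above. Everything specific in it is an assumption for illustration: the keyword lists, the weights, the use of the first letter of the code as a stand-in for the ICD-10 chapter, and the `name` and `icd10` field names. The script's real scoring values may differ.

```python
# Hypothetical keyword lists and weights; the script's real values may differ.
COMMON_KEYWORDS = ("hypertension", "diabetes", "asthma", "pneumonia")
# Keyed by the first letter of the ICD-10 code, used as a rough proxy for the chapter.
CHAPTER_WEIGHTS = {"I": 3.0, "J": 2.5, "E": 2.0}
DEFAULT_CHAPTER_WEIGHT = 1.0
VAGUE_TERMS = ("unspecified", "not elsewhere classified")


def prevalence_score(disease):
    """Combine the three signals into one number; higher means generate sooner."""
    name = disease["name"].lower()
    code = disease["icd10"].upper()
    score = 0.0
    if any(keyword in name for keyword in COMMON_KEYWORDS):
        score += 5.0  # name matches a well-known common condition
    score += CHAPTER_WEIGHTS.get(code[:1], DEFAULT_CHAPTER_WEIGHT)
    if any(term in name for term in VAGUE_TERMS):
        score -= 2.0  # catch-all names are a poor basis for a module
    return score


def select(diseases, limit):
    """Return the `limit` highest-scoring diseases, best first."""
    return sorted(diseases, key=prevalence_score, reverse=True)[:limit]
```

Because `sorted` is stable, diseases with equal scores keep their order from `disease_list.json`.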
## Configuration
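- `CLAUDE_MODEL`: Set the Claude model to use (default: `claude-3-7-sonnet-20240229`). Note that the identifier Anthropic publishes for Claude 3.7 Sonnet is `claude-3-7-sonnet-20250219`; if API calls fail with a model-not-found error, check this value first.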
- `SYNTHEA_ROOT`: Path to the Synthea root directory (auto-detected)
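
How `SYNTHEA_ROOT` is auto-detected is not described here. One plausible approach, sketched below, walks upward from a starting directory until it finds Synthea's `src/main/resources/modules` folder. The function names are hypothetical, and letting a `SYNTHEA_ROOT` environment variable override detection is an assumption, not documented behavior.

```python
import os
from pathlib import Path

# Location of the module JSON files inside a Synthea checkout.
MODULES_SUBDIR = Path("src") / "main" / "resources" / "modules"


def detect_synthea_root(start):
    """Walk upward from `start` and return the first directory that contains
    the modules folder, or None if no ancestor does."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / MODULES_SUBDIR).is_dir():
            return candidate
    return None


def resolve_synthea_root(start="."):
    """Prefer an explicit SYNTHEA_ROOT environment variable (an assumption),
    otherwise fall back to auto-detection."""
    override = os.environ.get("SYNTHEA_ROOT")
    if override:
        return Path(override)
    return detect_synthea_root(start)
```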
## Cost Estimation
The script uses Claude 3.7 Sonnet, which costs approximately:
- Input: $3 per million tokens
- Output: $15 per million tokens
A typical generation will use:
- ~10K input tokens (template + prompt)
- ~5K output tokens (generated module)
At these rates, generating 1,000 modules would cost approximately:
- Input: 10M tokens = $30
- Output: 5M tokens = $75
- Total: ~$105
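
The same arithmetic as a small helper, useful for estimating other batch sizes. The prices and token counts are the approximate figures quoted above, not measured values, and `estimate_cost` is an illustrative name.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens


def estimate_cost(modules, input_tokens=10_000, output_tokens=5_000):
    """Return (input_cost, output_cost, total) in USD for a batch of modules."""
    input_cost = modules * input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = modules * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost, output_cost, input_cost + output_cost


print(estimate_cost(1000))  # (30.0, 75.0, 105.0)
```

That works out to roughly $0.105 per module, so the default run of 10 modules costs about $1.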
## Logging
The script logs all activity to both the console and to `module_generation.log` in the current directory.
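
Logging to both destinations can be set up with the standard `logging` module, as in the sketch below. The logger name, message format, and `configure_logging` helper are assumptions; only the log file name comes from this README.

```python
import logging
import sys


def configure_logging(log_path="module_generation.log"):
    """Attach a console handler and a file handler to one named logger."""
    logger = logging.getLogger("module_generator")
    logger.setLevel(logging.INFO)
    if logger.handlers:
        # Already configured; avoid adding duplicate handlers.
        return logger
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(log_path, encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Because the default path is relative, the log file lands in whichever directory the script is run from.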
## Notes
- The script includes a 1-second delay between API calls to avoid rate limits
- Generated modules should be manually reviewed for quality and accuracy
- Consider running the script incrementally (e.g., one disease category at a time) so that each batch can be reviewed before generating more
- The script optimizes API usage by:
  - Checking whether a module already exists before generating (by filename or ICD-10 code)
  - Calling Claude only when a new module genuinely needs to be created
  - Prioritizing high-prevalence diseases when the `--prioritize` flag is used
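
The existence check could look something like the sketch below. It is only an illustration: the snake_case filename rule in `slugify`, the `name` and `icd10` field names, and the idea of finding a disease's code under any `"code"` key inside an existing module are all assumptions about how the script matches modules.

```python
import json
import re
from pathlib import Path


def slugify(name):
    """Turn a disease name into a snake_case file stem (hypothetical naming rule)."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def collect_codes(node, found):
    """Recursively gather every string stored under a "code" key."""
    if isinstance(node, dict):
        code = node.get("code")
        if isinstance(code, str):
            found.add(code.upper())
        for value in node.values():
            collect_codes(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_codes(item, found)


def module_exists(disease, modules_dir):
    """Return True if a module matches by filename or by a code it contains."""
    target_stem = slugify(disease["name"])
    target_code = disease["icd10"].upper()
    for path in Path(modules_dir).glob("*.json"):
        if path.stem.lower() == target_stem:
            return True
        try:
            module = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # an unreadable file cannot prove that a module exists
        found = set()
        collect_codes(module, found)
        if target_code in found:
            return True
    return False
```

Running a check like this before each API call means a rerun after an interruption skips work that has already been done.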