Trying to fix basic functionality again.
83
module_generator/README_module_generator.md
Normal file
@@ -0,0 +1,83 @@
# Synthea Module Generator

This tool automates the creation of disease modules for Synthea based on `disease_list.json`. It uses Claude 3.7 to generate appropriate JSON structures for each disease, leveraging existing modules as templates.

## Prerequisites

1. Python 3.6+
2. Required Python packages:
   ```bash
   pip install anthropic tqdm
   ```
3. Anthropic API key:
   ```bash
   export ANTHROPIC_API_KEY=your_api_key
   ```

## Usage

```bash
# Generate 10 modules (default limit)
python module_generator.py

# Generate modules only for specific ICD-10 code categories
python module_generator.py --diseases I20,I21,I22

# Generate up to 50 modules
python module_generator.py --limit 50

# Prioritize high-prevalence diseases (recommended)
python module_generator.py --prioritize

# Combine options for best results
python module_generator.py --diseases I,J,K --limit 100 --prioritize
```

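The flags shown in the usage examples suggest a command-line interface along the following lines. This is a minimal sketch, not the script's actual parser: the flag names and the default limit of 10 come from the examples, while the `build_parser` helper, the comma-splitting behavior, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Generate Synthea disease modules")
    parser.add_argument(
        "--diseases",
        # Assumption: the value is a comma-separated list of ICD-10 codes or prefixes.
        type=lambda s: [code.strip() for code in s.split(",") if code.strip()],
        default=None,
        help="Comma-separated ICD-10 codes or prefixes, e.g. I20,I21 or I,J,K",
    )
    parser.add_argument("--limit", type=int, default=10,
                        help="Maximum number of modules to generate")
    parser.add_argument("--prioritize", action="store_true",
                        help="Generate high-prevalence diseases first")
    return parser


# Parse an explicit argument list, equivalent to the last usage example.
args = build_parser().parse_args(["--diseases", "I,J,K", "--limit", "100", "--prioritize"])
print(args.diseases, args.limit, args.prioritize)
```

Parsing `--diseases` into a list up front keeps the later filtering step free of string handling.
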
## How It Works

1. The script loads the complete disease list from `disease_list.json`
2. It filters out diseases that already have modules
3. If `--prioritize` is enabled, it:
   - Estimates the prevalence of each disease using a heuristic scoring system
   - Prioritizes diseases based on common conditions, ICD-10 chapter, and name specificity
   - Selects the highest-scoring diseases first
4. For each selected disease:
   - Finds the most relevant existing module as a template (based on ICD-10 code)
   - Sends a prompt to Claude with the disease details and template
   - Validates the generated JSON
   - Saves the new module to the appropriate location
   - Updates the progress tracking file

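The prioritization, template selection, and validation steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the script's actual logic: the `icd10` and `name` keys, the keyword list, the chapter weights, and all three function names are hypothetical. The validation check relies only on the fact that a Synthea module is a JSON object with a `name`, a `states` object, and a state of type `Initial`.

```python
import json
from typing import Dict, Optional

# Illustrative keyword list; the real heuristic's list is not shown in this README.
COMMON_KEYWORDS = ("hypertension", "diabetes", "asthma")


def prevalence_score(disease: Dict[str, str]) -> int:
    """Toy heuristic: common-condition keyword, ICD-10 chapter, name specificity."""
    name = disease["name"].lower()
    score = 0
    if any(word in name for word in COMMON_KEYWORDS):
        score += 10
    # Assumed chapter weights: circulatory (I), respiratory (J), endocrine (E).
    score += {"I": 5, "J": 4, "E": 3}.get(disease["icd10"][:1], 0)
    # Long, highly specific names are penalized relative to short, broad ones.
    score -= len(name.split()) // 4
    return score


def pick_template(icd10: str, existing: Dict[str, str]) -> Optional[str]:
    """Return the existing module whose ICD-10 code shares the longest prefix."""
    best, best_len = None, 0
    for code, module_name in existing.items():
        shared = 0
        for a, b in zip(code, icd10):
            if a != b:
                break
            shared += 1
        if shared > best_len:
            best, best_len = module_name, shared
    return best


def validate_module(text: str) -> dict:
    """Parse generated text and check the minimal shape of a Synthea module."""
    module = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(module, dict):
        raise ValueError("module must be a JSON object")
    states = module.get("states")
    if "name" not in module or not isinstance(states, dict):
        raise ValueError("module needs a 'name' and a 'states' object")
    if not any(isinstance(s, dict) and s.get("type") == "Initial" for s in states.values()):
        raise ValueError("module needs a state of type 'Initial'")
    return module


diseases = [
    {"icd10": "I21", "name": "Acute myocardial infarction"},
    {"icd10": "J45", "name": "Asthma"},
]
ranked = sorted(diseases, key=prevalence_score, reverse=True)
template = pick_template("I21", {"I20": "angina", "J45": "asthma"})
```

A structural check like `validate_module` only catches malformed output; it does not replace clinical review of the generated module.
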
## Configuration
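
- `CLAUDE_MODEL`: The Claude model to use (default: `claude-3-7-sonnet-20240229`). Note that Anthropic's published identifier for Claude 3.7 Sonnet is `claude-3-7-sonnet-20250219`, so check this default against the script before relying on it.
- `SYNTHEA_ROOT`: Path to the Synthea root directory (auto-detected)
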
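One way these two settings could be resolved is sketched below. This README does not say whether they are environment variables or constants inside the script, so the environment lookup and both helper names (`resolve_model`, `detect_synthea_root`) are assumptions. The detection relies on Synthea keeping its modules under `src/main/resources/modules`.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_MODEL = "claude-3-7-sonnet-20240229"  # default as documented in this README


def resolve_model() -> str:
    """Assumption: CLAUDE_MODEL can be overridden through the environment."""
    return os.environ.get("CLAUDE_MODEL", DEFAULT_MODEL)


def detect_synthea_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` until a directory containing Synthea's modules folder is found."""
    for candidate in [start, *start.parents]:
        if (candidate / "src" / "main" / "resources" / "modules").is_dir():
            return candidate
    return None  # caller decides how to report a missing Synthea checkout
```

Returning `None` instead of raising leaves it to the caller to fall back to an explicit path.
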
## Cost Estimation

The script uses Claude 3.7 Sonnet, which costs approximately:

- Input: $3 per million tokens
- Output: $15 per million tokens

A typical generation will use:

- ~10K input tokens (template + prompt)
- ~5K output tokens (generated module)

At this rate, generating 1,000 modules would cost approximately:

- Input: 10M tokens = $30
- Output: 5M tokens = $75
- Total: ~$105

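The same arithmetic can be reproduced for other batch sizes with a small helper. The prices and per-module token counts are the approximate figures quoted in this section; the `estimate_cost` function itself is a hypothetical convenience, not part of the script.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (figure quoted in this section)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (figure quoted in this section)


def estimate_cost(modules: int, input_tokens: int = 10_000, output_tokens: int = 5_000) -> float:
    """Estimated USD cost of generating `modules` modules at the given per-module token counts."""
    input_cost = modules * input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = modules * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


print(f"${estimate_cost(1_000):.2f}")  # prints $105.00
```

Actual token counts vary with template size, so treat the result as a rough budget rather than a bill.
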
## Logging

The script logs all activity to both the console and to `module_generation.log` in the current directory.

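Logging to both destinations is typically done with two handlers on one logger. The sketch below shows one such setup using the standard `logging` module; the logger name, message format, and `setup_logging` helper are assumptions rather than the script's actual configuration.

```python
import logging
import sys


def setup_logging(log_path: str = "module_generation.log") -> logging.Logger:
    """Attach a console handler and a file handler to a single logger."""
    logger = logging.getLogger("module_generator")  # logger name is an assumption
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Calling `setup_logging()` once at startup and then `logger.info(...)` writes each message to the console and appends it to the log file.
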
## Notes

- The script includes a 1-second delay between API calls to avoid rate limits
- Generated modules should be manually reviewed for quality and accuracy
- You may want to run the script incrementally (e.g., by disease category) to review results
- The script optimizes API usage by:
  - Checking if a module already exists before generating (by filename or ICD-10 code)
  - Only using Claude when a new module genuinely needs to be created
  - Prioritizing high-prevalence diseases when using the `--prioritize` flag

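The skip-if-exists check and the fixed delay between API calls can be sketched as a simple loop. This is a sketch under assumptions: the `slug` key, the `generate_missing` helper, and the `generate` callable (standing in for the Claude request) are hypothetical, and only the filename check is shown, not the ICD-10 code lookup.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def generate_missing(
    diseases: Iterable[dict],
    modules_dir: Path,
    generate: Callable[[dict], str],
    delay_seconds: float = 1.0,
) -> List[Path]:
    """Generate modules only for diseases without an existing file, pausing between calls."""
    written = []
    for disease in diseases:
        target = modules_dir / f"{disease['slug']}.json"  # 'slug' key is hypothetical
        if target.exists():
            continue  # module already present, so no API call is made
        target.write_text(generate(disease))
        written.append(target)
        time.sleep(delay_seconds)  # fixed pause between API calls
    return written
```

Passing the generator in as a callable makes the loop easy to exercise with a stub before spending API credits.
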