# Synthea Module Generator

This tool automates the creation of disease modules for Synthea based on `disease_list.json`. It uses Claude 3.7 to generate appropriate JSON structures for each disease, leveraging existing modules as templates.

## Prerequisites

1. Python 3.6+
2. Required Python packages:
   ```
   pip install anthropic tqdm
   ```
3. Anthropic API key:
   ```
   export ANTHROPIC_API_KEY=your_api_key
   ```

## Usage

```bash
# Generate 10 modules (default limit)
python module_generator.py

# Generate modules only for specific ICD-10 code categories
python module_generator.py --diseases I20,I21,I22

# Generate up to 50 modules
python module_generator.py --limit 50

# Prioritize high-prevalence diseases (recommended)
python module_generator.py --prioritize

# Combine options for best results
python module_generator.py --diseases I,J,K --limit 100 --prioritize
```

## How It Works

1. The script loads the complete disease list from `disease_list.json`
2. It filters out diseases that already have modules
3. If `--prioritize` is enabled, it:
   - Estimates the prevalence of each disease using a heuristic scoring system
   - Prioritizes diseases based on common conditions, ICD-10 chapter, and name specificity
   - Selects the highest-scoring diseases first
4. For each selected disease, it:
   - Finds the most relevant existing module as a template (based on ICD-10 code)
   - Sends a prompt to Claude with the disease details and template
   - Validates the generated JSON
   - Saves the new module to the appropriate location
   - Updates the progress tracking file

## Configuration

- `CLAUDE_MODEL`: Set the Claude model to use (default: `claude-3-7-sonnet-20240229`)
- `SYNTHEA_ROOT`: Path to the Synthea root directory (auto-detected)

## Cost Estimation

The script uses Claude 3.7 Sonnet, which costs approximately:

- Input: $3 per million tokens
- Output: $15 per million tokens

A typical generation will use:

- ~10K input tokens (template + prompt)
- ~5K output tokens (generated module)

At this rate, generating 1,000 modules would cost approximately:

- Input: 10M tokens = $30
- Output: 5M tokens = $75
- Total: ~$105

## Logging

The script logs all activity to both the console and to `module_generation.log` in the current directory.

## Notes

- The script includes a 1-second delay between API calls to avoid rate limits
- Generated modules should be manually reviewed for quality and accuracy
- You may want to run the script incrementally (e.g., by disease category) to review results
- The script optimizes API usage by:
  - Checking if a module already exists before generating (by filename or ICD-10 code)
  - Only using Claude when a new module genuinely needs to be created
  - Prioritizing high-prevalence diseases when using the `--prioritize` flag
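
## Example: Estimating Cost for a Batch

The arithmetic in the Cost Estimation section can be reproduced for any batch size with a few lines of Python. This is a minimal sketch, not part of `module_generator.py`: the function name `estimate_cost` and the constant names are hypothetical, and the prices and per-module token counts are the approximate figures quoted above rather than measured values.

```python
# Approximate figures taken from the Cost Estimation section of this README.
# They are estimates; real usage varies with template and module size.
INPUT_PRICE_PER_MTOK = 3.0      # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.0    # USD per million output tokens
INPUT_TOKENS_PER_MODULE = 10_000
OUTPUT_TOKENS_PER_MODULE = 5_000


def estimate_cost(num_modules):
    """Return (input_cost, output_cost, total_cost) in USD.

    Hypothetical helper for illustration only.
    """
    input_mtok = num_modules * INPUT_TOKENS_PER_MODULE / 1_000_000
    output_mtok = num_modules * OUTPUT_TOKENS_PER_MODULE / 1_000_000
    input_cost = input_mtok * INPUT_PRICE_PER_MTOK
    output_cost = output_mtok * OUTPUT_PRICE_PER_MTOK
    return input_cost, output_cost, input_cost + output_cost


if __name__ == "__main__":
    for n in (10, 100, 1000):
        inp, out, total = estimate_cost(n)
        print(f"{n:>5} modules: input ${inp:.2f} + output ${out:.2f} = ${total:.2f}")
```

Running it with a planned `--limit` value before starting a large batch gives a rough budget figure; the estimate scales linearly with the number of modules.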
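
## Example: Choosing a Template by ICD-10 Code

How It Works states that the template is the most relevant existing module "based on ICD-10 code" but does not spell out the matching rule. One plausible rule is to pick the existing module whose code shares the longest prefix with the target code. The sketch below illustrates that idea only. It is an assumption, not the script's actual logic: `pick_template`, `common_prefix_len`, and the `existing_modules` mapping (including its module names) are hypothetical.

```python
def common_prefix_len(a, b):
    """Count the leading characters two ICD-10 codes have in common."""
    n = 0
    for x, y in zip(a.upper(), b.upper()):
        if x != y:
            break
        n += 1
    return n


def pick_template(target_code, existing):
    """Return the module name whose code shares the longest prefix with
    target_code, or None if no code shares even the first character.

    `existing` maps module name -> ICD-10 code (hypothetical structure).
    Ties go to the alphabetically first module name.
    """
    best_name, best_len = None, 0
    for name, code in sorted(existing.items()):
        n = common_prefix_len(target_code, code)
        if n > best_len:
            best_name, best_len = name, n
    return best_name


# Hypothetical inventory of existing modules, for illustration only.
existing_modules = {
    "stable_angina": "I20",
    "asthma": "J45",
    "gastritis": "K29",
}

if __name__ == "__main__":
    for code in ("I21", "J44", "E11"):
        print(code, "->", pick_template(code, existing_modules))
```

With this inventory, `I21` maps to the `I20` module and `J44` to the `J45` module, while `E11` has no match and would need a fallback such as a generic template.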