Synthea Module Generator

This tool automates the creation of Synthea disease modules from the entries in disease_list.json. For each disease it asks Claude 3.7 Sonnet to generate a module in Synthea's JSON format, using the most relevant existing module as a template.

Prerequisites

  1. Python 3.8 or newer (the anthropic package does not support Python 3.6; check its current minimum version)
  2. Required Python packages:
    pip install anthropic tqdm
    
  3. Anthropic API key:
    export ANTHROPIC_API_KEY=your_api_key
    

Usage

# Generate 10 modules (default limit)
python module_generator.py

# Generate modules only for specific ICD-10 code categories 
python module_generator.py --diseases I20,I21,I22

# Generate up to 50 modules
python module_generator.py --limit 50

# Prioritize high-prevalence diseases (recommended)
python module_generator.py --prioritize

# Combine options for best results
python module_generator.py --diseases I,J,K --limit 100 --prioritize
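
The flags above could be parsed along the following lines. This is only a sketch, assuming a standard argparse setup: the flag names and the default limit of 10 come from the examples above, while the helper names (parse_codes, build_parser) and the upper-casing of codes are illustrative assumptions, not the script's actual code.

```python
import argparse
from typing import List


def parse_codes(value: str) -> List[str]:
    """Split a comma-separated list of ICD-10 codes or prefixes (hypothetical helper)."""
    return [code.strip().upper() for code in value.split(",") if code.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate Synthea disease modules")
    # Restrict generation to the given ICD-10 codes or code prefixes.
    parser.add_argument("--diseases", type=parse_codes, default=None)
    # Maximum number of modules to generate in one run (10 per the usage above).
    parser.add_argument("--limit", type=int, default=10)
    # Generate modules for high-prevalence diseases first.
    parser.add_argument("--prioritize", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--diseases", "I,J,K", "--limit", "100", "--prioritize"]
    )
    print(args.diseases, args.limit, args.prioritize)
```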

How It Works

  1. The script loads the complete disease list from disease_list.json
  2. It filters out diseases that already have modules
  3. If --prioritize is enabled, it:
    • Estimates the prevalence of each disease using a heuristic scoring system
    • Bases the score on whether the disease matches a known common condition, its ICD-10 chapter, and how specific its name is
    • Selects the highest-scoring diseases first
  4. For each selected disease:
    • Finds the most relevant existing module as a template (based on ICD-10 code)
    • Sends a prompt to Claude with the disease details and template
    • Validates the generated JSON
    • Saves the new module to the appropriate location
    • Updates the progress tracking file
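
The prioritization, template selection, and validation steps above can be sketched as follows. Everything here is an assumption made for illustration: the field names ("name", "icd10"), the weights, and the function names are hypothetical, and the real script may score and match differently. The validation check relies only on the fact that a Synthea generic module is a JSON object whose "states" include one named "Initial".

```python
import json
from typing import Dict, Optional

# Hypothetical weights; the script's actual heuristic is not documented here.
COMMON_TERMS = {"hypertension": 5, "diabetes": 5, "asthma": 4}
CHAPTER_WEIGHTS = {"I": 3, "J": 3, "E": 2}


def prevalence_score(disease: Dict[str, str]) -> int:
    """Rough prevalence estimate from name keywords, ICD-10 chapter, and name length."""
    name = disease["name"].lower()
    code = disease["icd10"].upper()
    score = sum(weight for term, weight in COMMON_TERMS.items() if term in name)
    score += CHAPTER_WEIGHTS.get(code[:1], 0)
    # Assumption: short, unspecific names tend to denote broader, more common conditions.
    score += max(0, 4 - len(name.split()))
    return score


def pick_template(code: str, existing: Dict[str, str]) -> Optional[str]:
    """Return the existing module whose ICD-10 code shares the longest prefix with `code`."""
    best, best_len = None, 0
    for other_code, filename in existing.items():
        shared = 0
        for a, b in zip(code.upper(), other_code.upper()):
            if a != b:
                break
            shared += 1
        if shared > best_len:
            best, best_len = filename, shared
    return best


def is_valid_module(text: str) -> bool:
    """Check that the text is JSON and has the minimal shape of a Synthea module."""
    try:
        module = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(module, dict)
        and isinstance(module.get("states"), dict)
        and "Initial" in module["states"]
    )


if __name__ == "__main__":
    diseases = [
        {"name": "Other specified disorders of nose and nasal sinuses", "icd10": "J34.8"},
        {"name": "Essential hypertension", "icd10": "I10"},
    ]
    for d in sorted(diseases, key=prevalence_score, reverse=True):
        print(d["icd10"], prevalence_score(d))
```

A structural check like is_valid_module only catches malformed output; it does not replace the manual review of clinical content.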

Configuration

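  • CLAUDE_MODEL: The Claude model to use (default: claude-3-7-sonnet-20240229). Anthropic's published identifier for Claude 3.7 Sonnet is claude-3-7-sonnet-20250219, so confirm the value in the script if the API rejects the model name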
  • SYNTHEA_ROOT: Path to the Synthea root directory (auto-detected)
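
One way these two settings might be resolved is sketched below. It assumes, without confirmation from the script, that both can be overridden through environment variables of the same name and that auto-detection means walking up from a starting directory until Synthea's module directory (src/main/resources/modules) is found. The function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Optional

# Location of modules inside a Synthea checkout.
MODULES_SUBDIR = Path("src/main/resources/modules")


def detect_synthea_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` until a directory containing the modules folder is found."""
    for candidate in [start, *start.parents]:
        if (candidate / MODULES_SUBDIR).is_dir():
            return candidate
    return None


def load_config(default_model: str, start: Path) -> Dict[str, Optional[str]]:
    """Resolve settings, preferring environment overrides (an assumption of this sketch)."""
    model = os.environ.get("CLAUDE_MODEL", default_model)
    root_override = os.environ.get("SYNTHEA_ROOT")
    root = Path(root_override) if root_override else detect_synthea_root(start)
    return {"model": model, "synthea_root": str(root) if root else None}


if __name__ == "__main__":
    # Prints the detected root, or None when run outside a Synthea checkout.
    print(detect_synthea_root(Path.cwd()))
```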

Cost Estimation

The script uses Claude 3.7 Sonnet, which costs approximately:

  • Input: $3 per million tokens
  • Output: $15 per million tokens

A typical generation will use:

  • ~10K input tokens (template + prompt)
  • ~5K output tokens (generated module)

At these prices and token counts, generating 1,000 modules would cost approximately:

  • Input: 10M tokens = $30
  • Output: 5M tokens = $75
  • Total: ~$105
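
The arithmetic above can be reproduced for other batch sizes with a small helper. This is a sketch using the prices and per-module token counts quoted in this section as defaults; the function name is illustrative, and real token usage will vary by disease and template.

```python
from typing import Tuple


def estimate_cost(
    modules: int,
    input_tokens_per_module: int = 10_000,
    output_tokens_per_module: int = 5_000,
    input_price_per_mtok: float = 3.0,
    output_price_per_mtok: float = 15.0,
) -> Tuple[float, float, float]:
    """Return (input cost, output cost, total cost) in US dollars."""
    input_cost = modules * input_tokens_per_module / 1_000_000 * input_price_per_mtok
    output_cost = modules * output_tokens_per_module / 1_000_000 * output_price_per_mtok
    return input_cost, output_cost, input_cost + output_cost


if __name__ == "__main__":
    inp, out, total = estimate_cost(1000)
    # Prints: Input: $30.00, Output: $75.00, Total: $105.00
    print(f"Input: ${inp:.2f}, Output: ${out:.2f}, Total: ${total:.2f}")
```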

Logging

The script logs all activity to both the console and to module_generation.log in the current directory.
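
Logging to both destinations is typically done with two handlers on one logger. The sketch below shows that pattern with Python's standard logging module; the logger name, message format, and function name are assumptions, and only the log file name comes from this README.

```python
import logging
import sys


def configure_logging(
    log_path: str = "module_generation.log", name: str = "module_generator"
) -> logging.Logger:
    """Attach a console handler and a file handler to the same logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        handlers = (
            logging.StreamHandler(sys.stdout),
            logging.FileHandler(log_path, encoding="utf-8"),
        )
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Calling configure_logging() once at startup and then logger.info(...) for each generated module would write every message to the console and to the log file.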

Notes

  • The script includes a 1-second delay between API calls to avoid rate limits
  • Generated modules should be manually reviewed for quality and accuracy
  • Consider running the script incrementally (e.g., one disease category at a time) so that each batch can be reviewed before generating more
  • The script optimizes API usage by:
    • Checking if a module already exists before generating (by filename or ICD-10 code)
    • Only using Claude when a new module genuinely needs to be created
    • Prioritizing high-prevalence diseases when using the --prioritize flag
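
The existence check and the delay between API calls can be sketched as a simple loop. This is an illustration under stated assumptions: the field names ("name", "icd10"), the filename convention, and the function names are hypothetical, and the generate callback stands in for the actual Claude request, which is not shown.

```python
import re
import time
from typing import Callable, Dict, Iterable, List, Set


def module_filename(name: str) -> str:
    """Derive a module filename from a disease name (hypothetical convention)."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{slug}.json"


def needs_generation(
    disease: Dict[str, str], existing_files: Set[str], covered_codes: Set[str]
) -> bool:
    """A module is needed only if neither its filename nor its ICD-10 code is present."""
    return (
        module_filename(disease["name"]) not in existing_files
        and disease["icd10"].upper() not in covered_codes
    )


def generate_all(
    diseases: Iterable[Dict[str, str]],
    existing_files: Set[str],
    covered_codes: Set[str],
    generate: Callable[[Dict[str, str]], None],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `generate` for each missing module, pausing between consecutive calls."""
    generated: List[str] = []
    for disease in diseases:
        if not needs_generation(disease, existing_files, covered_codes):
            continue  # skip without spending an API call
        if generated:
            sleep(delay_seconds)  # stay under the API rate limit
        generate(disease)
        generated.append(disease["icd10"])
    return generated
```

Passing sleep as a parameter keeps the loop testable without real waiting.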