Olamide Isreal 9e6a16c19b Initial commit: digital-patients pipeline (clean, no large files)
Large reference/model files excluded from repo - to be staged to S3 or baked into Docker images.
2026-03-26 15:15:23 +01:00

Digital Patient and Drug Response Pipeline - Comprehensive Implementation Plan

Pipeline Overview

flowchart TB
    subgraph Patient["Patient Profile Generation"]
        A1[1. Medical Records Creation] --> A2[2. Disease-specific Genome]
        A2 --> A3[3. Disease-specific Protein Variants]
        A3 --> A4[4. Disease-specific Transcriptome]
        A4 --> A5[5. Disease-specific Proteome]
        A5 --> A6[6. Disease-specific Metabolome]
        A6 --> A7[7. Disease-specific Immunome]
    end

    subgraph Drug["Drug Analysis & Modeling"]
        B1[8. Drug-Target PK] --> B1a[8a. Binding Site Prediction]
        B1a --> B1b[8b. Drug-Target Docking]
        B1b --> B2[9. Drug-Proteome Screening]
        B2 --> B3[10. Off-target Analysis]
        B3 --> B4[11. Drug-Compound Screening]
        B4 --> B5[12. Drug-Genome Sensitivity]
    end

    subgraph Response["Response Prediction"]
        C1[13. Transcriptomic Changes] --> C2[14. Disease Stage Evaluation]
        C2 --> C3[15-16. Proteomic & Metabolomic Changes]
        C3 --> C4[17-19. Biological & Immune Response]
        C4 --> C5[20-21. ADMET & Toxicity]
    end

    Patient --> Drug
    Drug --> Response
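
The stage-level arrows in the flowchart can also be read as a small dependency graph. The sketch below is illustrative only: the names `STAGE_DEPS` and `execution_order` are hypothetical and not part of the pipeline code. It uses Python's standard `graphlib` module to derive an order in which patient generation runs before drug analysis, which runs before response prediction.

```python
from graphlib import TopologicalSorter

# Hypothetical mapping: stage -> set of stages it depends on,
# transcribed from the subgraph-level arrows in the flowchart.
STAGE_DEPS = {
    "Patient": set(),
    "Drug": {"Patient"},
    "Response": {"Drug"},
}


def execution_order(deps):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(STAGE_DEPS)
print(order)
```

The same structure could be extended to the numbered steps inside each stage if a finer-grained ordering is needed.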

Part 1: Digital Patient Generation

Implementation Status Overview

| Step | Status | Tool/Method | Input | Output | Location | Validation Data | Dependencies/Notes |
|---|---|---|---|---|---|---|---|
| 1. Medical Records | | Synthea | - | Demographics, records | /Workspace/next/registry/tools/synthea | Target: 1000 patients/disease | - |
| 2. Disease Genome | | Omic-UKBB | Alleles, positions, frequencies | VCF (hg38) | Part of Synthea repo | - | Only storing variants |
| 3. Protein Variants | | vcf2prot | VCF | Protein fasta | Part of Synthea repo (tbc) | - | Multi-tissue support needed |
| 4. Transcriptome | 🚧 | borzoi | Genomic sequences | RNAseq (TPM) | - | ENCODE, GTEx (E-MTAB-6814) | NIH ENCODE standards |
| 5. Proteome | 🚧 | clei2block | RNAseq (log2-FC) | Fold-change | github.com/stasaki/clei2block | CellModelPassport, TCGA | Requires GTEx training |
| 6. Metabolome | | corto | RNAseq (TPM) | Metabolite profiles | github.com/federicogiorgi/corto | CCLE, NCI-60 | - |
| 7. Immunome | | Ecotyper | RNAseq | Cell type profiles | - | SPICA30, SPICA17 | - |
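
Step 4 emits RNAseq as TPM while step 5 consumes log2 fold-change, so a conversion between the two is implied. A minimal sketch of one way to do it, assuming the fold-change is taken against a reference profile (for example a healthy baseline) with a pseudocount of 1 to avoid dividing by zero. The function name, the pseudocount, and the choice of reference are assumptions, not confirmed pipeline behaviour.

```python
import math


def log2_fold_change(sample_tpm, reference_tpm, pseudocount=1.0):
    """Per-gene log2((sample + c) / (reference + c)) for genes in both profiles."""
    shared = sample_tpm.keys() & reference_tpm.keys()
    return {
        gene: math.log2(
            (sample_tpm[gene] + pseudocount) / (reference_tpm[gene] + pseudocount)
        )
        for gene in sorted(shared)
    }


# Toy values for illustration, not real expression data.
reference = {"GENE_A": 7.0, "GENE_B": 3.0, "GENE_C": 0.0}
sample = {"GENE_A": 15.0, "GENE_B": 1.0, "GENE_C": 0.0}
lfc = log2_fold_change(sample, reference)
print(lfc)
```

With these toy values GENE_A doubles (log2-FC of 1), GENE_B halves (log2-FC of -1), and GENE_C is unchanged.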

Active Implementation Tasks

Transcriptome Generation

Current goal: Establish accurate transcriptome prediction pipeline

  • Implement and evaluate primary models:
    • enformer
    • basenji
    • borzoi for RNAseq profiles, built on basenji and enformer
  • Add SequenceModelBenchmark ridge regression (built into borzoi, to be confirmed)
  • Validate against ENCODE standards
  • Implement GTEx validation pipeline
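
One common way to score predicted against observed expression is the Pearson correlation of log-transformed TPM values. The sketch below assumes that metric and a `log2(TPM + 1)` transform. Both are illustrative choices, and `expression_agreement` is a hypothetical helper rather than part of borzoi, ENCODE, or GTEx tooling.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def expression_agreement(predicted_tpm, observed_tpm):
    """Correlate log2(TPM + 1) over the genes present in both profiles."""
    genes = sorted(predicted_tpm.keys() & observed_tpm.keys())
    pred = [math.log2(predicted_tpm[g] + 1.0) for g in genes]
    obs = [math.log2(observed_tpm[g] + 1.0) for g in genes]
    return pearson(pred, obs)


# Toy profiles for illustration only.
predicted = {"GENE_A": 0.0, "GENE_B": 1.0, "GENE_C": 3.0, "GENE_D": 7.0}
observed = {"GENE_A": 0.0, "GENE_B": 1.0, "GENE_C": 3.0, "GENE_D": 7.0}
r = expression_agreement(predicted, observed)
```

A rank-based metric such as Spearman correlation may be preferable when the predicted and observed scales differ.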

Multi-omic Integration

Current goal: Create robust data transformation pipeline

  • Proteome prediction (clei2block):
    • Implement GTEx training pipeline
    • Add multi-tissue support
    • Create validation framework against CellModelPassport
  • Metabolome generation (corto):
    • Setup CCLE data integration
    • Implement NCI-60 validation
  • Immunome profiling:
    • Evaluate Ecotyper vs CIBERSORTx (CIBERSORTx is incorporated within Ecotyper)
    • Integrate SPICA datasets
    • Setup immune cell validation pipeline

Part 2: Drug Discovery and Response

Drug Development Tools

Current goal: Establish comprehensive drug analysis pipeline

  • Molecule Processing:
    • SELFIES library for converting biologics/peptides
    • Implement molecule validation checks
    • Setup standardization pipeline
  • Structure Analysis:
    • DreamDock + ConPlex score pipeline
    • LightDock for membrane binding
    • Validation framework with crystal structures

Binding Site Prediction

Current goal: Create consensus model for binding site prediction

  • Benchmark tools:
    • DiffDock implementation and testing
    • Qvina2 evaluation
    • P2Rank integration
    • FPocket analysis
  • Specific considerations:
    • Allosteric site detection
    • Multiple binding site handling
    • Protein flexibility modeling
  • Validation:
    • BindingDB integration
    • Crystal structure comparison pipeline
    • Edge case testing suite
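
A consensus over several binding site predictors could be built by clustering their predicted pocket centers and keeping the clusters that more than one tool supports. The sketch below is one possible approach under stated assumptions: each tool's output has already been reduced to pocket-center coordinates in Å, and the 4 Å radius, the two-tool threshold, and the function names are hypothetical. The coordinates are invented for illustration.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def consensus_sites(predictions, radius=4.0, min_tools=2):
    """Greedily cluster pocket centers; keep clusters backed by >= min_tools tools."""
    clusters = []  # each cluster: {"points": [...], "tools": set()}
    for tool, centers in predictions.items():
        for center in centers:
            for cluster in clusters:
                if math.dist(center, centroid(cluster["points"])) <= radius:
                    cluster["points"].append(center)
                    cluster["tools"].add(tool)
                    break
            else:
                # No existing cluster is close enough: start a new one.
                clusters.append({"points": [center], "tools": {tool}})
    return [
        (centroid(c["points"]), sorted(c["tools"]))
        for c in clusters
        if len(c["tools"]) >= min_tools
    ]


# Invented pocket centers, keyed by the tools named in the benchmark list.
predictions = {
    "P2Rank": [(10.0, 10.0, 10.0), (30.0, 5.0, 0.0)],
    "FPocket": [(11.0, 10.0, 10.0)],
    "DiffDock": [(10.0, 12.0, 10.0), (-20.0, 0.0, 0.0)],
}
sites = consensus_sites(predictions)
```

Greedy clustering depends on input order. A production version would likely use a proper clustering method and weight each tool by its benchmark accuracy.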

Drug-Target Analysis

Current goal: Robust docking and interaction prediction

  • Primary docking pipeline:
    • Uni-mol integration
    • DreamDock implementation
    • Path4Drug integration for pathways
  • Molecule type-specific handling:
    • Small molecule pipeline
    • Biologics pathway
    • PROTACs specific analysis
    • Prodrug processing
  • Interaction analysis:
    • Agonist vs antagonist classification
    • Protein-protein interaction integration
    • Chemical_checker for bioactivity signatures

Chemical Property Prediction

Current goal: Comprehensive property prediction system

  • Model implementation:
    • Chemprop evaluation
    • Soltrannet integration
    • Custom ADMET model development
  • Property coverage:
    • Solubility prediction
    • BBB penetration
    • Chemical stability
    • Metabolic processing

Toxicity Prediction Pipeline

Current goal: Multi-faceted toxicity assessment system

  • Core modules:
    • Cardiotoxicity (ion channel) prediction
    • Hepatotoxicity (Phase 1/2 proteins)
    • Nephrotoxicity assessment
    • Lung toxicity prediction
    • Neurotoxicity (BBB criteria)
    • Inflammatory response modeling
    • Bleeding/clotting risk analysis
  • Integration components:
    • Human Protein Atlas tissue proportion estimation
    • Reactome pathway analysis
    • Industry model benchmarking

Drug Response Analysis

Current goal: Integrated response prediction system

  • Transcriptomic response:
    • LINCS data integration
    • Expression change prediction
    • Tissue-specific effects
  • Multi-omic response:
    • Proteomic change modeling
    • Metabolomic adjustment prediction
    • Immune response profiling
  • Special cases:
    • Multi-drug combinations
    • Time-dependent effects
    • Population-specific responses

Critical Dependencies & Requirements

| Category | Component | Status | Notes |
|---|---|---|---|
| External Data | BindingDB | ✓ Available | Binding affinities |
| | LINCS | ✓ Available | Compound effects |
| | PharmGKB | Pending | Variant annotations |
| | Human Cell Atlas | Pending | Tissue-specific data |
| Compute | GPU Cluster | 🚧 Scaling | For enformer/basenji |
| | Storage | ✓ Configured | For variant data |
| | Distribution | Planned | For processing |

Validation Framework

| Dataset | Usage | Status | Notes |
|---|---|---|---|
| ENCODE | Transcriptomics | ✓ Ready | Primary validation |
| GTEx | Tissue-specific | ✓ Ready | E-MTAB-6814 |
| CCLE/GDSC2 | Cell lines | 🚧 In Progress | Cancer validation |
| TDC | ADMET | Planned | Benchmark data |
| Cross-species | Conservation | Planned | Evolutionary validation |
| Time-series | Metabolics | Planned | Kinetic validation |

Edge Cases & Special Considerations

Complex Scenarios

| Scenario | Implementation Status | Handling Strategy |
|---|---|---|
| Rare variants | 🚧 In Progress | Population frequency weighting |
| Multi-drug combinations | Planned | Interaction matrix modeling |
| Time-dependent effects | Planned | PK/PD time series modeling |
| Population specificity | 🚧 In Progress | Demographic stratification |
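
As a starting point for the planned PK/PD time series modeling, the simplest standard model is a one-compartment IV bolus, where concentration decays as C(t) = (dose / V) · exp(−k · t) and the half-life is ln(2) / k. The sketch below assumes that model. The parameter values are invented, and the function name is hypothetical rather than taken from the pipeline.

```python
import math


def concentration(dose_mg, volume_l, k_elim_per_h, t_h):
    """Plasma concentration (mg/L) at t_h hours after an IV bolus dose."""
    if volume_l <= 0 or k_elim_per_h <= 0 or t_h < 0:
        raise ValueError("volume and k must be positive, time non-negative")
    return (dose_mg / volume_l) * math.exp(-k_elim_per_h * t_h)


# Invented parameters: 100 mg dose, 10 L volume of distribution, 4 h half-life.
half_life_h = 4.0
k_elim = math.log(2) / half_life_h
profile = [(t, concentration(100.0, 10.0, k_elim, t)) for t in (0, 4, 8, 12)]
for t, c in profile:
    print(f"t={t:>2} h  C={c:.2f} mg/L")
```

The concentration halves every 4 hours in this toy profile. Oral dosing, multiple compartments, and a pharmacodynamic effect model would need to be layered on top.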

Special Drug Classes

| Class | Special Requirements | Status |
|---|---|---|
| Biologics | Membrane modeling, immunogenicity | Planned |
| Prodrugs | Metabolite prediction, activation | 🚧 In Progress |
| Combination therapy | Interaction prediction, timing | Planned |
| PROTACs | Protein degradation modeling | Planned |

Case Studies & Validation Examples

| Drug | Outcome | Learning Points | Implementation Status |
|---|---|---|---|
| Amcenestrant | Efficacy failure | Target validation importance | ✓ Integrated |
| Flupirtine | Liver toxicity | Metabolite prediction crucial | 🚧 In Progress |
| Ranitidine | NDMA formation | Chemical stability prediction | Planned |
| Multi-drug Examples | Variable | Interaction modeling needed | Planned |