Initial commit: digital-patients pipeline (clean, no large files)

Large reference/model files excluded from repo - to be staged to S3 or baked into Docker images.
2026-03-26 15:15:23 +01:00
commit 9e6a16c19b
45 changed files with 7207 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,210 @@
# Digital Patient and Drug Response Pipeline - Comprehensive Implementation Plan
## Pipeline Overview
```mermaid
flowchart TB
subgraph Patient["Patient Profile Generation"]
A1[1. Medical Records Creation] --> A2[2. Disease-specific Genome]
A2 --> A3[3. Disease-specific Protein Variants]
A3 --> A4[4. Disease-specific Transcriptome]
A4 --> A5[5. Disease-specific Proteome]
A5 --> A6[6. Disease-specific Metabolome]
A6 --> A7[7. Disease-specific Immunome]
end
subgraph Drug["Drug Analysis & Modeling"]
B1[8. Drug-Target PK] --> B1a[8a. Binding Site Prediction]
B1a --> B1b[8b. Drug-Target Docking]
B1b --> B2[9. Drug-Proteome Screening]
B2 --> B3[10. Off-target Analysis]
B3 --> B4[11. Drug-Compound Screening]
B4 --> B5[12. Drug-Genome Sensitivity]
end
subgraph Response["Response Prediction"]
C1[13. Transcriptomic Changes] --> C2[14. Disease Stage Evaluation]
C2 --> C3[15-16. Proteomic & Metabolomic Changes]
C3 --> C4[17-19. Biological & Immune Response]
C4 --> C5[20-21. ADMET & Toxicity]
end
Patient --> Drug
Drug --> Response
```
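The flowchart implies a strict ordering: patient profile steps run first, then drug analysis, then response prediction. A minimal sketch of how that ordering could be encoded and resolved is shown here. The step names are hypothetical labels that mirror the flowchart; the repository's actual orchestration is not described in this README.

```python
from graphlib import TopologicalSorter

# Hypothetical step names, chosen to mirror the flowchart labels.
STAGES = {
    "patient": ["medical_records", "genome", "protein_variants",
                "transcriptome", "proteome", "metabolome", "immunome"],
    "drug": ["drug_target_pk", "binding_site_prediction", "docking",
             "proteome_screening", "off_target_analysis",
             "compound_screening", "genome_sensitivity"],
    "response": ["transcriptomic_changes", "disease_stage_evaluation",
                 "proteomic_metabolomic_changes",
                 "biological_immune_response", "admet_toxicity"],
}

# Stage-level edges from the flowchart: Patient --> Drug --> Response.
STAGE_DEPENDENCIES = {"patient": set(), "drug": {"patient"}, "response": {"drug"}}


def execution_order(stages, stage_dependencies):
    """Return every step in an order that respects the stage dependencies."""
    stage_order = TopologicalSorter(stage_dependencies).static_order()
    # Steps inside a stage form a linear chain, so list order is run order.
    return [step for stage in stage_order for step in stages[stage]]


order = execution_order(STAGES, STAGE_DEPENDENCIES)
print(order[0], "->", order[-1])
```

Keeping the dependencies as data rather than hard-coded call order would make it easier to add a stage later without touching the runner.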
## Part 1: Digital Patient Generation
### Implementation Status Overview
| Step | Status | Tool/Method | Input | Output | Location | Validation Data | Dependencies/Notes |
|------|--------|-------------|--------|---------|-----------|-----------------|-------------------|
| 1. Medical Records | ✓ | Synthea | - | Demographics, records | /Workspace/next/registry/tools/synthea | Target: 1000 patients/disease | - |
| 2. Disease Genome | ✓ | Omic-UKBB | Alleles, positions, frequencies | VCF (hg38) | Part of Synthea repo | - | Only storing variants |
| 3. Protein Variants | ✓ | vcf2prot | VCF | Protein fasta | Part of Synthea repo (tbc) | - | Multi-tissue support needed |
| 4. Transcriptome | 🚧 | borzoi | Genomic sequences | RNAseq (TPM) | - | ENCODE, GTEx (E-MTAB-6814) | NIH ENCODE standards |
| 5. Proteome | 🚧 | clei2block | RNAseq (log2-FC) | Fold-change | github.com/stasaki/clei2block | CellModelPassport, TCGA | Requires GTEx training |
| 6. Metabolome | ⏳ | corto | RNAseq (TPM) | Metabolite profiles | github.com/federicogiorgi/corto | CCLE, NCI-60 | - |
| 7. Immunome | ⏳ | Ecotyper | RNAseq | Cell type profiles | - | SPICA30, SPICA17 | - |
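Step 2 takes alleles, positions, and frequencies and stores only variant sites as VCF. The sketch below illustrates that shape of transformation. It is not the Omic-UKBB method: it assumes Hardy-Weinberg equilibrium and independent sites, and the function names and record layout are hypothetical. A complete VCF file would also need header lines, which are omitted here.

```python
import random


def sample_genotype(alt_freq, rng):
    """Draw a diploid genotype, treating the two alleles as independent draws."""
    alt_count = sum(rng.random() < alt_freq for _ in range(2))
    return {0: "0/0", 1: "0/1", 2: "1/1"}[alt_count]


def to_vcf_records(variants, rng):
    """Build tab-separated VCF body lines, skipping homozygous-reference sites."""
    records = []
    for chrom, pos, ref, alt, alt_freq in variants:
        genotype = sample_genotype(alt_freq, rng)
        if genotype == "0/0":
            continue  # only variant sites are stored
        fields = [chrom, str(pos), ".", ref, alt, ".", "PASS",
                  f"AF={alt_freq}", "GT", genotype]
        records.append("\t".join(fields))
    return records


# Illustrative input only: positions and frequencies are made up.
example_variants = [
    ("chr1", 100, "A", "G", 1.0),  # always homozygous alternate
    ("chr2", 200, "C", "T", 0.0),  # never observed, so never stored
]
records = to_vcf_records(example_variants, random.Random(0))
print(records)
```

Passing the random generator in explicitly keeps a synthetic patient reproducible from a seed.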
### Active Implementation Tasks
#### Transcriptome Generation
Current goal: Establish accurate transcriptome prediction pipeline
- [x] Implement and evaluate primary models:
* ~~enformer~~
* ~~basenji~~
* borzoi ~~for RNAseq profiles~~, built on basenji and enformer
- [x] Add SequenceModelBenchmark ridge regression - built into borzoi (tbc)
- [ ] Validate against ENCODE standards
- [ ] Implement GTEx validation pipeline
#### Multi-omic Integration
Current goal: Create robust data transformation pipeline
- [ ] Proteome prediction (clei2block):
* Implement GTEx training pipeline
* Add multi-tissue support
* Create validation framework against CellModelPassport
- [ ] Metabolome generation (corto):
* Setup CCLE data integration
* Implement NCI-60 validation
- [ ] Immunome profiling:
* ~~Evaluate Ecotyper vs CIBERSORTx~~ CIBERSORTx incorporated within Ecotyper
* Integrate SPICA datasets
* Setup immune cell validation pipeline
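The status table lists TPM as the transcriptome output and log2 fold-change as the proteome step's input, so some conversion has to sit between them. One possible form is sketched below. The README does not say how this is done, so the choice of a reference profile and a pseudocount of 1 are both assumptions, and the function name is hypothetical.

```python
import math


def log2_fold_change(case_tpm, reference_tpm, pseudocount=1.0):
    """Per-gene log2((case + p) / (reference + p)) for genes present in both profiles."""
    shared_genes = case_tpm.keys() & reference_tpm.keys()
    return {
        gene: math.log2((case_tpm[gene] + pseudocount)
                        / (reference_tpm[gene] + pseudocount))
        for gene in sorted(shared_genes)
    }


# Illustrative values only, not real expression data.
case = {"GENE_A": 3.0, "GENE_B": 0.0, "GENE_C": 5.0}
reference = {"GENE_A": 1.0, "GENE_B": 3.0}
fold_changes = log2_fold_change(case, reference)
print(fold_changes)
```

The pseudocount keeps genes with zero TPM from producing a division by zero or an infinite fold-change, at the cost of shrinking fold-changes for lowly expressed genes.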
## Part 2: Drug Discovery and Response
### Drug Development Tools
Current goal: Establish comprehensive drug analysis pipeline
- [ ] Molecule Processing:
* SELFIES library for biologics/peptides conversion
* Implement molecule validation checks
* Setup standardization pipeline
- [ ] Structure Analysis:
* DreamDock + ConPlex score pipeline
* LightDock for membrane binding
* Validation framework with crystal structures
### Binding Site Prediction
Current goal: Create consensus model for binding site prediction
- [ ] Benchmark tools:
* DiffDock implementation and testing
* Qvina2 evaluation
* P2Rank integration
* fpocket analysis
- [ ] Specific considerations:
* Allosteric site detection
* Multiple binding site handling
* Protein flexibility modeling
- [ ] Validation:
* BindingDB integration
* Crystal structure comparison pipeline
* Edge case testing suite
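One way a consensus model over several binding site tools might look is sketched below. It assumes each tool's output has already been reduced to a pocket center in a shared coordinate frame, which is a real preprocessing step for docking tools that return ligand poses. The `Pocket` type, the 4 Å radius, and the two-tool threshold are all hypothetical choices, not settings taken from this project.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pocket:
    tool: str
    center: tuple  # (x, y, z) in angstroms


def centroid(pockets):
    """Mean position of a group of pocket centers."""
    n = len(pockets)
    return tuple(sum(p.center[axis] for p in pockets) / n for axis in range(3))


def consensus_pockets(pockets, radius=4.0, min_tools=2):
    """Greedily cluster centers, then keep clusters backed by enough distinct tools."""
    clusters = []
    for pocket in pockets:
        for cluster in clusters:
            if math.dist(pocket.center, centroid(cluster)) <= radius:
                cluster.append(pocket)
                break
        else:
            clusters.append([pocket])  # no nearby cluster, start a new one

    results = []
    for cluster in clusters:
        tools = sorted({p.tool for p in cluster})
        if len(tools) >= min_tools:
            results.append((centroid(cluster), tools))
    return results


# Illustrative coordinates only.
predictions = [
    Pocket("P2Rank", (0.0, 0.0, 0.0)),
    Pocket("fpocket", (1.0, 0.0, 0.0)),
    Pocket("DiffDock", (20.0, 0.0, 0.0)),
]
consensus = consensus_pockets(predictions)
print(consensus)
```

Greedy clustering depends on input order, so a production version would likely use a proper clustering method. Note that multiple sites per protein fall out naturally as separate clusters, while allosteric sites supported by only one tool would be dropped by this threshold.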
### Drug-Target Analysis
Current goal: Robust docking and interaction prediction
- [ ] Primary docking pipeline:
* Uni-Mol integration
* DreamDock implementation
* Path4Drug integration for pathways
- [ ] Molecule type-specific handling:
* Small molecule pipeline
* Biologics pathway
* PROTACs specific analysis
* Prodrug processing
- [ ] Interaction analysis:
* Agonist vs antagonist classification
* Protein-protein interaction integration
* Chemical_checker for bioactivity signatures
### Chemical Property Prediction
Current goal: Comprehensive property prediction system
- [ ] Model implementation:
* Chemprop evaluation
* SolTranNet integration
* Custom ADMET model development
- [ ] Property coverage:
* Solubility prediction
* BBB penetration
* Chemical stability
* Metabolic processing
### Toxicity Prediction Pipeline
Current goal: Multi-faceted toxicity assessment system
- [ ] Core modules:
* Cardiotoxicity (ion channel) prediction
* Hepatotoxicity (Phase 1/2 proteins)
* Nephrotoxicity assessment
* Lung toxicity prediction
* Neurotoxicity (BBB criteria)
* Inflammatory response modeling
* Bleeding/clotting risk analysis
- [ ] Integration components:
* Human Protein Atlas tissue proportion estimation
* Reactome pathway analysis
* Industry model benchmarking
### Drug Response Analysis
Current goal: Integrated response prediction system
- [ ] Transcriptomic response:
* LINCS data integration
* Expression change prediction
* Tissue-specific effects
- [ ] Multi-omic response:
* Proteomic change modeling
* Metabolomic adjustment prediction
* Immune response profiling
- [ ] Special cases:
* Multi-drug combinations
* Time-dependent effects
* Population-specific responses
## Critical Dependencies & Requirements
| Category | Component | Status | Notes |
|----------|-----------|---------|--------|
| **External Data** | BindingDB | ✓ Available | Binding affinities |
| | LINCS | ✓ Available | Compound effects |
| | PharmGKB | ⏳ Pending | Variant annotations |
| | Human Cell Atlas | ⏳ Pending | Tissue-specific data |
| **Compute** | GPU Cluster | 🚧 Scaling | For enformer/basenji |
| | Storage | ✓ Configured | For variant data |
| | Distribution | ⏳ Planned | For processing |
## Validation Framework
| Dataset | Usage | Status | Notes |
|---------|--------|---------|--------|
| ENCODE | Transcriptomics | ✓ Ready | Primary validation |
| GTEx | Tissue-specific | ✓ Ready | E-MTAB-6814 |
| CCLE/GDSC2 | Cell lines | 🚧 In Progress | Cancer validation |
| TDC | ADMET | ⏳ Planned | Benchmark data |
| Cross-species | Conservation | ⏳ Planned | Evolutionary validation |
| Time-series | Metabolomics | ⏳ Planned | Kinetic validation |
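The README does not name the metric used to compare predicted and observed profiles against these datasets. As one plausible choice, the sketch below computes a Spearman rank correlation with the standard library only. The metric, the function names, and the example values are assumptions for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


def spearman(predicted, observed):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Illustrative TPM-like values only.
predicted_tpm = [1.0, 5.0, 20.0, 100.0]
observed_tpm = [2.0, 4.0, 35.0, 80.0]
print(spearman(predicted_tpm, observed_tpm))
```

A rank-based metric is insensitive to the heavy right tail of expression values, which is one reason it is often preferred over Pearson on raw TPM.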
## Edge Cases & Special Considerations
### Complex Scenarios
| Scenario | Implementation Status | Handling Strategy |
|----------|---------------------|-------------------|
| Rare variants | 🚧 In Progress | Population frequency weighting |
| Multi-drug combinations | ⏳ Planned | Interaction matrix modeling |
| Time-dependent effects | ⏳ Planned | PK/PD time series modeling |
| Population specificity | 🚧 In Progress | Demographic stratification |
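For the planned PK/PD time series modeling, the simplest starting point is a one-compartment model with first-order elimination after an intravenous bolus, where C(t) = (dose / V) * exp(-k * t) and k = ln(2) / half-life. The sketch below is that textbook model only; the README does not state which PK model the pipeline will use, and the parameter values are illustrative, not data for any real drug.

```python
import math


def elimination_rate(half_life_h):
    """First-order elimination rate constant (1/h) from the half-life."""
    return math.log(2) / half_life_h


def concentration(dose_mg, volume_l, half_life_h, t_h):
    """Plasma concentration (mg/L) at time t_h hours after an IV bolus."""
    k = elimination_rate(half_life_h)
    return (dose_mg / volume_l) * math.exp(-k * t_h)


def time_series(dose_mg, volume_l, half_life_h, times_h):
    """Concentration at each requested time point, as (time, value) pairs."""
    return [(t, concentration(dose_mg, volume_l, half_life_h, t)) for t in times_h]


# Illustrative parameters: 100 mg dose, 50 L volume of distribution, 4 h half-life.
curve = time_series(100.0, 50.0, 4.0, [0, 4, 8, 12])
for t, c in curve:
    print(f"t={t:>2} h  C={c:.3f} mg/L")
```

Oral dosing, multiple compartments, and a PD link from concentration to effect would each extend this model rather than replace it.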
### Special Drug Classes
| Class | Special Requirements | Status |
|-------|---------------------|---------|
| Biologics | Membrane modeling, immunogenicity | ⏳ Planned |
| Prodrugs | Metabolite prediction, activation | 🚧 In Progress |
| Combination therapy | Interaction prediction, timing | ⏳ Planned |
| PROTACs | Protein degradation modeling | ⏳ Planned |
## Case Studies & Validation Examples
| Drug | Outcome | Learning Points | Implementation Status |
|------|---------|----------------|----------------------|
| Amcenestrant | Efficacy failure | Target validation importance | ✓ Integrated |
| Flupirtine | Liver toxicity | Metabolite prediction crucial | 🚧 In Progress |
| Ranitidine | NDMA formation | Chemical stability prediction | ⏳ Planned |
| Multi-drug Examples | Variable | Interaction modeling needed | ⏳ Planned |