Initial commit: digital-patients pipeline (clean, no large files)
Large reference/model files excluded from repo - to be staged to S3 or baked into Docker images.
210
README.md
Normal file
@@ -0,0 +1,210 @@
# Digital Patient and Drug Response Pipeline - Comprehensive Implementation Plan

## Pipeline Overview
```mermaid
flowchart TB
    subgraph Patient["Patient Profile Generation"]
        A1[1. Medical Records Creation] --> A2[2. Disease-specific Genome]
        A2 --> A3[3. Disease-specific Protein Variants]
        A3 --> A4[4. Disease-specific Transcriptome]
        A4 --> A5[5. Disease-specific Proteome]
        A5 --> A6[6. Disease-specific Metabolome]
        A6 --> A7[7. Disease-specific Immunome]
    end

    subgraph Drug["Drug Analysis & Modeling"]
        B1[8. Drug-Target PK] --> B1a[8a. Binding Site Prediction]
        B1a --> B1b[8b. Drug-Target Docking]
        B1b --> B2[9. Drug-Proteome Screening]
        B2 --> B3[10. Off-target Analysis]
        B3 --> B4[11. Drug-Compound Screening]
        B4 --> B5[12. Drug-Genome Sensitivity]
    end

    subgraph Response["Response Prediction"]
        C1[13. Transcriptomic Changes] --> C2[14. Disease Stage Evaluation]
        C2 --> C3[15-16. Proteomic & Metabolomic Changes]
        C3 --> C4[17-19. Biological & Immune Response]
        C4 --> C5[20-21. ADMET & Toxicity]
    end

    Patient --> Drug
    Drug --> Response
```
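
The flowchart runs its three stage groups strictly in sequence. As a minimal sketch of how that ordering could be captured as plain data, the block below flattens the groups into one execution list. The names `STAGES`, `GROUP_ORDER`, and `run_order` are hypothetical and do not come from the repository; only the step labels are copied from the flowchart.

```python
# Hypothetical sketch: step labels are copied from the flowchart,
# but the data layout and helper name are illustrative assumptions.
STAGES = {
    "Patient": [
        "1. Medical Records Creation",
        "2. Disease-specific Genome",
        "3. Disease-specific Protein Variants",
        "4. Disease-specific Transcriptome",
        "5. Disease-specific Proteome",
        "6. Disease-specific Metabolome",
        "7. Disease-specific Immunome",
    ],
    "Drug": [
        "8. Drug-Target PK",
        "8a. Binding Site Prediction",
        "8b. Drug-Target Docking",
        "9. Drug-Proteome Screening",
        "10. Off-target Analysis",
        "11. Drug-Compound Screening",
        "12. Drug-Genome Sensitivity",
    ],
    "Response": [
        "13. Transcriptomic Changes",
        "14. Disease Stage Evaluation",
        "15-16. Proteomic & Metabolomic Changes",
        "17-19. Biological & Immune Response",
        "20-21. ADMET & Toxicity",
    ],
}

# Group order follows the edges Patient --> Drug --> Response.
GROUP_ORDER = ["Patient", "Drug", "Response"]


def run_order(stages=STAGES, group_order=GROUP_ORDER):
    """Flatten the stage groups into the linear order shown in the flowchart."""
    return [step for group in group_order for step in stages[group]]


for step in run_order():
    print(step)
```

Keeping the order in data rather than in control flow makes it easy to skip or rerun a single group while the later steps are still being implemented.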

## Part 1: Digital Patient Generation

### Implementation Status Overview

Status key: ✓ done, 🚧 in progress, ⏳ planned.

| Step | Status | Tool/Method | Input | Output | Location | Validation Data | Dependencies/Notes |
|------|--------|-------------|--------|---------|-----------|-----------------|-------------------|
| 1. Medical Records | ✓ | Synthea | - | Demographics, records | /Workspace/next/registry/tools/synthea | Target: 1000 patients/disease | - |
| 2. Disease Genome | ✓ | Omic-UKBB | Alleles, positions, frequencies | VCF (hg38) | Part of Synthea repo | - | Only storing variants |
| 3. Protein Variants | ✓ | vcf2prot | VCF | Protein fasta | Part of Synthea repo (tbc) | - | Multi-tissue support needed |
| 4. Transcriptome | 🚧 | borzoi | Genomic sequences | RNAseq (TPM) | - | ENCODE, GTEx (E-MTAB-6814) | NIH ENCODE standards |
| 5. Proteome | 🚧 | clei2block | RNAseq (log2-FC) | Fold-change | github.com/stasaki/clei2block | CellModelPassport, TCGA | Requires GTEx training |
| 6. Metabolome | ⏳ | corto | RNAseq (TPM) | Metabolite profiles | github.com/federicogiorgi/corto | CCLE, NCI-60 | - |
| 7. Immunome | ⏳ | Ecotyper | RNAseq | Cell type profiles | - | SPICA30, SPICA17 | - |
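
The table passes expression between steps in two units: TPM (steps 4 and 6) and log2 fold-change (step 5). As a minimal sketch of both conversions using their standard definitions, the block below is one way to compute them. The function names and the pseudocount of 1.0 are assumptions made for illustration, not values taken from the pipeline.

```python
import math


def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per kilobase: normalise each gene by its length first.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Scale so the values sum to one million.
    return [r / total * 1e6 for r in rpk]


def log2_fold_change(case, control, pseudocount=1.0):
    """Per-gene log2 fold-change; the pseudocount (assumed) avoids log(0)."""
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(case, control)
    ]


tpm = counts_to_tpm([10, 20, 30], [1000, 2000, 3000])
print(round(sum(tpm)))                    # TPM values sum to one million
print(log2_fold_change([3.0], [1.0]))     # (3 + 1) / (1 + 1) = 2, so log2 = 1.0
```

Because TPM is normalised within a sample, fold-changes should be computed between samples processed the same way; the pseudocount choice matters most for lowly expressed genes.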

### Active Implementation Tasks

#### Transcriptome Generation
Current goal: Establish accurate transcriptome prediction pipeline
- [x] Implement and evaluate primary models:
  * ~~enformer~~
  * ~~basenji~~
  * borzoi ~~for RNAseq profiles~~, built on basenji and enformer
- [x] Add SequenceModelBenchmark ridge regression - built into borzoi (tbc)
- [ ] Validate against ENCODE standards
- [ ] Implement GTEx validation pipeline
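
The two open validation tasks above do not name a metric. A common choice, assumed here rather than taken from the plan, is the Pearson correlation between predicted and observed expression after a log transform. The sketch below implements that with the standard library only; `log_tpm_correlation` and its pseudocount are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log_tpm_correlation(predicted_tpm, observed_tpm, pseudocount=1.0):
    """Correlate log2(TPM + pseudocount) of predicted vs observed expression."""
    log_pred = [math.log2(v + pseudocount) for v in predicted_tpm]
    log_obs = [math.log2(v + pseudocount) for v in observed_tpm]
    return pearson(log_pred, log_obs)


# Toy values only; real input would be per-gene TPM for one tissue or sample.
predicted = [1.0, 3.0, 7.0, 15.0]
observed = [1.0, 3.0, 7.0, 15.0]
print(log_tpm_correlation(predicted, observed))
```

The log transform keeps a handful of highly expressed genes from dominating the score; a rank-based correlation would be a reasonable alternative if the predicted scale is not calibrated.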

#### Multi-omic Integration
Current goal: Create robust data transformation pipeline
- [ ] Proteome prediction (clei2block):
  * Implement GTEx training pipeline
  * Add multi-tissue support
  * Create validation framework against CellModelPassport
- [ ] Metabolome generation (corto):
  * Set up CCLE data integration
  * Implement NCI-60 validation
- [ ] Immunome profiling:
  * ~~Evaluate Ecotyper vs CIBERSORTx~~ CIBERSORTx incorporated within Ecotyper
  * Integrate SPICA datasets
  * Set up immune cell validation pipeline

## Part 2: Drug Discovery and Response

### Drug Development Tools
Current goal: Establish comprehensive drug analysis pipeline
- [ ] Molecule Processing:
  * SELFIES library for biologics/peptides conversion
  * Implement molecule validation checks
  * Set up standardization pipeline
- [ ] Structure Analysis:
  * DreamDock + ConPlex score pipeline
  * LightDock for membrane binding
  * Validation framework with crystal structures

### Binding Site Prediction
Current goal: Create consensus model for binding site prediction
- [ ] Benchmark tools:
  * DiffDock implementation and testing
  * Qvina2 evaluation
  * P2Rank integration
  * FPocket analysis
- [ ] Specific considerations:
  * Allosteric site detection
  * Multiple binding site handling
  * Protein flexibility modeling
- [ ] Validation:
  * BindingDB integration
  * Crystal structure comparison pipeline
  * Edge case testing suite
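
The stated goal is a consensus model across several tools, but the plan does not say how predictions are combined. One simple possibility, offered as an assumption, is to cluster predicted site centers and keep clusters supported by more than one tool. In the sketch below the tool names are only dictionary keys; no tool API is called, and `consensus_sites`, the 4 Å radius, and the two-tool threshold are all hypothetical. Docking tools output ligand poses rather than pockets, so their pose centroids would have to be computed beforehand.

```python
import math


def consensus_sites(predictions, radius=4.0, min_tools=2):
    """Greedy clustering of predicted site centers.

    predictions: {tool_name: [(x, y, z), ...]} with coordinates in angstroms.
    Returns (center, supporting_tools) for clusters backed by >= min_tools tools.
    """
    clusters = []
    for tool, centers in predictions.items():
        for point in centers:
            for cluster in clusters:
                if math.dist(point, cluster["center"]) <= radius:
                    cluster["points"].append(point)
                    cluster["tools"].add(tool)
                    # Recompute the cluster center as the mean of its members.
                    n = len(cluster["points"])
                    cluster["center"] = tuple(
                        sum(p[i] for p in cluster["points"]) / n for i in range(3)
                    )
                    break
            else:
                # No existing cluster is close enough: start a new one.
                clusters.append(
                    {"center": tuple(point), "points": [point], "tools": {tool}}
                )
    return [
        (c["center"], sorted(c["tools"]))
        for c in clusters
        if len(c["tools"]) >= min_tools
    ]


# Toy coordinates for illustration only.
predictions = {
    "P2Rank": [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)],
    "FPocket": [(1.0, 0.0, 0.0)],
    "DiffDock": [(0.0, 1.0, 0.0)],
}
for center, tools in consensus_sites(predictions):
    print(center, tools)
```

Greedy clustering depends on input order, so a production version would more likely use a proper clustering method and weight each tool by its benchmark performance. Sites found by a single tool are dropped here, which may hide allosteric pockets.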

### Drug-Target Analysis
Current goal: Robust docking and interaction prediction
- [ ] Primary docking pipeline:
  * Uni-Mol integration
  * DreamDock implementation
  * Path4Drug integration for pathways
- [ ] Molecule type-specific handling:
  * Small molecule pipeline
  * Biologics pathway
  * PROTAC-specific analysis
  * Prodrug processing
- [ ] Interaction analysis:
  * Agonist vs antagonist classification
  * Protein-protein interaction integration
  * Chemical_checker for bioactivity signatures

### Chemical Property Prediction
Current goal: Comprehensive property prediction system
- [ ] Model implementation:
  * Chemprop evaluation
  * SolTranNet integration
  * Custom ADMET model development
- [ ] Property coverage:
  * Solubility prediction
  * BBB penetration
  * Chemical stability
  * Metabolic processing

### Toxicity Prediction Pipeline
Current goal: Multi-faceted toxicity assessment system
- [ ] Core modules:
  * Cardiotoxicity (ion channel) prediction
  * Hepatotoxicity (Phase 1/2 proteins)
  * Nephrotoxicity assessment
  * Lung toxicity prediction
  * Neurotoxicity (BBB criteria)
  * Inflammatory response modeling
  * Bleeding/clotting risk analysis
- [ ] Integration components:
  * Human Protein Atlas tissue proportion estimation
  * Reactome pathway analysis
  * Industry model benchmarking

### Drug Response Analysis
Current goal: Integrated response prediction system
- [ ] Transcriptomic response:
  * LINCS data integration
  * Expression change prediction
  * Tissue-specific effects
- [ ] Multi-omic response:
  * Proteomic change modeling
  * Metabolomic adjustment prediction
  * Immune response profiling
- [ ] Special cases:
  * Multi-drug combinations
  * Time-dependent effects
  * Population-specific responses

## Critical Dependencies & Requirements

| Category | Component | Status | Notes |
|----------|-----------|---------|--------|
| **External Data** | BindingDB | ✓ Available | Binding affinities |
| | LINCS | ✓ Available | Compound effects |
| | PharmGKB | ⏳ Pending | Variant annotations |
| | Human Cell Atlas | ⏳ Pending | Tissue-specific data |
| **Compute** | GPU Cluster | 🚧 Scaling | For enformer/basenji |
| | Storage | ✓ Configured | For variant data |
| | Distribution | ⏳ Planned | For processing |

## Validation Framework

| Dataset | Usage | Status | Notes |
|---------|--------|---------|--------|
| ENCODE | Transcriptomics | ✓ Ready | Primary validation |
| GTEx | Tissue-specific | ✓ Ready | E-MTAB-6814 |
| CCLE/GDSC2 | Cell lines | 🚧 In Progress | Cancer validation |
| TDC | ADMET | ⏳ Planned | Benchmark data |
| Cross-species | Conservation | ⏳ Planned | Evolutionary validation |
| Time-series | Metabolomics | ⏳ Planned | Kinetic validation |

## Edge Cases & Special Considerations

### Complex Scenarios
| Scenario | Implementation Status | Handling Strategy |
|----------|---------------------|-------------------|
| Rare variants | 🚧 In Progress | Population frequency weighting |
| Multi-drug combinations | ⏳ Planned | Interaction matrix modeling |
| Time-dependent effects | ⏳ Planned | PK/PD time series modeling |
| Population specificity | 🚧 In Progress | Demographic stratification |
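
The table names "PK/PD time series modeling" for time-dependent effects without fixing a model. As a minimal sketch, assuming the simplest textbook case (one compartment, intravenous bolus, first-order elimination), plasma concentration follows C(t) = (dose / V) * exp(-(CL / V) * t). The function names and the example parameters below are illustrative assumptions, not pipeline values.

```python
import math


def concentration_iv_bolus(dose_mg, volume_l, clearance_l_per_h, t_h):
    """Plasma concentration (mg/L) at time t_h for a one-compartment IV bolus."""
    k_elim = clearance_l_per_h / volume_l  # first-order elimination rate (1/h)
    return (dose_mg / volume_l) * math.exp(-k_elim * t_h)


def half_life_h(volume_l, clearance_l_per_h):
    """Elimination half-life in hours: ln(2) * V / CL."""
    return math.log(2) * volume_l / clearance_l_per_h


# Illustrative parameters only: 100 mg dose, V = 10 L, CL = 1 L/h.
t_half = half_life_h(10.0, 1.0)
for t in (0.0, t_half, 2 * t_half):
    print(round(t, 2), round(concentration_iv_bolus(100.0, 10.0, 1.0, t), 3))
```

Sampling this curve on a time grid gives the concentration series that a PD or expression-response model could consume; oral dosing, multiple compartments, or repeated doses would each need a richer model.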

### Special Drug Classes
| Class | Special Requirements | Status |
|-------|---------------------|---------|
| Biologics | Membrane modeling, immunogenicity | ⏳ Planned |
| Prodrugs | Metabolite prediction, activation | 🚧 In Progress |
| Combination therapy | Interaction prediction, timing | ⏳ Planned |
| PROTACs | Protein degradation modeling | ⏳ Planned |

## Case Studies & Validation Examples

| Drug | Outcome | Learning Points | Implementation Status |
|------|---------|----------------|----------------------|
| Amcenestrant | Efficacy failure | Target validation importance | ✓ Integrated |
| Flupirtine | Liver toxicity | Metabolite prediction crucial | 🚧 In Progress |
| Ranitidine | NDMA formation | Chemical stability prediction | ⏳ Planned |
| Multi-drug Examples | Variable | Interaction modeling needed | ⏳ Planned |