# Digital Patient Pipeline: Complete Documentation
## Table of Contents
1. [Introduction](#introduction)
2. [Pipeline Overview](#pipeline-overview)
3. [Biological Background](#biological-background)
4. [Pipeline Components](#pipeline-components)
5. [Workflow Execution](#workflow-execution)
6. [Technical Architecture](#technical-architecture)
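7. [Outputs and Applications](#outputs-and-applications)
8. [Conclusion](#conclusion)
9. [Glossary of Terms](#glossary-of-terms)
10. [References](#references)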
---
## Introduction
This document describes the **Digital Patient Pipeline**, a bioinformatics workflow that generates synthetic patient data and predicts several layers of molecular biology from genomic variants. The pipeline is implemented in **Nextflow**, a workflow orchestration system, and integrates several computational biology tools to simulate how genetic mutations affect gene expression, protein abundance, immune cell composition, and metabolic activity.
### Purpose
The Digital Patient Pipeline serves several critical purposes:
- **Synthetic Patient Generation**: Creates realistic but synthetic patient profiles with genetic variants associated with specific diseases
- **Multi-Omic Prediction**: Predicts gene expression (RNA), protein abundance, immune cell composition, and metabolic activity from DNA sequence alone
- **Clinical Research**: Enables researchers to study disease mechanisms without accessing sensitive patient data
- **Personalized Medicine**: Models how individual genetic variants affect molecular phenotypes
---
## Pipeline Overview
```mermaid
flowchart TD
Start([Start Pipeline]) --> Decision{Patient Type?}
Decision -->|Disease| Synthea[Synthea: Generate Disease Patients]
Decision -->|Healthy| Healthy[Synthea: Generate Healthy Patients]
Synthea --> VCF[VCF Files: Genetic Variants]
Healthy --> VCF
VCF --> FilterVCF[Filter VCF: Extract Coding Variants]
VCF --> VCF2Prot[VCF2Prot: Generate Mutated Proteins]
FilterVCF --> Borzoi[Borzoi: Predict RNA Expression]
Borzoi --> RNA2Prot[RNA2Protein: Predict Protein Expression]
Borzoi --> CORTO[CORTO: Predict Metabolome]
Borzoi --> CIBERSORTx[CIBERSORTx: Predict Immune Cells]
RNA2Prot --> Output1[Protein Expression Profiles]
CORTO --> Output2[Metabolic Activity Profiles]
CIBERSORTx --> Output3[Immune Cell Composition]
VCF2Prot --> Output4[Mutated Protein Sequences]
Output1 --> End([Complete Digital Patient])
Output2 --> End
Output3 --> End
Output4 --> End
style Start fill:#90EE90
style End fill:#FFB6C1
style Borzoi fill:#87CEEB
style Synthea fill:#DDA0DD
```
### Pipeline Flow Summary
1. **Patient Generation**: Synthea generates synthetic patients with realistic genetic variants
2. **Variant Processing**: VCF files containing genetic mutations are filtered and processed
3. **RNA Expression Prediction**: Borzoi predicts how mutations affect gene expression
4. **Downstream Analysis**: Multiple tools analyze predicted RNA to generate comprehensive molecular profiles
5. **Integration**: Results are combined to create a complete "digital patient"
---
## Biological Background
To understand this pipeline, we need to understand the central dogma of molecular biology and how genetic information flows through biological systems.
### The Central Dogma: DNA → RNA → Protein
```mermaid
flowchart LR
DNA[DNA: Genetic Code] -->|Transcription| RNA[RNA: Message]
RNA -->|Translation| Protein[Protein: Function]
Protein --> Phenotype[Cellular Phenotype]
style DNA fill:#FFE4B5
style RNA fill:#E0FFFF
style Protein fill:#FFE4E1
style Phenotype fill:#F0E68C
```
#### 1. DNA (Deoxyribonucleic Acid)
**DNA** is the blueprint of life, containing the genetic instructions for all cellular functions. DNA consists of:
- **Four nucleotide bases**: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
- **Double helix structure**: Two complementary strands wound together
- **Genes**: Specific segments of DNA that encode instructions for proteins
**Example DNA Sequence**: `ATGCGATCCGTA`
#### 2. RNA (Ribonucleic Acid)
During **transcription**, DNA is copied into RNA:
- **RNA polymerase** enzyme reads DNA and creates a complementary RNA strand
- RNA uses **Uracil (U)** instead of Thymine (T)
- The RNA carries the genetic message from the nucleus to protein-making machinery
**Example RNA Sequence**: `AUGCGAUCCGUA` (from DNA above)
#### 3. Proteins
During **translation**, RNA is decoded to build proteins:
- **Ribosomes** read RNA in groups of three bases called **codons**
- Each codon specifies one **amino acid**
- Amino acids chain together to form proteins
**Example**: `AUG` → Methionine, `CGA` → Arginine, `UCC` → Serine, `GUA` → Valine
**Protein**: Methionine-Arginine-Serine-Valine
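The transcription and translation steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the pipeline: the codon table holds only the four codons from the example, and the function names are chosen here for clarity.

```python
# Minimal codon table covering only the codons in the example above;
# a real translator needs all 64 codons.
CODON_TABLE = {
    "AUG": "Methionine",
    "CGA": "Arginine",
    "UCC": "Serine",
    "GUA": "Valine",
}


def transcribe(dna: str) -> str:
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> list[str]:
    """Read the RNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


rna = transcribe("ATGCGATCCGTA")
protein = translate(rna)
print(rna)                 # AUGCGAUCCGUA
print("-".join(protein))   # Methionine-Arginine-Serine-Valine
```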
### Gene Structure
Genes in eukaryotes (organisms with nuclei, like humans) have a complex structure:
```mermaid
flowchart LR
subgraph Gene Structure
Promoter[Promoter: -1000bp] --> UTR5[5' UTR]
UTR5 --> Exon1[Exon 1]
Exon1 --> Intron1[Intron 1]
Intron1 --> Exon2[Exon 2]
Exon2 --> Intron2[Intron 2]
Intron2 --> Exon3[Exon 3]
Exon3 --> UTR3[3' UTR]
end
Exon1 -.->|Splicing| mRNA[Mature mRNA]
Exon2 -.-> mRNA
Exon3 -.-> mRNA
style Exon1 fill:#90EE90
style Exon2 fill:#90EE90
style Exon3 fill:#90EE90
style Intron1 fill:#FFB6C1
style Intron2 fill:#FFB6C1
style Promoter fill:#FFD700
```
**Key Components:**
- **Promoter**: Regulatory region upstream of gene (-1000 bp) that controls when the gene is turned on
- **5' UTR (Untranslated Region)**: Beginning of RNA that isn't translated into protein
- **Exons**: Segments that ARE kept in the final RNA; their coding portions specify the protein sequence (the UTRs are also part of exons)
- **Introns**: Segments that are REMOVED during RNA processing
- **3' UTR**: End region of RNA that isn't translated
**Why this matters**: The Borzoi model in this pipeline predicts which parts of genes will be transcribed into RNA, including both exons and introns, before they're processed.
### Genetic Variants and Their Effects
```mermaid
flowchart TD
Variant[Genetic Variant] --> Type{Type?}
Type -->|SNP| SNP["Single Nucleotide<br/>Polymorphism:<br/>A->G"]
Type -->|Insertion| INS["Insertion:<br/>ATCG->ATCGGG"]
Type -->|Deletion| DEL["Deletion:<br/>ATCG->A"]
SNP --> Effect1{Location?}
INS --> Effect1
DEL --> Effect1
Effect1 -->|Promoter| E1[Changes expression level]
Effect1 -->|Exon| E2[Changes protein sequence]
Effect1 -->|Splice site| E3[Changes splicing pattern]
Effect1 -->|Intron| E4[May affect regulation]
style Variant fill:#FFB6C1
style E1 fill:#87CEEB
style E2 fill:#87CEEB
style E3 fill:#87CEEB
style E4 fill:#87CEEB
```
**Variants** are differences in DNA sequence between individuals:
- **SNP (Single Nucleotide Polymorphism)**: Single base change (e.g., A→G)
- **Insertion**: Extra bases added
- **Deletion**: Bases removed
- **Structural Variant**: Large-scale DNA rearrangements
These variants can affect:
- **Gene expression**: How much RNA is made
- **Protein sequence**: Which amino acids are in the protein
- **Splicing**: Which exons are included in mature RNA
### VCF Format: Storing Genetic Variants
The **VCF (Variant Call Format)** is a standardized text file format that stores genetic variants:
```
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 12345 rs123 A G 100 PASS DP=50
chr2 67890 rs456 TC T 95 PASS DP=45
```
**Columns explained:**
- **CHROM**: Chromosome where variant is located (chr1, chr2, etc.)
- **POS**: Position on the chromosome (base pair number)
- **ID**: Database identifier (often from dbSNP database)
- **REF**: Reference base(s) at this position
- **ALT**: Alternate base(s) - the variant
- **QUAL**: Quality score for the variant call
- **FILTER**: Whether variant passed quality filters
- **INFO**: Additional information (e.g., read depth)
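As an illustration of how these columns are used, the sketch below parses the two example records and classifies each variant by comparing the REF and ALT lengths. The `Variant` class and `parse_vcf_line` function are hypothetical names introduced here; the sketch ignores header lines and multi-allelic ALT fields, which a production parser would need to handle.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str

    @property
    def kind(self) -> str:
        """Classify by comparing REF and ALT lengths."""
        if len(self.ref) == 1 and len(self.alt) == 1:
            return "SNP"
        if len(self.alt) > len(self.ref):
            return "insertion"
        if len(self.alt) < len(self.ref):
            return "deletion"
        return "other"  # e.g. a multi-base substitution


def parse_vcf_line(line: str) -> Variant:
    """Parse one data line. Real VCF files are tab-delimited; split()
    also accepts the space-separated layout shown above."""
    chrom, pos, _id, ref, alt = line.split()[:5]
    return Variant(chrom, int(pos), ref, alt)


records = [
    "chr1 12345 rs123 A G 100 PASS DP=50",
    "chr2 67890 rs456 TC T 95 PASS DP=45",
]
for record in records:
    v = parse_vcf_line(record)
    print(v.chrom, v.pos, v.kind)
```

The first record is classified as a SNP and the second as a deletion, because `TC` is longer than `T`.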
---
## Pipeline Components
### 1. Synthea: Synthetic Patient Generator
```mermaid
flowchart LR
Input[Input Parameters] --> Synthea{Synthea Engine}
subgraph Input Parameters
Disease[Disease Type:<br/>schizophrenia, cancer, etc.]
Demographics[Demographics:<br/>Age, Gender, Location]
N[Number of Patients]
end
Synthea --> UKBB[UK Biobank<br/>Genetic Database]
UKBB --> Disease_Variants[Disease-Associated<br/>Genetic Variants]
Disease_Variants --> VCF_Out[VCF Files<br/>Patient_001.vcf<br/>Patient_002.vcf]
style Synthea fill:#DDA0DD
style VCF_Out fill:#90EE90
```
**What is Synthea?**
Synthea™ is an open-source synthetic patient generator that creates realistic but completely fake patient data. It models:
- Medical history
- Demographics (age, gender, location)
- Health conditions
- Medications and treatments
- Genetic variants
**How it works in this pipeline:**
1. **User specifies disease**: For example, "schizophrenia" or "healthy"
2. **Statistical analysis**: Synthea analyzes the UK Biobank database (a large collection of genetic data from ~500,000 individuals) to find variants statistically associated with the disease
3. **Probability-based sampling**: Variants are selected based on their frequency in diseased vs. healthy populations
4. **VCF generation**: Creates a VCF file for each synthetic patient containing their unique set of genetic variants
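The probability-based sampling in step 3 might look roughly like the following. This is a sketch under stated assumptions, not the pipeline's actual code: each variant is assumed to be drawn independently with probability equal to its frequency in the target population, and the variant IDs and frequencies are invented for illustration.

```python
import random


def sample_variants(frequencies: dict[str, float], rng: random.Random) -> list[str]:
    """Include each variant independently with probability equal to its
    frequency in the chosen population (an illustrative assumption)."""
    return [vid for vid, freq in frequencies.items() if rng.random() < freq]


# Hypothetical allele frequencies among cases; not real UK Biobank values.
disease_freqs = {"rs_always": 1.0, "rs_common": 0.6, "rs_rare": 0.05, "rs_never": 0.0}

rng = random.Random(42)  # fixed seed so the cohort is reproducible
cohort = {f"Patient_{i:03d}": sample_variants(disease_freqs, rng) for i in range(1, 11)}
for patient, variants in cohort.items():
    print(patient, variants)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which matters when a synthetic cohort has to be regenerated.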
**Parameters:**
```nextflow
params.disease = 'schizophrenia' // Disease to model
params.n_pat = 10 // Number of patients to generate
params.percent_male = 0.5 // Gender distribution
```
**For Healthy Patients:**
If generating healthy controls, Synthea samples from pre-computed reference genomes representing the genetic diversity of healthy populations.
**Output Example:**
```
Patient_001_variants.vcf
Patient_002_variants.vcf
...
Patient_010_variants.vcf
```
### 2. Borzoi: RNA-seq Prediction from DNA
```mermaid
flowchart TD
DNA[DNA Sequence<br/>524,288 bp] --> Borzoi[Borzoi Neural Network]
subgraph Borzoi Architecture
Conv1[Convolutional Layers:<br/>Learn local patterns]
Conv1 --> Attention[Self-Attention Layers:<br/>Learn long-range interactions]
Attention --> Upsample[Upsampling Layers:<br/>Increase resolution]
Upsample --> Output_Layer[Output: RNA Coverage<br/>32 bp resolution]
end
Borzoi --> Tissues[Predictions for 89 Tissues/Cell Types]
Tissues --> TPM[TPM Values:<br/>Transcripts Per Million]
style Borzoi fill:#87CEEB
style TPM fill:#90EE90
```
**What is Borzoi?**
Borzoi is a deep learning model developed at Calico Life Sciences that predicts **RNA-seq coverage** (how much RNA is produced from each part of the genome) directly from DNA sequence. This is useful because it:
- Predicts gene expression without actually doing wet-lab RNA sequencing
- Accounts for multiple layers of regulation (transcription, splicing, polyadenylation)
- Provides tissue-specific predictions
**How does it work?**
1. **Input**: 524,288 base pairs of DNA sequence (524 kb)
2. **Neural Network Processing**:
- **Convolutional layers**: Learn local DNA patterns (e.g., transcription factor binding sites)
- **Self-attention layers**: Learn long-range interactions between regulatory elements
- **Upsampling layers**: Increase resolution from 128 bp to 32 bp
3. **Output**: RNA coverage at 32 bp resolution for 89 different tissues/cell types
**Key Concept: RNA-seq Coverage**
RNA-seq coverage shows how many RNA molecules were sequenced at each position in the genome:
```
Position:    1000   1100   1200   1300   1400
Exon 1:      ████████████
Intron:                    ░░░░░
Exon 2:                           ████████████
Coverage:    2.5    3.1    0.1    2.8    3.0
```
**TPM (Transcripts Per Million)**
Borzoi outputs are converted to **TPM** values:
```
rate_i = (Number of reads mapped to transcript i) / (Length of transcript i in kb)
TPM_i  = rate_i / (Sum of rate_j over all transcripts j in the sample) × 1,000,000
```
**Why TPM?**
- Normalizes for gene length (longer genes generate more reads)
- Normalizes for sequencing depth (accounts for total number of reads)
- Comparable across genes within a sample
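A small worked example makes the two normalization steps concrete: reads are first divided by transcript length, and the resulting rates are then scaled so that they sum to one million. The read counts and lengths below are hypothetical.

```python
def tpm(read_counts: dict[str, float], lengths_kb: dict[str, float]) -> dict[str, float]:
    """Length-normalize each transcript, then scale so values sum to 1e6."""
    rate = {t: read_counts[t] / lengths_kb[t] for t in read_counts}
    total = sum(rate.values())
    return {t: r / total * 1_000_000 for t, r in rate.items()}


# Hypothetical read counts and transcript lengths for illustration.
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths = {"geneA": 1.0, "geneB": 2.0, "geneC": 1.0}

values = tpm(counts, lengths)
for gene, value in values.items():
    print(f"{gene}: {value:,.0f}")
print(f"sum: {sum(values.values()):,.0f}")
```

Here `geneB` has twice the reads of `geneA` but is also twice as long, so both end up with the same TPM.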
**Pipeline Implementation:**
The pipeline has two Borzoi processes:
**Process 1: FILTER_VCF**
```python
# Extract variants in coding regions + 1000 bp upstream regulatory regions
# Create filtered VCF containing only protein-coding variants
```
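The filtering idea can be sketched as an interval test. This is an assumption-laden illustration rather than the pipeline's implementation: it uses the whole gene span as a stand-in for the coding regions, takes "upstream" relative to the gene's strand, and uses invented gene coordinates. `Gene`, `gene_window`, and `keep_variant` are hypothetical names.

```python
from typing import NamedTuple

UPSTREAM_BP = 1000  # regulatory window named in the comment above


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def gene_window(gene: Gene) -> tuple[int, int]:
    """Gene span plus 1000 bp upstream of the transcription start."""
    if gene.strand == "+":
        return max(1, gene.start - UPSTREAM_BP), gene.end
    return gene.start, gene.end + UPSTREAM_BP


def keep_variant(chrom: str, pos: int, genes: list[Gene]) -> bool:
    """True if the variant falls inside any gene window."""
    for gene in genes:
        lo, hi = gene_window(gene)
        if chrom == gene.chrom and lo <= pos <= hi:
            return True
    return False


# Hypothetical gene coordinates and variant positions.
genes = [Gene("GENE_FWD", "chr1", 10_000, 12_000, "+"),
         Gene("GENE_REV", "chr2", 50_000, 52_000, "-")]
variants = [("chr1", 9_500), ("chr1", 8_000), ("chr2", 52_900), ("chr2", 49_500)]
filtered = [v for v in variants if keep_variant(v[0], v[1], genes)]
print(filtered)  # [('chr1', 9500), ('chr2', 52900)]
```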
**Process 2: PREDICT_EXPRESSION**
```python
# For each protein-coding gene with mutations:
# 1. Extract DNA sequence (reference + mutations)
# 2. Run Borzoi to predict RNA coverage
# 3. Calculate TPM by summing coverage over exons
# 4. Generate TPM matrix: Genes × Tissues
```
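Step 3, summing coverage over exons, can be illustrated with a short sketch. It assumes the predicted track is a list of per-bin values at 32 bp resolution and that exons are half-open genomic intervals; the track values and exon coordinates are hypothetical, and the real pipeline may weight partially overlapping bins differently.

```python
BIN_SIZE = 32  # Borzoi output resolution in bp, as stated above


def exon_coverage(coverage: list[float], window_start: int,
                  exons: list[tuple[int, int]]) -> float:
    """Sum predicted coverage over bins that overlap any exon.

    coverage[i] covers genomic positions
    [window_start + i*BIN_SIZE, window_start + (i+1)*BIN_SIZE).
    Exons are half-open (start, end) intervals.
    """
    total = 0.0
    for i, value in enumerate(coverage):
        bin_start = window_start + i * BIN_SIZE
        bin_end = bin_start + BIN_SIZE
        if any(bin_start < end and start < bin_end for start, end in exons):
            total += value
    return total


# Hypothetical 8-bin track: two exons separated by an intron.
track = [2.5, 3.1, 0.1, 0.1, 0.1, 2.8, 3.0, 0.0]
exons = [(1000, 1064), (1160, 1224)]
print(round(exon_coverage(track, 1000, exons), 1))
```

Only the four bins overlapping the two exons contribute; the low intronic values are excluded from the gene-level total.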
**Example Output:**
```csv
Gene,Adipose_Tissue,Brain_Cortex,Heart,Liver,Muscle
BRCA1,12.5,8.3,5.2,15.7,7.9
TP53,45.2,52.1,38.9,42.3,35.6
APOE,8.7,125.3,6.1,78.2,5.4
```
This table shows predicted RNA expression (TPM) for each gene in each tissue.
**MANE Dataset**
The pipeline uses the **MANE (Matched Annotation from NCBI and EMBL-EBI)** dataset:
- Contains reference transcript sequences for all human protein-coding genes
- Provides consensus between RefSeq and Ensembl/GENCODE annotations
- Includes exon/intron boundaries needed for TPM calculation
### 3. VCF2Prot: DNA Variants to Protein Sequences
```mermaid
flowchart TD
VCF[VCF File:<br/>Genetic Variants] --> Annotate[BCFtools CSQ:<br/>Annotate Variants]
Reference[Reference Genome<br/>GRCh38] --> Annotate
GFF[Gene Annotations<br/>GFF3 Format] --> Annotate
Annotate --> Annotated_VCF[Annotated VCF:<br/>Functional Consequences]
Annotated_VCF --> VCF2Prot[VCF2Prot Tool]
MANE_Ref[MANE Reference<br/>Transcripts] --> VCF2Prot
VCF2Prot --> Mutated_Proteins[Mutated Protein<br/>Sequences FASTA]
style VCF2Prot fill:#FFB6C1
style Mutated_Proteins fill:#90EE90
```
**What is VCF2Prot?**
VCF2Prot is a tool that translates DNA variants into their effects on protein sequences. It:
- Takes variants from VCF files
- Maps them to gene transcripts
- Predicts how variants change the protein sequence
- Outputs mutated protein sequences
**Process Flow:**
1. **Variant Annotation (BCFtools CSQ)**
- Maps variants to genes and transcripts
- Determines functional consequence:
- Missense: Changes one amino acid
- Nonsense: Creates premature stop codon
- Frameshift: Shifts reading frame
- Synonymous: No change to amino acid
2. **Protein Sequence Prediction (VCF2Prot)**
- Loads reference protein sequences from MANE
- Applies variants to generate mutated sequences
- Handles complex variants (insertions, deletions)
**Example:**
```
Reference DNA:  ATG GCT AAA TGC
Reference RNA:  AUG GCU AAA UGC
Reference Prot: Met-Ala-Lys-Cys
Variant: Position 4, G→T
Mutant DNA:     ATG TCT AAA TGC
Mutant RNA:     AUG UCU AAA UGC
Mutant Prot:    Met-Ser-Lys-Cys
                    ^^^
                    Changed amino acid!
```
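The same substitution can be reproduced in code. Counting from 1, the substituted G is the fourth base of the coding sequence (the first base of the second codon). The sketch below uses a partial codon table and hypothetical helper names; it handles single-base substitutions only, not the insertions and deletions that VCF2Prot supports.

```python
# Partial codon table: only the codons needed for this example.
CODONS = {"ATG": "Met", "GCT": "Ala", "TCT": "Ser", "AAA": "Lys", "TGC": "Cys"}


def apply_snv(cds: str, pos: int, ref: str, alt: str) -> str:
    """Substitute one base at a 1-based position after checking the reference allele."""
    if cds[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {cds[pos - 1]}")
    return cds[:pos - 1] + alt + cds[pos:]


def translate(cds: str) -> str:
    """Translate a coding DNA sequence codon by codon."""
    return "-".join(CODONS[cds[i:i + 3]] for i in range(0, len(cds), 3))


reference = "ATGGCTAAATGC"
mutant = apply_snv(reference, 4, "G", "T")
print(translate(reference))  # Met-Ala-Lys-Cys
print(translate(mutant))     # Met-Ser-Lys-Cys
```

Checking the reference allele before substituting catches coordinate mistakes early, since a mismatch usually means the position or the reference build is wrong.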
**Output Format: FASTA**
```
>Patient_001_ENST00000357654_BRCA1_p.G1738R
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLK...
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIE...
```
Each sequence shows:
- Patient ID
- Transcript ID
- Gene name
- Protein variant notation
- Full mutated protein sequence
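Downstream scripts often need to split these headers back into their fields. The sketch below assumes the header layout shown in the example (`Patient_<n>_<transcript>_<gene>_<variant>`) and gene symbols without underscores; the regular expression and function names are illustrative, and the sequences are short placeholders rather than real proteins.

```python
import re

HEADER = re.compile(
    r"^>(?P<patient>Patient_\d+)_(?P<transcript>ENST\d+)"
    r"_(?P<gene>[^_]+)_(?P<variant>p\..+)$"
)


def parse_header(line: str) -> dict[str, str]:
    """Split a FASTA header of the assumed layout into named fields."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    return match.groupdict()


def read_fasta(text: str) -> list[tuple[dict[str, str], str]]:
    """Return (header fields, sequence) pairs from FASTA text."""
    records, fields, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if fields is not None:
                records.append((fields, "".join(chunks)))
            fields, chunks = parse_header(line), []
        elif line.strip():
            chunks.append(line.strip())
    if fields is not None:
        records.append((fields, "".join(chunks)))
    return records


# Placeholder sequences; real records hold full-length proteins.
example = """>Patient_001_ENST00000357654_BRCA1_p.G1738R
MKT
AAA
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQ
"""
for fields, seq in read_fasta(example):
    print(fields["gene"], fields["variant"], len(seq))
```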
### 4. RNA2ProteinExpression: RNA to Protein Prediction
```mermaid
flowchart TD
RNA_TPM[RNA TPM Values<br/>from Borzoi] --> Model[Deep Learning Model]
subgraph Neural Network
Input[Input Layer:<br/>RNA Expression]
Hidden1[Hidden Layer 1:<br/>Tissue Context]
Hidden2[Hidden Layer 2:<br/>Translation Efficiency]
Output[Output Layer:<br/>Protein Abundance]
Input --> Hidden1
Hidden1 --> Hidden2
Hidden2 --> Output
end
Model --> Protein_Expr[Protein Expression<br/>Log2 Scale]
style Model fill:#FFE4E1
style Protein_Expr fill:#90EE90
```
**What is RNA2ProteinExpression?**
This is a custom deep learning model that predicts protein abundance from RNA expression levels. It's trained on:
- RNA-seq data (transcript levels)
- Mass spectrometry data (protein levels)
- Gene Ontology (GO) annotations
**Why is this needed?**
RNA levels don't perfectly correlate with protein levels because:
- **Translation efficiency** varies between genes
- **Protein stability** varies (some proteins are rapidly degraded)
- **Post-transcriptional regulation** (microRNAs, RNA-binding proteins)
Typical RNA-protein correlation: **r = 0.4-0.6** (not 1.0!)
**Model Architecture:**
The neural network learns:
- Which genes have high vs. low translation efficiency
- Tissue-specific effects on protein production
- GO term enrichments that affect protein stability
**Input:**
```
Gene_ID, Tissue, RNA_TPM
ENSG00000130203, Brain_Cortex, 125.3
ENSG00000130203, Liver, 78.2
```
**Output:**
```
Gene_ID, Tissue, Protein_Expression_log2
ENSG00000130203, Brain_Cortex, 8.5
ENSG00000130203, Liver, 7.2
```
**Log2 scale**: Protein expression in log2 transformed units (easier to interpret fold-changes)
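On a log2 scale, a difference between two values converts directly to a fold-change: a difference of 1 means a 2-fold change. A short sketch using the two example values above (the function name is illustrative):

```python
def fold_change(log2_a: float, log2_b: float) -> float:
    """Ratio of abundances a/b given both on a log2 scale."""
    return 2 ** (log2_a - log2_b)


brain, liver = 8.5, 7.2   # values from the example output above
print(f"log2 difference: {brain - liver:.1f}")
print(f"fold change: {fold_change(brain, liver):.2f}")
```

A log2 difference of 1.3 therefore corresponds to roughly 2.5 times more predicted protein in brain cortex than in liver.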
### 5. CORTO: Metabolome Prediction
```mermaid
flowchart TD
RNA_TPM[RNA TPM Matrix] --> CORTO[CORTO Algorithm]
Regulon[Regulon Data:<br/>TF-Gene Relationships] --> CORTO
subgraph CORTO Algorithm
Correlation["1. Calculate Correlations:<br/>TF <-> Metabolic Genes"]
DPI[2. Data Processing Inequality:<br/>Remove Indirect Edges]
Bootstrap[3. Bootstrap:<br/>Assess Robustness]
MRA[4. Master Regulator Analysis:<br/>Identify Key TFs]
Correlation --> DPI
DPI --> Bootstrap
Bootstrap --> MRA
end
CORTO --> Metabolome[Metabolome Predictions:<br/>Metabolic Activity]
style CORTO fill:#F0E68C
style Metabolome fill:#90EE90
```
**What is CORTO?**
CORTO (Correlation Tool) is an R package that infers gene regulatory networks and identifies master regulators controlling metabolic activity. It predicts:
- Activity of metabolic pathways
- Transcription factors (TFs) controlling metabolism
- Metabolite production levels
**How it works:**
1. **Input Regulon**: Pre-defined relationships between transcription factors (TFs) and their target genes (including metabolic enzymes)
2. **Correlation Analysis**: Calculate how TF expression correlates with target gene expression
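3. **Data Processing Inequality (DPI)**: Remove indirect relationships
- In each triplet of mutually connected nodes, the weakest edge is treated as indirect and removed
- Example: if TF1 → TF2 → Gene, the weaker TF1 → Gene edge is removed
- Keeps only the edges most likely to be direct regulatory relationships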
4. **Bootstrap**: Test robustness by resampling data
5. **Master Regulator Analysis (MRA)**: Identify TFs whose target genes are significantly enriched in metabolic pathways
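The DPI step (step 3) is the least intuitive, so a minimal sketch may help. It follows the general ARACNE-style rule of dropping the weakest edge in every fully connected triplet; the corto package's own implementation may differ in detail, and the edge scores below are hypothetical.

```python
from itertools import combinations


def apply_dpi(edges: dict[frozenset, float]) -> dict[frozenset, float]:
    """For every fully connected triplet, drop its weakest edge.

    edges maps an unordered node pair to an association strength
    (for example absolute correlation).
    """
    nodes = sorted({n for pair in edges for n in pair})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(e in edges for e in triangle):
            to_remove.add(min(triangle, key=lambda e: edges[e]))
    return {e: w for e, w in edges.items() if e not in to_remove}


# Hypothetical scores: TF1 acts on Gene only through TF2.
scores = {
    frozenset(("TF1", "TF2")): 0.9,
    frozenset(("TF2", "Gene")): 0.8,
    frozenset(("TF1", "Gene")): 0.5,   # weakest: likely indirect
}
pruned = apply_dpi(scores)
for pair, weight in pruned.items():
    print(sorted(pair), weight)
```

Edges are collected first and removed afterwards, so the result does not depend on the order in which triplets are visited.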
**Example:**
```
TF: PPARG (master regulator of fat metabolism)
Target Genes: FABP4, LPL, ADIPOQ, CD36, SCD (all involved in lipid metabolism)
Metabolome Prediction: High lipid synthesis activity
```
**Output:**
```csv
Pathway,Activity_Score,P_value
Glycolysis,-1.5,0.001
TCA_Cycle,2.3,0.0001
Fatty_Acid_Synthesis,1.8,0.002
```
- **Activity Score**: Positive = pathway activated, Negative = pathway suppressed
- **P-value**: Statistical significance
### 6. CIBERSORTx: Immune Cell Deconvolution
```mermaid
flowchart TD
RNA_TPM[Bulk RNA TPM<br/>Mixed Cell Types] --> Signature[Signature Matrix:<br/>Cell-Specific Genes]
subgraph CIBERSORTx
Fractions[Step 1: Fractions<br/>Estimate Cell Proportions]
HiRes[Step 2: HiRes<br/>Cell-Specific Expression]
end
RNA_TPM --> Fractions
Signature --> Fractions
Fractions --> Proportions[Cell Type Proportions]
Proportions --> HiRes
RNA_TPM --> HiRes
Signature --> HiRes
HiRes --> Cell_Specific[Cell-Type-Specific<br/>Gene Expression]
style CIBERSORTx fill:#DDA0DD
style Proportions fill:#90EE90
style Cell_Specific fill:#90EE90
```
**What is CIBERSORTx?**
CIBERSORTx is a computational tool for **immune cell deconvolution**. When you sequence RNA from a tissue sample, you get a mixture of RNA from all cells in that tissue. CIBERSORTx:
- Estimates what proportion of cells are each immune cell type
- Infers cell-type-specific gene expression profiles
**Why is this important?**
Immune cells play crucial roles in:
- Fighting infections
- Cancer immunotherapy response
- Autoimmune diseases
- Inflammation
Understanding immune composition helps interpret disease mechanisms.
**How it works:**
**Step 1: Signature Matrix**
A reference matrix showing genes specifically expressed in each cell type:
```
Gene T_cells B_cells Macrophages NK_cells
CD3D HIGH low low low
CD19 low HIGH low low
CD68 low low HIGH low
NKG7 low low low HIGH
```
**Step 2: CIBERSORTx Fractions**
Uses **Support Vector Regression (SVR)** to solve:
```
Bulk_Expression = Σ (Proportion_i × Signature_i)
```
Where:
- Bulk_Expression = measured RNA in tissue
- Proportion_i = fraction of cell type i
- Signature_i = expression pattern of cell type i
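To show what solving this equation means, the sketch below recovers known proportions from a synthetic bulk profile. It deliberately swaps in a simpler technique, projected gradient descent on the squared error with a non-negativity constraint, instead of the support vector regression that CIBERSORTx uses. The signature values, cell types, and step size are hypothetical, and the step size would need tuning for differently scaled data.

```python
def deconvolve(signature, bulk, steps=500, lr=0.005):
    """Estimate non-negative proportions p such that bulk ≈ signature · p.

    signature[g][k] is the expression of gene g in cell type k.
    Simplified stand-in for CIBERSORTx's support vector regression.
    """
    n_genes, n_types = len(signature), len(signature[0])
    p = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [sum(signature[g][k] * p[k] for k in range(n_types)) - bulk[g]
                    for g in range(n_genes)]
        grad = [sum(signature[g][k] * residual[g] for g in range(n_genes))
                for k in range(n_types)]
        # Gradient step, then project back onto p >= 0.
        p = [max(0.0, p[k] - lr * grad[k]) for k in range(n_types)]
    total = sum(p)
    return [x / total for x in p]  # rescale so proportions sum to 1


# Hypothetical signature: rows = genes, columns = T cells, B cells, macrophages.
signature = [[10, 1, 1],
             [1, 10, 1],
             [1, 1, 10],
             [2, 2, 2]]
true_props = [0.5, 0.3, 0.2]
# Simulate the bulk sample with the forward model from the equation above.
bulk = [sum(s * p for s, p in zip(row, true_props)) for row in signature]

estimate = deconvolve(signature, bulk)
print([round(x, 3) for x in estimate])
```

Because the bulk profile was simulated without noise, the estimate matches the true proportions closely; real data would leave a residual.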
**Step 3: CIBERSORTx HiRes**
After knowing proportions, infer gene expression within each cell type by:
- Modeling tissue expression as weighted sum of cell-type contributions
- Deconvolving to separate cell-type-specific signals
**Example Output:**
**Fractions:**
```csv
Sample,CD8_T_cells,CD4_T_cells,B_cells,NK_cells,Monocytes
Patient_001_Brain,0.05,0.08,0.02,0.01,0.15
Patient_001_Liver,0.12,0.15,0.08,0.03,0.22
```
**HiRes:**
```csv
Tissue,Cell_Type,CD3D,CD19,CD68
Brain_Patient_001,CD8_T_cells,HIGH,low,low
Brain_Patient_001,B_cells,low,HIGH,low
```
**Pipeline Implementation:**
1. **CONVERT_TO_TXT**: Convert CSV to tab-delimited format (CIBERSORTx input format)
2. **CIBERSORTx_FRACTIONS**: Estimate cell proportions
3. **CIBERSORTx_HIRES**: Infer cell-specific expression
4. **ADD_TISSUE_NAMES_TO_CIBERSORTX**: Add tissue annotations to output
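The format conversion in the first step amounts to re-emitting the TPM table with tabs instead of commas. A minimal sketch using Python's standard `csv` module follows; the function name is illustrative and the actual process may also rename or reorder columns.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Re-emit comma-separated text as tab-delimited text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# A slice of the example TPM table shown earlier.
tpm_csv = "Gene,Brain_Cortex,Liver\nBRCA1,8.3,15.7\nTP53,52.1,42.3\n"
print(csv_to_tsv(tpm_csv), end="")
```

Using the `csv` module rather than a plain string replace keeps quoted fields that contain commas intact.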
---
## Workflow Execution
### Nextflow: Workflow Orchestration
```mermaid
flowchart TD
Config[nextflow.config:<br/>Configuration] --> NF[Nextflow Engine]
Params[params.json:<br/>Parameters] --> NF
subgraph Nextflow Engine
Parse[Parse Workflow DSL]
Schedule[Schedule Processes]
Execute[Execute in Docker/Singularity]
Monitor[Monitor & Checkpoint]
Parse --> Schedule
Schedule --> Execute
Execute --> Monitor
end
NF --> Channels[Data Channels:<br/>Pass Files Between Processes]
Channels --> Processes[Execute Processes]
style NF fill:#87CEEB
```
**What is Nextflow?**
Nextflow is a workflow orchestration system specifically designed for data-intensive computational pipelines. It:
- Manages dependencies between analysis steps
- Handles parallel execution
- Provides automatic checkpointing (resume failed runs)
- Supports multiple execution platforms (local, HPC clusters, cloud)
**Key Concepts:**
1. **Processes**: Individual computational tasks (e.g., "PREDICT_EXPRESSION")
2. **Channels**: Data streams that connect processes
3. **Operators**: Manipulate channels (e.g., `mix`, `flatten`, `collect`)
**Example Process Definition:**
```nextflow
process PREDICT_EXPRESSION {
container "${params.container_borzoi}" // Docker image
memory 4.GB // Memory requirement
accelerator 1 // GPU requirement
input:
path vcf_filtered // Input file
path MANE // Reference data
output:
path "*_TPM.csv" // Output file pattern
script:
"""
#!/opt/conda/envs/borzoi/bin/python
# Python script here
"""
}
```
**Channel Example:**
```nextflow
// Mix male and female patient VCFs
txt_ch = f_var.mix(m_var).flatten()
// This creates a channel with all VCF files:
// [Patient_001.vcf, Patient_002.vcf, ...]
```
### Complete Workflow
```mermaid
flowchart TD
Start([Start]) --> CheckDisease{Disease or Healthy?}
CheckDisease -->|Disease| GetStats[get_disease_stats_no_patients:<br/>Analyze UK Biobank]
CheckDisease -->|Healthy| LoadHealthy[Load Pre-computed<br/>Healthy Genomes]
GetStats --> GenM[generate_m_variants_cudf:<br/>Male Patients]
GetStats --> GenF[generate_f_variants_cudf:<br/>Female Patients]
LoadHealthy --> LoadM[Load Male<br/>Reference]
LoadHealthy --> LoadF[Load Female<br/>Reference]
GenM --> MakeVCF[make_vcfs:<br/>Generate VCF Files]
GenF --> MakeVCF
LoadM --> MakeVCF
LoadF --> MakeVCF
MakeVCF --> FilterVCF[FILTER_VCF:<br/>Extract Coding Variants]
MakeVCF --> VCF2Prot[VCF2PROT:<br/>Generate Mutated Proteins]
FilterVCF --> PredictExpr[PREDICT_EXPRESSION:<br/>Borzoi RNA Prediction]
PredictExpr --> RNA2Prot[RNA2PROTEXPRESSION:<br/>Protein Prediction]
PredictExpr --> CORTO[CORTO:<br/>Metabolome Prediction]
PredictExpr --> Convert[CONVERT_TO_TXT:<br/>Format Conversion]
Convert --> CiberFrac[CIBERSORTx_FRACTIONS:<br/>Cell Proportions]
CiberFrac --> CiberHires[CIBERSORTx_HIRES:<br/>Cell-Specific Expression]
CiberHires --> AddTissue[ADD_TISSUE_NAMES_TO_CIBERSORTX:<br/>Annotate Results]
RNA2Prot --> End([Complete<br/>Digital Patient])
CORTO --> End
AddTissue --> End
VCF2Prot --> End
style Start fill:#90EE90
style End fill:#FFB6C1
style PredictExpr fill:#87CEEB
```
### Execution Example
**1. Configuration (params.json)**
```json
{
"disease": "schizophrenia",
"n_pat": 10,
"percent_male": 0.5,
"container_borzoi": "harbor.cluster.omic.ai/omic/digital-patients/borzoi:latest"
}
```
**2. Launch Pipeline**
```bash
nextflow run test.nf -params-file params.json
```
**3. Nextflow Execution**
```
N E X T F L O W ~ version 21.04.0
Launching `test.nf` [amazing_babbage] - revision: 1a2b3c4d
[Synthea] Submitted process > get_disease_stats_no_patients
[Synthea] Submitted process > generate_m_variants_cudf (1)
[Synthea] Submitted process > generate_f_variants_cudf (1)
[Stage] Completed process > make_vcfs (10 files)
[Borzoi] Submitted process > FILTER_VCF (10)
[Borzoi] Submitted process > PREDICT_EXPRESSION (10)
...
Pipeline completed successfully!
```
**4. Directory Structure**
```
/outdir/
├── vcf2prot/
│ ├── Patient_001_transcript_id_mutations.fasta
│ └── Patient_002_transcript_id_mutations.fasta
├── borzoi/
│ ├── Patient_001_TPM.csv
│ └── Patient_002_TPM.csv
├── rna2protein/
│ ├── Patient_001_Protein_Expression_log2.csv
│ └── Patient_002_Protein_Expression_log2.csv
├── corto/
│ ├── Patient_001_metabolome.csv
│ └── Patient_002_metabolome.csv
└── ecotyper/
├── fractions/
│ └── Patient_001_CIBERSORTx_Results.txt
└── hires/
└── Patient_001_immune_cells.csv
```
---
## Technical Architecture
### Docker Containers
Each pipeline component runs in an isolated Docker container with specific dependencies:
```mermaid
flowchart LR
subgraph Docker Images
Synthea_Img["Synthea Container:<br/>- Java JDK<br/>- Python<br/>- BCFtools<br/>- GATK"]
Borzoi_Img["Borzoi Container:<br/>- TensorFlow<br/>- PyTorch<br/>- Baskerville<br/>- Python packages"]
VCF2Prot_Img["VCF2Prot Container:<br/>- BCFtools<br/>- vcf2prot binary<br/>- Reference genomes"]
RNA2Prot_Img["RNA2Protein Container:<br/>- PyTorch<br/>- Deep learning model<br/>- GO annotations"]
CORTO_Img["CORTO Container:<br/>- R<br/>- corto package<br/>- Regulon data"]
CIBERSORTx_Img["CIBERSORTx Container:<br/>- Python<br/>- R<br/>- CIBERSORTx binaries<br/>- Signature matrices"]
end
Registry[Container Registry:<br/>harbor.cluster.omic.ai] --> Synthea_Img
Registry --> Borzoi_Img
Registry --> VCF2Prot_Img
Registry --> RNA2Prot_Img
Registry --> CORTO_Img
Registry --> CIBERSORTx_Img
```
**Why Docker?**
- **Reproducibility**: Same environment every run
- **Isolation**: Avoid dependency conflicts
- **Portability**: Run anywhere (laptop, cluster, cloud)
**Example Dockerfile (Borzoi):**
```dockerfile
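FROM tensorflow/tensorflow:2.12.0-gpu
# Tools used by the steps below (may already be present in the base image)
RUN apt-get update && apt-get install -y wget git && rm -rf /var/lib/apt/lists/*
# Install conda and put it on PATH so later RUN steps can find it
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
ENV PATH=/opt/conda/bin:$PATH
# Create borzoi environment (-y skips the interactive confirmation)
RUN conda create -y -n borzoi python=3.9
RUN conda install -y -n borzoi tensorflow-gpu pandas numpy pysam
# Install Borzoi (built on the Baskerville framework) into the borzoi environment
RUN git clone https://github.com/calico/borzoi.git /home/omic/borzoi
RUN conda run -n borzoi pip install -e /home/omic/borzoi
# Download pre-trained models
RUN mkdir -p /home/omic/borzoi/saved_models/f0
# The download URL is not given in this document; supply it at build time
# with --build-arg BORZOI_MODEL_URL=...
ARG BORZOI_MODEL_URL
RUN wget -O /home/omic/borzoi/saved_models/f0/model0_best.h5 "${BORZOI_MODEL_URL}"
# Default command
CMD ["/bin/bash"]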
```
### Data Flow Architecture
```mermaid
flowchart TD
subgraph Input Data
UKBB[UK Biobank:<br/>Genetic Variants]
MANE[MANE Transcripts:<br/>Reference Sequences]
RefGenome[Reference Genome:<br/>GRCh38]
Regulon[Regulon Database:<br/>TF-Gene Networks]
LM22[LM22 Signature Matrix:<br/>Immune Cell Markers]
end
subgraph Processing
Synthea --> |VCF| Filter
Filter --> |Filtered VCF| Borzoi
Synthea --> |VCF| VCF2Prot
Borzoi --> |TPM CSV| RNA2Prot
Borzoi --> |TPM CSV| CORTO
Borzoi --> |TPM CSV| CIBERSORTx
end
UKBB --> Synthea
MANE --> Borzoi
MANE --> VCF2Prot
RefGenome --> VCF2Prot
Regulon --> CORTO
LM22 --> CIBERSORTx
subgraph Output Data
RNA2Prot --> |Protein Expression| Results
CORTO --> |Metabolome| Results
CIBERSORTx --> |Immune Cells| Results
VCF2Prot --> |Mutated Proteins| Results
end
```
### Computational Requirements
| Process | CPU | RAM | GPU | Time (per patient) |
| ------------------ | ------- | ---- | -------- | ------------------ |
| Synthea | 2 cores | 4 GB | No | ~5 min |
| FILTER_VCF | 4 cores | 4 GB | No | ~2 min |
| PREDICT_EXPRESSION | 8 cores | 4 GB | Yes (1x) | ~30-60 min |
| VCF2PROT | 2 cores | 2 GB | No | ~10 min |
| RNA2PROTEXPRESSION | 4 cores | 2 GB | Yes (1x) | ~5 min |
| CORTO | 2 cores | 1 GB | No | ~3 min |
| CIBERSORTx | 4 cores | 4 GB | No | ~15 min |
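**Total time per patient: ~70-100 minutes** when the steps run one after another (the sum of the rows above); independent branches can overlap, so wall-clock time can be lower.
**Parallelization:**
Nextflow automatically parallelizes patient processing:
- 10 patients with 4 GPUs → PREDICT_EXPRESSION runs in 3 waves (⌈10/4⌉), so roughly 90-180 minutes for that step, compared with 300-600 minutes on a single GPU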
- Patients are processed independently
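A rough wall-clock estimate for the GPU-bound step can be derived from the table. The sketch assumes one patient per GPU at a time and ignores scheduling overhead and the CPU-only steps; the function name is illustrative.

```python
import math


def gpu_step_minutes(n_patients: int, n_gpus: int,
                     minutes_per_patient: tuple[int, int]) -> tuple[int, int]:
    """Lower and upper wall-clock bounds for a step that needs one GPU per patient."""
    waves = math.ceil(n_patients / n_gpus)  # patients run in batches of n_gpus
    low, high = minutes_per_patient
    return waves * low, waves * high


# PREDICT_EXPRESSION takes ~30-60 min per patient (table above).
print(gpu_step_minutes(10, 4, (30, 60)))   # (90, 180)
print(gpu_step_minutes(10, 10, (30, 60)))  # (30, 60)
```

With as many GPUs as patients, the step finishes in the time of a single patient, which is the best case for this stage.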
---
## Outputs and Applications
### Complete Digital Patient Profile
```mermaid
flowchart LR
Patient[Digital Patient:<br/>Patient_001] --> Genome[Genomic Profile]
Patient --> Transcriptome[Transcriptomic Profile]
Patient --> Proteome[Proteomic Profile]
Patient --> Metabolome[Metabolomic Profile]
Patient --> Immune[Immune Profile]
Genome --> G1[Genetic Variants:<br/>SNPs, Indels]
Genome --> G2[Disease-Associated<br/>Mutations]
Transcriptome --> T1[RNA Expression:<br/>20,000+ genes]
Transcriptome --> T2[Tissue-Specific<br/>Expression]
Proteome --> P1[Protein Abundance:<br/>10,000+ proteins]
Proteome --> P2[Mutated Protein<br/>Sequences]
Metabolome --> M1[Pathway Activity:<br/>Metabolism]
Metabolome --> M2[Master Regulator<br/>TFs]
Immune --> I1[Cell Composition:<br/>T-cells, B-cells, etc.]
Immune --> I2[Cell-Specific<br/>Expression]
style Patient fill:#FFD700
```
### Example Application: Cancer Research
```mermaid
flowchart TD
Question[Research Question:<br/>How do BRCA1 mutations<br/>affect breast cancer?]
Question --> Generate[Generate 100 synthetic<br/>patients with BRCA1<br/>mutations]
Generate --> Analyze{Analyze Digital Patients}
Analyze --> RNA[RNA Analysis:<br/>Find genes<br/>co-expressed with BRCA1]
Analyze --> Protein[Protein Analysis:<br/>Identify altered<br/>pathways]
Analyze --> Metabolome[Metabolome Analysis:<br/>Detect metabolic<br/>shifts]
Analyze --> Immune[Immune Analysis:<br/>Characterize immune<br/>infiltration]
RNA --> Insight[Insights into<br/>Disease Mechanisms]
Protein --> Insight
Metabolome --> Insight
Immune --> Insight
Insight --> Drug[Drug Target<br/>Discovery]
Insight --> Biomarker[Biomarker<br/>Identification]
style Question fill:#FFB6C1
style Insight fill:#90EE90
```
### Output Files Summary
**1. Genetic Variants (VCF)**
- **File**: `Patient_001_variants.vcf`
- **Content**: All genetic variants for patient
- **Use**: Understand genetic basis of disease
**2. RNA Expression (TPM)**
- **File**: `Patient_001_TPM.csv`
- **Content**: Gene expression across 89 tissues
- **Use**: Identify dysregulated genes, tissue-specific effects
**3. Protein Expression**
- **File**: `Patient_001_Protein_Expression_log2.csv`
- **Content**: Predicted protein abundance
- **Use**: Understand functional consequences of RNA changes
**4. Mutated Proteins (FASTA)**
- **File**: `Patient_001_transcript_id_mutations.fasta`
- **Content**: Protein sequences with mutations
- **Use**: Study structural changes, predict drug binding
**5. Metabolome**
- **File**: `Patient_001_metabolome.csv`
- **Content**: Pathway activity scores
- **Use**: Understand metabolic reprogramming
**6. Immune Cells**
- **File**: `Patient_001_immune_cells.csv`
- **Content**: Cell type proportions and expression
- **Use**: Characterize immune microenvironment
### Research Applications
1. **Disease Mechanism Discovery**
- Generate patients with specific mutations
- Compare to healthy controls
- Identify molecular changes caused by mutations
2. **Drug Target Identification**
- Find genes/proteins consistently altered across patients
- Prioritize targets for therapeutic intervention
3. **Biomarker Discovery**
- Identify molecular signatures distinguishing diseased from healthy
- Develop diagnostic tests
4. **Precision Medicine**
- Model individual patient molecular profiles
- Predict treatment response
- Personalize therapy
5. **Clinical Trial Simulation**
- Generate virtual patient cohorts
- Test hypotheses before expensive trials
- Power calculations and study design
6. **Education and Training**
- Teach students about multi-omics analysis
- No patient privacy concerns
- Unlimited data generation
---
## Conclusion
This Digital Patient Pipeline integrates:
- **Synthetic data generation**: Realistic patient simulation
- **Deep learning**: RNA expression prediction from DNA
- **Bioinformatics**: Multi-omic data integration
- **Workflow engineering**: Scalable, reproducible pipelines
By combining these technologies, researchers can:
- Study disease mechanisms without accessing sensitive patient data
- Generate unlimited data for hypothesis testing
- Model personalized molecular phenotypes
- Accelerate drug discovery and precision medicine
The pipeline produces comprehensive molecular profiles spanning:
- **Genomics** (DNA variants)
- **Transcriptomics** (RNA expression)
- **Proteomics** (protein abundance)
- **Metabolomics** (metabolic activity)
- **Immunomics** (immune cell composition)
This multi-omic integration offers a way to trace how genetic variants may cascade through biological systems to produce disease phenotypes.
---
## Glossary of Terms
**Bioinformatics Terms:**
- **TPM**: Transcripts Per Million - normalized measure of RNA abundance
- **VCF**: Variant Call Format - standard format for genetic variants
- **FASTA**: Text format for representing DNA/protein sequences
- **SNP**: Single Nucleotide Polymorphism - single base DNA variant
- **Indel**: Insertion or Deletion - adding or removing DNA bases
- **Exon**: Segment of a gene retained in the mature RNA; most exons contain protein-coding sequence
- **Intron**: Non-coding segment removed during RNA processing
- **Transcription**: DNA → RNA
- **Translation**: RNA → Protein
- **Gene Expression**: Process by which a gene's information is used to produce RNA and, for protein-coding genes, protein
**Machine Learning Terms:**
- **Neural Network**: Computer model inspired by brain structure
- **Convolutional Layer**: Detects local patterns in sequences
- **Attention Layer**: Learns long-range relationships
- **Training**: Teaching model from example data
- **Prediction**: Using model on new data
**Pipeline Terms:**
- **Process**: Individual computational step
- **Channel**: Data stream connecting processes
- **Container**: Isolated environment with dependencies
- **Workflow**: Series of connected processes
- **Nextflow**: Workflow orchestration system
**Biological Terms:**
- **Genome**: Complete set of DNA in organism
- **Transcriptome**: Complete set of RNA molecules
- **Proteome**: Complete set of proteins
- **Metabolome**: Complete set of metabolites
- **Phenotype**: Observable characteristics of organism
---
## References
### Pipeline Components
1. **Nextflow**: Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. *Nature Biotechnology*, 35:316-319.
2. **Synthea**: Walonoski et al. (2018). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. *JAMIA*, 25(3):230-238.
3. **Borzoi**: Linder et al. (2025). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. *Nature Genetics*, 57:949-961.
4. **CIBERSORTx**: Newman et al. (2019). Determining cell-type abundance and expression from bulk tissues with digital cytometry. *Nature Biotechnology*, 37:773-782.
5. **CORTO**: Mercatelli et al. (2020). corto: a lightweight R package for gene network inference and master regulator analysis. *Bioinformatics*, 36(12):3916-3917.
### Biological Background
6. **Gene Structure**: Lim et al. (2018). The exon-intron gene structure upstream of the initiation codon. *Nucleic Acids Research*, 46(5):2232-2244.
7. **TPM Normalization**: Zhao et al. (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. *RNA*, 26(8):903-909.
8. **VCF Format**: Danecek et al. (2011). The variant call format and VCFtools. *Bioinformatics*, 27(15):2156-2158.
9. **MANE**: Morales et al. (2022). A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. *Nature*, 604:310-315.
---
*This documentation was created to provide comprehensive understanding of a complex bioinformatics pipeline for researchers without extensive biological background. For questions or clarifications, please consult the cited references or contact the pipeline maintainers.*