Large reference/model files excluded from repo - to be staged to S3 or baked into Docker images.
Digital Patient Pipeline: Complete Documentation
Table of Contents
- Introduction
- Pipeline Overview
- Biological Background
- Pipeline Components
- Workflow Execution
- Technical Architecture
- Outputs and Applications
Introduction
This document provides an exhaustive explanation of a Digital Patient Pipeline - a sophisticated bioinformatics workflow that generates synthetic patient data and predicts multiple layers of molecular biology from genomic variants. The pipeline is implemented using Nextflow, a workflow orchestration system, and integrates multiple cutting-edge computational biology tools to simulate how genetic mutations affect gene expression, protein production, immune cell composition, and metabolic activity.
Purpose
The Digital Patient Pipeline serves several critical purposes:
- Synthetic Patient Generation: Creates realistic but synthetic patient profiles with genetic variants associated with specific diseases
- Multi-Omic Prediction: Predicts gene expression (RNA), protein abundance, immune cell composition, and metabolic activity from DNA sequence alone
- Clinical Research: Enables researchers to study disease mechanisms without accessing sensitive patient data
- Personalized Medicine: Models how individual genetic variants affect molecular phenotypes
Pipeline Overview
flowchart TD
Start([Start Pipeline]) --> Decision{Patient Type?}
Decision -->|Disease| Synthea[Synthea: Generate Disease Patients]
Decision -->|Healthy| Healthy[Synthea: Generate Healthy Patients]
Synthea --> VCF[VCF Files: Genetic Variants]
Healthy --> VCF
VCF --> FilterVCF[Filter VCF: Extract Coding Variants]
VCF --> VCF2Prot[VCF2Prot: Generate Mutated Proteins]
FilterVCF --> Borzoi[Borzoi: Predict RNA Expression]
Borzoi --> RNA2Prot[RNA2Protein: Predict Protein Expression]
Borzoi --> CORTO[CORTO: Predict Metabolome]
Borzoi --> CIBERSORTx[CIBERSORTx: Predict Immune Cells]
RNA2Prot --> Output1[Protein Expression Profiles]
CORTO --> Output2[Metabolic Activity Profiles]
CIBERSORTx --> Output3[Immune Cell Composition]
VCF2Prot --> Output4[Mutated Protein Sequences]
Output1 --> End([Complete Digital Patient])
Output2 --> End
Output3 --> End
Output4 --> End
style Start fill:#90EE90
style End fill:#FFB6C1
style Borzoi fill:#87CEEB
style Synthea fill:#DDA0DD
Pipeline Flow Summary
- Patient Generation: Synthea generates synthetic patients with realistic genetic variants
- Variant Processing: VCF files containing genetic mutations are filtered and processed
- RNA Expression Prediction: Borzoi predicts how mutations affect gene expression
- Downstream Analysis: Multiple tools analyze predicted RNA to generate comprehensive molecular profiles
- Integration: Results are combined to create a complete "digital patient"
Biological Background
To understand this pipeline, we need to understand the central dogma of molecular biology and how genetic information flows through biological systems.
The Central Dogma: DNA → RNA → Protein
flowchart LR
DNA[DNA: Genetic Code] -->|Transcription| RNA[RNA: Message]
RNA -->|Translation| Protein[Protein: Function]
Protein --> Phenotype[Cellular Phenotype]
style DNA fill:#FFE4B5
style RNA fill:#E0FFFF
style Protein fill:#FFE4E1
style Phenotype fill:#F0E68C
1. DNA (Deoxyribonucleic Acid)
DNA is the blueprint of life, containing the genetic instructions for all cellular functions. DNA consists of:
- Four nucleotide bases: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
- Double helix structure: Two complementary strands wound together
- Genes: Specific segments of DNA that encode instructions for proteins
Example DNA Sequence: ATGCGATCCGTA
2. RNA (Ribonucleic Acid)
During transcription, DNA is copied into RNA:
- RNA polymerase enzyme reads DNA and creates a complementary RNA strand
- RNA uses Uracil (U) instead of Thymine (T)
- The RNA carries the genetic message from the nucleus to protein-making machinery
Example RNA Sequence: AUGCGAUCCGUA (from DNA above)
3. Proteins
During translation, RNA is decoded to build proteins:
- Ribosomes read RNA in groups of three bases called codons
- Each codon specifies one amino acid
- Amino acids chain together to form proteins
Example: AUG → Methionine, CGA → Arginine, UCC → Serine, GUA → Valine
Protein: Methionine-Arginine-Serine-Valine
Gene Structure
Genes in eukaryotes (organisms with nuclei, like humans) have a complex structure:
flowchart LR
subgraph Gene Structure
Promoter[Promoter: -1000bp] --> UTR5[5' UTR]
UTR5 --> Exon1[Exon 1]
Exon1 --> Intron1[Intron 1]
Intron1 --> Exon2[Exon 2]
Exon2 --> Intron2[Intron 2]
Intron2 --> Exon3[Exon 3]
Exon3 --> UTR3[3' UTR]
end
Exon1 -.->|Splicing| mRNA[Mature mRNA]
Exon2 -.-> mRNA
Exon3 -.-> mRNA
style Exon1 fill:#90EE90
style Exon2 fill:#90EE90
style Exon3 fill:#90EE90
style Intron1 fill:#FFB6C1
style Intron2 fill:#FFB6C1
style Promoter fill:#FFD700
Key Components:
- Promoter: Regulatory region upstream of gene (-1000 bp) that controls when the gene is turned on
- 5' UTR (Untranslated Region): Beginning of RNA that isn't translated into protein
- Exons: Segments that ARE kept in the final RNA and code for protein
- Introns: Segments that are REMOVED during RNA processing
- 3' UTR: End region of RNA that isn't translated
Why this matters: The Borzoi model in this pipeline predicts which parts of genes will be transcribed into RNA, including both exons and introns, before they're processed.
Genetic Variants and Their Effects
flowchart TD
Variant[Genetic Variant] --> Type{Type?}
Type -->|SNP| SNP[Single Nucleotide<br/>Polymorphism:<br/>A->G]
Type -->|Insertion| INS[Insertion:<br/>ATCG->ATCGGG]
Type -->|Deletion| DEL[Deletion:<br/>ATCG->A]
SNP --> Effect1{Location?}
INS --> Effect1
DEL --> Effect1
Effect1 -->|Promoter| E1[Changes expression level]
Effect1 -->|Exon| E2[Changes protein sequence]
Effect1 -->|Splice site| E3[Changes splicing pattern]
Effect1 -->|Intron| E4[May affect regulation]
style Variant fill:#FFB6C1
style E1 fill:#87CEEB
style E2 fill:#87CEEB
style E3 fill:#87CEEB
style E4 fill:#87CEEB
Variants are differences in DNA sequence between individuals:
- SNP (Single Nucleotide Polymorphism): Single base change (e.g., A→G)
- Insertion: Extra bases added
- Deletion: Bases removed
- Structural Variant: Large-scale DNA rearrangements
These variants can affect:
- Gene expression: How much RNA is made
- Protein sequence: Which amino acids are in the protein
- Splicing: Which exons are included in mature RNA
VCF Format: Storing Genetic Variants
The VCF (Variant Call Format) is a standardized text file format that stores genetic variants:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 12345 rs123 A G 100 PASS DP=50
chr2 67890 rs456 TC T 95 PASS DP=45
Columns explained:
- CHROM: Chromosome where variant is located (chr1, chr2, etc.)
- POS: Position on the chromosome (base pair number)
- ID: Database identifier (often from dbSNP database)
- REF: Reference base(s) at this position
- ALT: Alternate base(s) - the variant
- QUAL: Quality score for the variant call
- FILTER: Whether variant passed quality filters
- INFO: Additional information (e.g., read depth)
Pipeline Components
1. Synthea: Synthetic Patient Generator
flowchart LR
Input[Input Parameters] --> Synthea{Synthea Engine}
subgraph Input Parameters
Disease[Disease Type:<br/>schizophrenia, cancer, etc.]
Demographics[Demographics:<br/>Age, Gender, Location]
N[Number of Patients]
end
Synthea --> UKBB[UK Biobank<br/>Genetic Database]
UKBB --> Disease_Variants[Disease-Associated<br/>Genetic Variants]
Disease_Variants --> VCF_Out[VCF Files<br/>Patient_001.vcf<br/>Patient_002.vcf]
style Synthea fill:#DDA0DD
style VCF_Out fill:#90EE90
What is Synthea?
Synthea™ is an open-source synthetic patient generator that creates realistic but completely fake patient data. It models:
- Medical history
- Demographics (age, gender, location)
- Health conditions
- Medications and treatments
- Genetic variants
How it works in this pipeline:
- User specifies disease: For example, "schizophrenia" or "healthy"
- Statistical analysis: Synthea analyzes the UK Biobank database (a large collection of genetic data from ~500,000 individuals) to find variants statistically associated with the disease
- Probability-based sampling: Variants are selected based on their frequency in diseased vs. healthy populations
- VCF generation: Creates a VCF file for each synthetic patient containing their unique set of genetic variants
Parameters:
params.disease = 'schizophrenia' // Disease to model
params.n_pat = 10 // Number of patients to generate
params.percent_male = 0.5 // Gender distribution
For Healthy Patients:
If generating healthy controls, Synthea samples from pre-computed reference genomes representing the genetic diversity of healthy populations.
Output Example:
Patient_001_variants.vcf
Patient_002_variants.vcf
...
Patient_010_variants.vcf
2. Borzoi: RNA-seq Prediction from DNA
flowchart TD
DNA[DNA Sequence<br/>524,288 bp] --> Borzoi[Borzoi Neural Network]
subgraph Borzoi Architecture
Conv1[Convolutional Layers:<br/>Learn local patterns]
Conv1 --> Attention[Self-Attention Layers:<br/>Learn long-range interactions]
Attention --> Upsample[Upsampling Layers:<br/>Increase resolution]
Upsample --> Output_Layer[Output: RNA Coverage<br/>32 bp resolution]
end
Borzoi --> Tissues[Predictions for 89 Tissues/Cell Types]
Tissues --> TPM[TPM Values:<br/>Transcripts Per Million]
style Borzoi fill:#87CEEB
style TPM fill:#90EE90
What is Borzoi?
Borzoi is a deep learning model developed at Calico Life Sciences that predicts RNA-seq coverage (how much RNA is produced from each part of the genome) directly from DNA sequence. This is revolutionary because it:
- Predicts gene expression without actually doing wet-lab RNA sequencing
- Accounts for multiple layers of regulation (transcription, splicing, polyadenylation)
- Provides tissue-specific predictions
How does it work?
- Input: 524,288 base pairs of DNA sequence (524 kb)
- Neural Network Processing:
- Convolutional layers: Learn local DNA patterns (e.g., transcription factor binding sites)
- Self-attention layers: Learn long-range interactions between regulatory elements
- Upsampling layers: Increase resolution from 128 bp to 32 bp
- Output: RNA coverage at 32 bp resolution for 89 different tissues/cell types
Key Concept: RNA-seq Coverage
RNA-seq coverage shows how many RNA molecules were sequenced at each position in the genome:
Position: 1000 1100 1200 1300 1400
Exon 1: ████████████████
Intron: ░
Exon 2: ████████████
Coverage: 2.5 3.1 0.1 2.8 3.0
TPM (Transcripts Per Million)
Borzoi outputs are converted to TPM values:
TPM = (Number of reads mapped to transcript / Transcript length in kb)
× (1,000,000 / Total reads in sample)
Why TPM?
- Normalizes for gene length (longer genes generate more reads)
- Normalizes for sequencing depth (accounts for total number of reads)
- Comparable across genes within a sample
Pipeline Implementation:
The pipeline has two Borzoi processes:
Process 1: FILTER_VCF
# Extract variants in coding regions + 1000 bp upstream regulatory regions
# Create filtered VCF containing only protein-coding variants
Process 2: PREDICT_EXPRESSION
# For each protein-coding gene with mutations:
# 1. Extract DNA sequence (reference + mutations)
# 2. Run Borzoi to predict RNA coverage
# 3. Calculate TPM by summing coverage over exons
# 4. Generate TPM matrix: Genes × Tissues
Example Output:
Gene,Adipose_Tissue,Brain_Cortex,Heart,Liver,Muscle
BRCA1,12.5,8.3,5.2,15.7,7.9
TP53,45.2,52.1,38.9,42.3,35.6
APOE,8.7,125.3,6.1,78.2,5.4
This table shows predicted RNA expression (TPM) for each gene in each tissue.
MANE Dataset
The pipeline uses the MANE (Matched Annotation from NCBI and EMBL-EBI) dataset:
- Contains reference transcript sequences for all human protein-coding genes
- Provides consensus between RefSeq and Ensembl/GENCODE annotations
- Includes exon/intron boundaries needed for TPM calculation
3. VCF2Prot: DNA Variants to Protein Sequences
flowchart TD
VCF[VCF File:<br/>Genetic Variants] --> Annotate[BCFtools CSQ:<br/>Annotate Variants]
Reference[Reference Genome<br/>GRCh38] --> Annotate
GFF[Gene Annotations<br/>GFF3 Format] --> Annotate
Annotate --> Annotated_VCF[Annotated VCF:<br/>Functional Consequences]
Annotated_VCF --> VCF2Prot[VCF2Prot Tool]
MANE_Ref[MANE Reference<br/>Transcripts] --> VCF2Prot
VCF2Prot --> Mutated_Proteins[Mutated Protein<br/>Sequences FASTA]
style VCF2Prot fill:#FFB6C1
style Mutated_Proteins fill:#90EE90
What is VCF2Prot?
VCF2Prot is a tool that translates DNA variants into their effects on protein sequences. It:
- Takes variants from VCF files
- Maps them to gene transcripts
- Predicts how variants change the protein sequence
- Outputs mutated protein sequences
Process Flow:
-
Variant Annotation (BCFtools CSQ)
- Maps variants to genes and transcripts
- Determines functional consequence:
- Missense: Changes one amino acid
- Nonsense: Creates premature stop codon
- Frameshift: Shifts reading frame
- Synonymous: No change to amino acid
-
Protein Sequence Prediction (VCF2Prot)
- Loads reference protein sequences from MANE
- Applies variants to generate mutated sequences
- Handles complex variants (insertions, deletions)
Example:
Reference DNA: ATG GCT AAA TGC
Reference RNA: AUG GCU AAA UGC
Reference Prot: Met-Ala-Lys-Cys
Variant: Position 5, G→T
Mutant DNA: ATG TCT AAA TGC
Mutant RNA: AUG UCU AAA UGC
Mutant Prot: Met-Ser-Lys-Cys
^^^
Changed amino acid!
Output Format: FASTA
>Patient_001_ENST00000357654_BRCA1_p.G1738R
MSLQSQLFKQRQYLSIKTKRSTKEVLDATLIHQSITGLYETRIDLSQLGGD...
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIE...
Each sequence shows:
- Patient ID
- Transcript ID
- Gene name
- Protein variant notation
- Full mutated protein sequence
4. RNA2ProteinExpression: RNA to Protein Prediction
flowchart TD
RNA_TPM[RNA TPM Values<br/>from Borzoi] --> Model[Deep Learning Model]
subgraph Neural Network
Input[Input Layer:<br/>RNA Expression]
Hidden1[Hidden Layer 1:<br/>Tissue Context]
Hidden2[Hidden Layer 2:<br/>Translation Efficiency]
Output[Output Layer:<br/>Protein Abundance]
Input --> Hidden1
Hidden1 --> Hidden2
Hidden2 --> Output
end
Model --> Protein_Expr[Protein Expression<br/>Log2 Scale]
style Model fill:#FFE4E1
style Protein_Expr fill:#90EE90
What is RNA2ProteinExpression?
This is a custom deep learning model that predicts protein abundance from RNA expression levels. It's trained on:
- RNA-seq data (transcript levels)
- Mass spectrometry data (protein levels)
- Gene Ontology (GO) annotations
Why is this needed?
RNA levels don't perfectly correlate with protein levels because:
- Translation efficiency varies between genes
- Protein stability varies (some proteins are rapidly degraded)
- Post-transcriptional regulation (microRNAs, RNA-binding proteins)
Typical RNA-protein correlation: r = 0.4-0.6 (not 1.0!)
Model Architecture:
The neural network learns:
- Which genes have high vs. low translation efficiency
- Tissue-specific effects on protein production
- GO term enrichments that affect protein stability
Input:
Gene_ID, Tissue, RNA_TPM
ENSG00000012048, Brain_Cortex, 125.3
ENSG00000012048, Liver, 78.2
Output:
Gene_ID, Tissue, Protein_Expression_log2
ENSG00000012048, Brain_Cortex, 8.5
ENSG00000012048, Liver, 7.2
Log2 scale: Protein expression in log2 transformed units (easier to interpret fold-changes)
5. CORTO: Metabolome Prediction
flowchart TD
RNA_TPM[RNA TPM Matrix] --> CORTO[CORTO Algorithm]
Regulon[Regulon Data:<br/>TF-Gene Relationships] --> CORTO
subgraph CORTO Algorithm
Correlation[1. Calculate Correlations:<br/>TF <-> Metabolic Genes]
DPI[2. Data Processing Inequality:<br/>Remove Indirect Edges]
Bootstrap[3. Bootstrap:<br/>Assess Robustness]
MRA[4. Master Regulator Analysis:<br/>Identify Key TFs]
Correlation --> DPI
DPI --> Bootstrap
Bootstrap --> MRA
end
CORTO --> Metabolome[Metabolome Predictions:<br/>Metabolic Activity]
style CORTO fill:#F0E68C
style Metabolome fill:#90EE90
What is CORTO?
CORTO (Correlation Tool) is an R package that infers gene regulatory networks and identifies master regulators controlling metabolic activity. It predicts:
- Activity of metabolic pathways
- Transcription factors (TFs) controlling metabolism
- Metabolite production levels
How it works:
-
Input Regulon: Pre-defined relationships between transcription factors (TFs) and their target genes (including metabolic enzymes)
-
Correlation Analysis: Calculate how TF expression correlates with target gene expression
-
Data Processing Inequality (DPI): Remove indirect relationships
- If TF1 → TF2 → Gene, remove direct TF1 → Gene edge
- Keeps only direct regulatory relationships
-
Bootstrap: Test robustness by resampling data
-
Master Regulator Analysis (MRA): Identify TFs whose target genes are significantly enriched in metabolic pathways
Example:
TF: PPARG (master regulator of fat metabolism)
Target Genes: FABP4, LPL, ADIPOQ, CD36, SCD (all involved in lipid metabolism)
Metabolome Prediction: High lipid synthesis activity
Output:
Pathway,Activity_Score,P_value
Glycolysis,-1.5,0.001
TCA_Cycle,2.3,0.0001
Fatty_Acid_Synthesis,1.8,0.002
- Activity Score: Positive = pathway activated, Negative = pathway suppressed
- P-value: Statistical significance
6. CIBERSORTx: Immune Cell Deconvolution
flowchart TD
RNA_TPM[Bulk RNA TPM<br/>Mixed Cell Types] --> Signature[Signature Matrix:<br/>Cell-Specific Genes]
subgraph CIBERSORTx
Fractions[Step 1: Fractions<br/>Estimate Cell Proportions]
HiRes[Step 2: HiRes<br/>Cell-Specific Expression]
end
RNA_TPM --> Fractions
Signature --> Fractions
Fractions --> Proportions[Cell Type Proportions]
Proportions --> HiRes
RNA_TPM --> HiRes
Signature --> HiRes
HiRes --> Cell_Specific[Cell-Type-Specific<br/>Gene Expression]
style CIBERSORTx fill:#DDA0DD
style Proportions fill:#90EE90
style Cell_Specific fill:#90EE90
What is CIBERSORTx?
CIBERSORTx is a computational tool for immune cell deconvolution. When you sequence RNA from a tissue sample, you get a mixture of RNA from all cells in that tissue. CIBERSORTx:
- Estimates what proportion of cells are each immune cell type
- Infers cell-type-specific gene expression profiles
Why is this important?
Immune cells play crucial roles in:
- Fighting infections
- Cancer immunotherapy response
- Autoimmune diseases
- Inflammation
Understanding immune composition helps interpret disease mechanisms.
How it works:
Step 1: Signature Matrix
A reference matrix showing genes specifically expressed in each cell type:
Gene T_cells B_cells Macrophages NK_cells
CD3D HIGH low low low
CD19 low HIGH low low
CD68 low low HIGH low
NKG7 low low low HIGH
Step 2: CIBERSORTx Fractions
Uses Support Vector Regression (SVR) to solve:
Bulk_Expression = Σ (Proportion_i × Signature_i)
Where:
- Bulk_Expression = measured RNA in tissue
- Proportion_i = fraction of cell type i
- Signature_i = expression pattern of cell type i
Step 3: CIBERSORTx HiRes
After knowing proportions, infer gene expression within each cell type by:
- Modeling tissue expression as weighted sum of cell-type contributions
- Deconvolving to separate cell-type-specific signals
Example Output:
Fractions:
Sample,CD8_T_cells,CD4_T_cells,B_cells,NK_cells,Monocytes
Patient_001_Brain,0.05,0.08,0.02,0.01,0.15
Patient_001_Liver,0.12,0.15,0.08,0.03,0.22
HiRes:
Tissue,Cell_Type,CD3D,CD19,CD68
Brain_Patient_001,CD8_T_cells,HIGH,low,low
Brain_Patient_001,B_cells,low,HIGH,low
Pipeline Implementation:
-
CONVERT_TO_TXT: Convert CSV to tab-delimited format (CIBERSORTx input format)
-
CIBERSORTx_FRACTIONS: Estimate cell proportions
-
CIBERSORTx_HIRES: Infer cell-specific expression
-
ADD_TISSUE_NAMES: Add tissue annotations to output
Workflow Execution
Nextflow: Workflow Orchestration
flowchart TD
Config[nextflow.config:<br/>Configuration] --> NF[Nextflow Engine]
Params[params.json:<br/>Parameters] --> NF
subgraph Nextflow Engine
Parse[Parse Workflow DSL]
Schedule[Schedule Processes]
Execute[Execute in Docker/Singularity]
Monitor[Monitor & Checkpoint]
Parse --> Schedule
Schedule --> Execute
Execute --> Monitor
end
NF --> Channels[Data Channels:<br/>Pass Files Between Processes]
Channels --> Processes[Execute Processes]
style NF fill:#87CEEB
What is Nextflow?
Nextflow is a workflow orchestration system specifically designed for data-intensive computational pipelines. It:
- Manages dependencies between analysis steps
- Handles parallel execution
- Provides automatic checkpointing (resume failed runs)
- Supports multiple execution platforms (local, HPC clusters, cloud)
Key Concepts:
- Processes: Individual computational tasks (e.g., "PREDICT_EXPRESSION")
- Channels: Data streams that connect processes
- Operators: Manipulate channels (e.g.,
mix,flatten,collect)
Example Process Definition:
process PREDICT_EXPRESSION {
container "${params.container_borzoi}" // Docker image
memory 4.GB // Memory requirement
accelerator 1 // GPU requirement
input:
path vcf_filtered // Input file
path MANE // Reference data
output:
path "*_TPM.csv" // Output file pattern
script:
"""
#!/opt/conda/envs/borzoi/bin/python
# Python script here
"""
}
Channel Example:
// Mix male and female patient VCFs
txt_ch = f_var.mix(m_var).flatten()
// This creates a channel with all VCF files:
// [Patient_001.vcf, Patient_002.vcf, ...]
Complete Workflow
flowchart TD
Start([Start]) --> CheckDisease{Disease or Healthy?}
CheckDisease -->|Disease| GetStats[get_disease_stats_no_patients:<br/>Analyze UK Biobank]
CheckDisease -->|Healthy| LoadHealthy[Load Pre-computed<br/>Healthy Genomes]
GetStats --> GenM[generate_m_variants_cudf:<br/>Male Patients]
GetStats --> GenF[generate_f_variants_cudf:<br/>Female Patients]
LoadHealthy --> LoadM[Load Male<br/>Reference]
LoadHealthy --> LoadF[Load Female<br/>Reference]
GenM --> MakeVCF[make_vcfs:<br/>Generate VCF Files]
GenF --> MakeVCF
LoadM --> MakeVCF
LoadF --> MakeVCF
MakeVCF --> FilterVCF[FILTER_VCF:<br/>Extract Coding Variants]
MakeVCF --> VCF2Prot[VCF2PROT:<br/>Generate Mutated Proteins]
FilterVCF --> PredictExpr[PREDICT_EXPRESSION:<br/>Borzoi RNA Prediction]
PredictExpr --> RNA2Prot[RNA2PROTEXPRESSION:<br/>Protein Prediction]
PredictExpr --> CORTO[CORTO:<br/>Metabolome Prediction]
PredictExpr --> Convert[CONVERT_TO_TXT:<br/>Format Conversion]
Convert --> CiberFrac[CIBERSORTx_FRACTIONS:<br/>Cell Proportions]
CiberFrac --> CiberHires[CIBERSORTx_HIRES:<br/>Cell-Specific Expression]
CiberHires --> AddTissue[ADD_TISSUE_NAMES_TO_CIBERSORTX:<br/>Annotate Results]
RNA2Prot --> End([Complete<br/>Digital Patient])
CORTO --> End
AddTissue --> End
VCF2Prot --> End
style Start fill:#90EE90
style End fill:#FFB6C1
style PredictExpr fill:#87CEEB
Execution Example
1. Configuration (params.json)
{
"disease": "schizophrenia",
"n_pat": 10,
"percent_male": 0.5,
"container_borzoi": "harbor.cluster.omic.ai/omic/digital-patients/borzoi:latest"
}
2. Launch Pipeline
nextflow run test.nf -params-file params.json
3. Nextflow Execution
N E X T F L O W ~ version 21.04.0
Launching `test.nf` [amazing_babbage] - revision: 1a2b3c4d
[Synthea] Submitted process > get_disease_stats_no_patients
[Synthea] Submitted process > generate_m_variants_cudf (1)
[Synthea] Submitted process > generate_f_variants_cudf (1)
[Stage] Completed process > make_vcfs (10 files)
[Borzoi] Submitted process > FILTER_VCF (10)
[Borzoi] Submitted process > PREDICT_EXPRESSION (10)
...
Pipeline completed successfully!
4. Directory Structure
/outdir/
├── vcf2prot/
│ ├── Patient_001_transcript_id_mutations.fasta
│ └── Patient_002_transcript_id_mutations.fasta
├── borzoi/
│ ├── Patient_001_TPM.csv
│ └── Patient_002_TPM.csv
├── rna2protein/
│ ├── Patient_001_Protein_Expression_log2.csv
│ └── Patient_002_Protein_Expression_log2.csv
├── corto/
│ ├── Patient_001_metabolome.csv
│ └── Patient_002_metabolome.csv
└── ecotyper/
├── fractions/
│ └── Patient_001_CIBERSORTx_Results.txt
└── hires/
└── Patient_001_immune_cells.csv
Technical Architecture
Docker Containers
Each pipeline component runs in an isolated Docker container with specific dependencies:
flowchart LR
subgraph Docker Images
Synthea_Img[Synthea Container:<br/>- Java JDK<br/>- Python<br/>- BCFtools<br/>- GATK]
Borzoi_Img[Borzoi Container:<br/>- TensorFlow<br/>- PyTorch<br/>- Baskerville<br/>- Python packages]
VCF2Prot_Img[VCF2Prot Container:<br/>- BCFtools<br/>- vcf2prot binary<br/>- Reference genomes]
RNA2Prot_Img[RNA2Protein Container:<br/>- PyTorch<br/>- Deep learning model<br/>- GO annotations]
CORTO_Img[CORTO Container:<br/>- R<br/>- corto package<br/>- Regulon data]
CIBERSORTx_Img[CIBERSORTx Container:<br/>- Python<br/>- R<br/>- CIBERSORTx binaries<br/>- Signature matrices]
end
Registry[Container Registry:<br/>harbor.cluster.omic.ai] --> Synthea_Img
Registry --> Borzoi_Img
Registry --> VCF2Prot_Img
Registry --> RNA2Prot_Img
Registry --> CORTO_Img
Registry --> CIBERSORTx_Img
Why Docker?
- Reproducibility: Same environment every run
- Isolation: Avoid dependency conflicts
- Portability: Run anywhere (laptop, cluster, cloud)
Example Dockerfile (Borzoi):
FROM tensorflow/tensorflow:2.12.0-gpu
# Install conda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
# Create borzoi environment
RUN conda create -n borzoi python=3.9
RUN conda install -n borzoi tensorflow-gpu pandas numpy pysam
# Install baskerville (Borzoi's framework)
RUN git clone https://github.com/calico/borzoi.git /home/omic/borzoi
RUN pip install -e /home/omic/borzoi
# Download pre-trained models
RUN mkdir -p /home/omic/borzoi/saved_models
RUN wget <model_url> -O /home/omic/borzoi/saved_models/f0/model0_best.h5
# Set entrypoint
CMD ["/bin/bash"]
Data Flow Architecture
flowchart TD
subgraph Input Data
UKBB[UK Biobank:<br/>Genetic Variants]
MANE[MANE Transcripts:<br/>Reference Sequences]
RefGenome[Reference Genome:<br/>GRCh38]
Regulon[Regulon Database:<br/>TF-Gene Networks]
LM22[LM22 Signature Matrix:<br/>Immune Cell Markers]
end
subgraph Processing
Synthea --> |VCF| Filter
Filter --> |Filtered VCF| Borzoi
Filter --> |Filtered VCF| VCF2Prot
Borzoi --> |TPM CSV| RNA2Prot
Borzoi --> |TPM CSV| CORTO
Borzoi --> |TPM CSV| CIBERSORTx
end
UKBB --> Synthea
MANE --> Borzoi
MANE --> VCF2Prot
RefGenome --> VCF2Prot
Regulon --> CORTO
LM22 --> CIBERSORTx
subgraph Output Data
RNA2Prot --> |Protein Expression| Results
CORTO --> |Metabolome| Results
CIBERSORTx --> |Immune Cells| Results
VCF2Prot --> |Mutated Proteins| Results
end
Computational Requirements
| Process | CPU | RAM | GPU | Time (per patient) |
|---|---|---|---|---|
| Synthea | 2 cores | 4 GB | No | ~5 min |
| FILTER_VCF | 4 cores | 4 GB | No | ~2 min |
| PREDICT_EXPRESSION | 8 cores | 4 GB | Yes (1x) | ~30-60 min |
| VCF2PROT | 2 cores | 2 GB | No | ~10 min |
| RNA2PROTEXPRESSION | 4 cores | 2 GB | Yes (1x) | ~5 min |
| CORTO | 2 cores | 1 GB | No | ~3 min |
| CIBERSORTx | 4 cores | 4 GB | No | ~15 min |
Total time per patient: ~70-90 minutes
Parallelization:
Nextflow automatically parallelizes patient processing:
- 10 patients with 4 GPUs → ~20-25 minutes total
- Patients are processed independently
Outputs and Applications
Complete Digital Patient Profile
flowchart LR
Patient[Digital Patient:<br/>Patient_001] --> Genome[Genomic Profile]
Patient --> Transcriptome[Transcriptomic Profile]
Patient --> Proteome[Proteomic Profile]
Patient --> Metabolome[Metabolomic Profile]
Patient --> Immune[Immune Profile]
Genome --> G1[Genetic Variants:<br/>SNPs, Indels]
Genome --> G2[Disease-Associated<br/>Mutations]
Transcriptome --> T1[RNA Expression:<br/>20,000+ genes]
Transcriptome --> T2[Tissue-Specific<br/>Expression]
Proteome --> P1[Protein Abundance:<br/>10,000+ proteins]
Proteome --> P2[Mutated Protein<br/>Sequences]
Metabolome --> M1[Pathway Activity:<br/>Metabolism]
Metabolome --> M2[Master Regulator<br/>TFs]
Immune --> I1[Cell Composition:<br/>T-cells, B-cells, etc.]
Immune --> I2[Cell-Specific<br/>Expression]
style Patient fill:#FFD700
Example Application: Cancer Research
flowchart TD
Question[Research Question:<br/>How do BRCA1 mutations<br/>affect breast cancer?]
Question --> Generate[Generate 100 synthetic<br/>patients with BRCA1<br/>mutations]
Generate --> Analyze{Analyze Digital Patients}
Analyze --> RNA[RNA Analysis:<br/>Find genes<br/>co-expressed with BRCA1]
Analyze --> Protein[Protein Analysis:<br/>Identify altered<br/>pathways]
Analyze --> Metabolome[Metabolome Analysis:<br/>Detect metabolic<br/>shifts]
Analyze --> Immune[Immune Analysis:<br/>Characterize immune<br/>infiltration]
RNA --> Insight[Insights into<br/>Disease Mechanisms]
Protein --> Insight
Metabolome --> Insight
Immune --> Insight
Insight --> Drug[Drug Target<br/>Discovery]
Insight --> Biomarker[Biomarker<br/>Identification]
style Question fill:#FFB6C1
style Insight fill:#90EE90
Output Files Summary
1. Genetic Variants (VCF)
- File:
Patient_001_variants.vcf - Content: All genetic variants for patient
- Use: Understand genetic basis of disease
2. RNA Expression (TPM)
- File:
Patient_001_TPM.csv - Content: Gene expression across 89 tissues
- Use: Identify dysregulated genes, tissue-specific effects
3. Protein Expression
- File:
Patient_001_Protein_Expression_log2.csv - Content: Predicted protein abundance
- Use: Understand functional consequences of RNA changes
4. Mutated Proteins (FASTA)
- File:
Patient_001_transcript_id_mutations.fasta - Content: Protein sequences with mutations
- Use: Study structural changes, predict drug binding
5. Metabolome
- File:
Patient_001_metabolome.csv - Content: Pathway activity scores
- Use: Understand metabolic reprogramming
6. Immune Cells
- File:
Patient_001_immune_cells.csv - Content: Cell type proportions and expression
- Use: Characterize immune microenvironment
Research Applications
-
Disease Mechanism Discovery
- Generate patients with specific mutations
- Compare to healthy controls
- Identify molecular changes caused by mutations
-
Drug Target Identification
- Find genes/proteins consistently altered across patients
- Prioritize targets for therapeutic intervention
-
Biomarker Discovery
- Identify molecular signatures distinguishing diseased from healthy
- Develop diagnostic tests
-
Precision Medicine
- Model individual patient molecular profiles
- Predict treatment response
- Personalize therapy
-
Clinical Trial Simulation
- Generate virtual patient cohorts
- Test hypotheses before expensive trials
- Power calculations and study design
-
Education and Training
- Teach students about multi-omics analysis
- No patient privacy concerns
- Unlimited data generation
Conclusion
This Digital Patient Pipeline represents a cutting-edge integration of:
- Synthetic biology: Realistic patient simulation
- Deep learning: RNA expression prediction from DNA
- Bioinformatics: Multi-omic data integration
- Workflow engineering: Scalable, reproducible pipelines
By combining these technologies, researchers can:
- Study disease mechanisms without accessing sensitive patient data
- Generate unlimited data for hypothesis testing
- Model personalized molecular phenotypes
- Accelerate drug discovery and precision medicine
The pipeline produces comprehensive molecular profiles spanning:
- Genomics (DNA variants)
- Transcriptomics (RNA expression)
- Proteomics (protein abundance)
- Metabolomics (metabolic activity)
- Immunomics (immune cell composition)
This multi-omic integration provides unprecedented insight into how genetic variants cascade through biological systems to produce disease phenotypes.
Glossary of Terms
Bioinformatics Terms:
- TPM: Transcripts Per Million - normalized measure of RNA abundance
- VCF: Variant Call Format - standard format for genetic variants
- FASTA: Text format for representing DNA/protein sequences
- SNP: Single Nucleotide Polymorphism - single base DNA variant
- Indel: Insertion or Deletion - adding or removing DNA bases
- Exon: Protein-coding segment of gene
- Intron: Non-coding segment removed during RNA processing
- Transcription: DNA → RNA
- Translation: RNA → Protein
- Gene Expression: Process of making protein from gene
Machine Learning Terms:
- Neural Network: Computer model inspired by brain structure
- Convolutional Layer: Detects local patterns in sequences
- Attention Layer: Learns long-range relationships
- Training: Teaching model from example data
- Prediction: Using model on new data
Pipeline Terms:
- Process: Individual computational step
- Channel: Data stream connecting processes
- Container: Isolated environment with dependencies
- Workflow: Series of connected processes
- Nextflow: Workflow orchestration system
Biological Terms:
- Genome: Complete set of DNA in organism
- Transcriptome: Complete set of RNA molecules
- Proteome: Complete set of proteins
- Metabolome: Complete set of metabolites
- Phenotype: Observable characteristics of organism
References
Pipeline Components
-
Nextflow: Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35:316-319.
-
Synthea: Walone et al. (2017). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. JAMIA, 25(3):230-238.
-
Borzoi: Linder et al. (2023). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nature Genetics, 56:164-173.
-
CIBERSORTx: Newman et al. (2019). Determining cell-type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology, 37:773-782.
-
CORTO: Mercatelli et al. (2020). corto: a lightweight R package for gene network inference and master regulator analysis. Bioinformatics, 36(12):3916-3917.
Biological Background
-
Gene Structure: Lim et al. (2018). The exon-intron gene structure upstream of the initiation codon. Nucleic Acids Research, 46(5):2232-2244.
-
TPM Normalization: Zhao et al. (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA, 26(8):903-909.
-
VCF Format: Danecek et al. (2011). The variant call format and VCFtools. Bioinformatics, 27(15):2156-2158.
-
MANE: Morales et al. (2022). A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature, 604:310-315.
This documentation was created to provide comprehensive understanding of a complex bioinformatics pipeline for researchers without extensive biological background. For questions or clarifications, please consult the cited references or contact the pipeline maintainers.