Files

Olamide Isreal 9e6a16c19b Initial commit: digital-patients pipeline (clean, no large files)

Large reference/model files excluded from repo - to be staged to S3 or baked into Docker images.

2026-03-26 15:15:23 +01:00

README.md

Initial commit: digital-patients pipeline (clean, no large files)

2026-03-26 15:15:23 +01:00

README.md

Digital Patient Pipeline: Complete Documentation

Introduction
Pipeline Overview
Biological Background
Pipeline Components
Workflow Execution
Technical Architecture
Outputs and Applications

Introduction

This document provides an exhaustive explanation of a Digital Patient Pipeline - a sophisticated bioinformatics workflow that generates synthetic patient data and predicts multiple layers of molecular biology from genomic variants. The pipeline is implemented using Nextflow, a workflow orchestration system, and integrates multiple cutting-edge computational biology tools to simulate how genetic mutations affect gene expression, protein production, immune cell composition, and metabolic activity.

Purpose

The Digital Patient Pipeline serves several critical purposes:

Synthetic Patient Generation: Creates realistic but synthetic patient profiles with genetic variants associated with specific diseases
Multi-Omic Prediction: Predicts gene expression (RNA), protein abundance, immune cell composition, and metabolic activity from DNA sequence alone
Clinical Research: Enables researchers to study disease mechanisms without accessing sensitive patient data
Personalized Medicine: Models how individual genetic variants affect molecular phenotypes

Pipeline Overview

flowchart TD
    Start([Start Pipeline]) --> Decision{Patient Type?}
    Decision -->|Disease| Synthea[Synthea: Generate Disease Patients]
    Decision -->|Healthy| Healthy[Synthea: Generate Healthy Patients]

    Synthea --> VCF[VCF Files: Genetic Variants]
    Healthy --> VCF

    VCF --> FilterVCF[Filter VCF: Extract Coding Variants]
    VCF --> VCF2Prot[VCF2Prot: Generate Mutated Proteins]

    FilterVCF --> Borzoi[Borzoi: Predict RNA Expression]

    Borzoi --> RNA2Prot[RNA2Protein: Predict Protein Expression]
    Borzoi --> CORTO[CORTO: Predict Metabolome]
    Borzoi --> CIBERSORTx[CIBERSORTx: Predict Immune Cells]

    RNA2Prot --> Output1[Protein Expression Profiles]
    CORTO --> Output2[Metabolic Activity Profiles]
    CIBERSORTx --> Output3[Immune Cell Composition]
    VCF2Prot --> Output4[Mutated Protein Sequences]

    Output1 --> End([Complete Digital Patient])
    Output2 --> End
    Output3 --> End
    Output4 --> End

    style Start fill:#90EE90
    style End fill:#FFB6C1
    style Borzoi fill:#87CEEB
    style Synthea fill:#DDA0DD

Pipeline Flow Summary

Patient Generation: Synthea generates synthetic patients with realistic genetic variants
Variant Processing: VCF files containing genetic mutations are filtered and processed
RNA Expression Prediction: Borzoi predicts how mutations affect gene expression
Downstream Analysis: Multiple tools analyze predicted RNA to generate comprehensive molecular profiles
Integration: Results are combined to create a complete "digital patient"

Biological Background

To understand this pipeline, we need to understand the central dogma of molecular biology and how genetic information flows through biological systems.

The Central Dogma: DNA → RNA → Protein

flowchart LR
    DNA[DNA: Genetic Code] -->|Transcription| RNA[RNA: Message]
    RNA -->|Translation| Protein[Protein: Function]
    Protein --> Phenotype[Cellular Phenotype]

    style DNA fill:#FFE4B5
    style RNA fill:#E0FFFF
    style Protein fill:#FFE4E1
    style Phenotype fill:#F0E68C

1. DNA (Deoxyribonucleic Acid)

DNA is the blueprint of life, containing the genetic instructions for all cellular functions. DNA consists of:

Four nucleotide bases: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
Double helix structure: Two complementary strands wound together
Genes: Specific segments of DNA that encode instructions for proteins

Example DNA Sequence: ATGCGATCCGTA

2. RNA (Ribonucleic Acid)

During transcription, DNA is copied into RNA:

RNA polymerase enzyme reads DNA and creates a complementary RNA strand
RNA uses Uracil (U) instead of Thymine (T)
The RNA carries the genetic message from the nucleus to protein-making machinery

Example RNA Sequence: AUGCGAUCCGUA (from DNA above)

3. Proteins

During translation, RNA is decoded to build proteins:

Ribosomes read RNA in groups of three bases called codons
Each codon specifies one amino acid
Amino acids chain together to form proteins

Example: AUG → Methionine, CGA → Arginine, UCC → Serine, GUA → Valine

Protein: Methionine-Arginine-Serine-Valine

Gene Structure

Genes in eukaryotes (organisms with nuclei, like humans) have a complex structure:

flowchart LR
    subgraph Gene Structure
        Promoter[Promoter: -1000bp] --> UTR5[5' UTR]
        UTR5 --> Exon1[Exon 1]
        Exon1 --> Intron1[Intron 1]
        Intron1 --> Exon2[Exon 2]
        Exon2 --> Intron2[Intron 2]
        Intron2 --> Exon3[Exon 3]
        Exon3 --> UTR3[3' UTR]
    end

    Exon1 -.->|Splicing| mRNA[Mature mRNA]
    Exon2 -.-> mRNA
    Exon3 -.-> mRNA

    style Exon1 fill:#90EE90
    style Exon2 fill:#90EE90
    style Exon3 fill:#90EE90
    style Intron1 fill:#FFB6C1
    style Intron2 fill:#FFB6C1
    style Promoter fill:#FFD700

Key Components:

Promoter: Regulatory region upstream of gene (-1000 bp) that controls when the gene is turned on
5' UTR (Untranslated Region): Beginning of RNA that isn't translated into protein
Exons: Segments that ARE kept in the final RNA and code for protein
Introns: Segments that are REMOVED during RNA processing
3' UTR: End region of RNA that isn't translated

Why this matters: The Borzoi model in this pipeline predicts which parts of genes will be transcribed into RNA, including both exons and introns, before they're processed.

Genetic Variants and Their Effects

flowchart TD
    Variant[Genetic Variant] --> Type{Type?}

    Type -->|SNP| SNP[Single Nucleotide<br/>Polymorphism:<br/>A->G]
    Type -->|Insertion| INS[Insertion:<br/>ATCG->ATCGGG]
    Type -->|Deletion| DEL[Deletion:<br/>ATCG->A]

    SNP --> Effect1{Location?}
    INS --> Effect1
    DEL --> Effect1

    Effect1 -->|Promoter| E1[Changes expression level]
    Effect1 -->|Exon| E2[Changes protein sequence]
    Effect1 -->|Splice site| E3[Changes splicing pattern]
    Effect1 -->|Intron| E4[May affect regulation]

    style Variant fill:#FFB6C1
    style E1 fill:#87CEEB
    style E2 fill:#87CEEB
    style E3 fill:#87CEEB
    style E4 fill:#87CEEB

Variants are differences in DNA sequence between individuals:

SNP (Single Nucleotide Polymorphism): Single base change (e.g., A→G)
Insertion: Extra bases added
Deletion: Bases removed
Structural Variant: Large-scale DNA rearrangements

These variants can affect:

Gene expression: How much RNA is made
Protein sequence: Which amino acids are in the protein
Splicing: Which exons are included in mature RNA

VCF Format: Storing Genetic Variants

The VCF (Variant Call Format) is a standardized text file format that stores genetic variants:

#CHROM  POS     ID      REF  ALT     QUAL  FILTER  INFO
chr1    12345   rs123   A    G       100   PASS    DP=50
chr2    67890   rs456   TC   T       95    PASS    DP=45

Columns explained:

CHROM: Chromosome where variant is located (chr1, chr2, etc.)
POS: Position on the chromosome (base pair number)
ID: Database identifier (often from dbSNP database)
REF: Reference base(s) at this position
ALT: Alternate base(s) - the variant
QUAL: Quality score for the variant call
FILTER: Whether variant passed quality filters
INFO: Additional information (e.g., read depth)

Pipeline Components

1. Synthea: Synthetic Patient Generator

flowchart LR
    Input[Input Parameters] --> Synthea{Synthea Engine}

    subgraph Input Parameters
        Disease[Disease Type:<br/>schizophrenia, cancer, etc.]
        Demographics[Demographics:<br/>Age, Gender, Location]
        N[Number of Patients]
    end

    Synthea --> UKBB[UK Biobank<br/>Genetic Database]
    UKBB --> Disease_Variants[Disease-Associated<br/>Genetic Variants]

    Disease_Variants --> VCF_Out[VCF Files<br/>Patient_001.vcf<br/>Patient_002.vcf]

    style Synthea fill:#DDA0DD
    style VCF_Out fill:#90EE90

What is Synthea?

Synthea™ is an open-source synthetic patient generator that creates realistic but completely fake patient data. It models:

Medical history
Demographics (age, gender, location)
Health conditions
Medications and treatments
Genetic variants

How it works in this pipeline:

User specifies disease: For example, "schizophrenia" or "healthy"
Statistical analysis: Synthea analyzes the UK Biobank database (a large collection of genetic data from ~500,000 individuals) to find variants statistically associated with the disease
Probability-based sampling: Variants are selected based on their frequency in diseased vs. healthy populations
VCF generation: Creates a VCF file for each synthetic patient containing their unique set of genetic variants

Parameters:

params.disease = 'schizophrenia'  // Disease to model
params.n_pat = 10                 // Number of patients to generate
params.percent_male = 0.5         // Gender distribution

For Healthy Patients:

If generating healthy controls, Synthea samples from pre-computed reference genomes representing the genetic diversity of healthy populations.

Output Example:

Patient_001_variants.vcf
Patient_002_variants.vcf
...
Patient_010_variants.vcf

2. Borzoi: RNA-seq Prediction from DNA

flowchart TD
    DNA[DNA Sequence<br/>524,288 bp] --> Borzoi[Borzoi Neural Network]

    subgraph Borzoi Architecture
        Conv1[Convolutional Layers:<br/>Learn local patterns]
        Conv1 --> Attention[Self-Attention Layers:<br/>Learn long-range interactions]
        Attention --> Upsample[Upsampling Layers:<br/>Increase resolution]
        Upsample --> Output_Layer[Output: RNA Coverage<br/>32 bp resolution]
    end

    Borzoi --> Tissues[Predictions for 89 Tissues/Cell Types]

    Tissues --> TPM[TPM Values:<br/>Transcripts Per Million]

    style Borzoi fill:#87CEEB
    style TPM fill:#90EE90

What is Borzoi?

Borzoi is a deep learning model developed at Calico Life Sciences that predicts RNA-seq coverage (how much RNA is produced from each part of the genome) directly from DNA sequence. This is revolutionary because it:

Predicts gene expression without actually doing wet-lab RNA sequencing
Accounts for multiple layers of regulation (transcription, splicing, polyadenylation)
Provides tissue-specific predictions

How does it work?

Input: 524,288 base pairs of DNA sequence (524 kb)
Neural Network Processing:
- Convolutional layers: Learn local DNA patterns (e.g., transcription factor binding sites)
- Self-attention layers: Learn long-range interactions between regulatory elements
- Upsampling layers: Increase resolution from 128 bp to 32 bp
Output: RNA coverage at 32 bp resolution for 89 different tissues/cell types

Key Concept: RNA-seq Coverage

RNA-seq coverage shows how many RNA molecules were sequenced at each position in the genome:

Position:    1000  1100  1200  1300  1400
Exon 1:      ████████████████
Intron:                       ░
Exon 2:                         ████████████
Coverage:    2.5   3.1   0.1   2.8   3.0

TPM (Transcripts Per Million)

Borzoi outputs are converted to TPM values:

TPM = (Number of reads mapped to transcript / Transcript length in kb) 
      × (1,000,000 / Total reads in sample)

Why TPM?

Normalizes for gene length (longer genes generate more reads)
Normalizes for sequencing depth (accounts for total number of reads)
Comparable across genes within a sample

Pipeline Implementation:

The pipeline has two Borzoi processes:

Process 1: FILTER_VCF

# Extract variants in coding regions + 1000 bp upstream regulatory regions
# Create filtered VCF containing only protein-coding variants

Process 2: PREDICT_EXPRESSION

# For each protein-coding gene with mutations:
#   1. Extract DNA sequence (reference + mutations)
#   2. Run Borzoi to predict RNA coverage
#   3. Calculate TPM by summing coverage over exons
#   4. Generate TPM matrix: Genes × Tissues

Example Output:

Gene,Adipose_Tissue,Brain_Cortex,Heart,Liver,Muscle
BRCA1,12.5,8.3,5.2,15.7,7.9
TP53,45.2,52.1,38.9,42.3,35.6
APOE,8.7,125.3,6.1,78.2,5.4

This table shows predicted RNA expression (TPM) for each gene in each tissue.

MANE Dataset

The pipeline uses the MANE (Matched Annotation from NCBI and EMBL-EBI) dataset:

Contains reference transcript sequences for all human protein-coding genes
Provides consensus between RefSeq and Ensembl/GENCODE annotations
Includes exon/intron boundaries needed for TPM calculation

3. VCF2Prot: DNA Variants to Protein Sequences

flowchart TD
    VCF[VCF File:<br/>Genetic Variants] --> Annotate[BCFtools CSQ:<br/>Annotate Variants]
    Reference[Reference Genome<br/>GRCh38] --> Annotate
    GFF[Gene Annotations<br/>GFF3 Format] --> Annotate

    Annotate --> Annotated_VCF[Annotated VCF:<br/>Functional Consequences]

    Annotated_VCF --> VCF2Prot[VCF2Prot Tool]
    MANE_Ref[MANE Reference<br/>Transcripts] --> VCF2Prot

    VCF2Prot --> Mutated_Proteins[Mutated Protein<br/>Sequences FASTA]

    style VCF2Prot fill:#FFB6C1
    style Mutated_Proteins fill:#90EE90

What is VCF2Prot?

VCF2Prot is a tool that translates DNA variants into their effects on protein sequences. It:

Takes variants from VCF files
Maps them to gene transcripts
Predicts how variants change the protein sequence
Outputs mutated protein sequences

Process Flow:

Variant Annotation (BCFtools CSQ)
- Maps variants to genes and transcripts
- Determines functional consequence:
  - Missense: Changes one amino acid
  - Nonsense: Creates premature stop codon
  - Frameshift: Shifts reading frame
  - Synonymous: No change to amino acid
Protein Sequence Prediction (VCF2Prot)
- Loads reference protein sequences from MANE
- Applies variants to generate mutated sequences
- Handles complex variants (insertions, deletions)

Example:

Reference DNA:  ATG GCT AAA TGC
Reference RNA:  AUG GCU AAA UGC
Reference Prot: Met-Ala-Lys-Cys

Variant: Position 5, G→T

Mutant DNA:     ATG TCT AAA TGC
Mutant RNA:     AUG UCU AAA UGC  
Mutant Prot:    Met-Ser-Lys-Cys
                    ^^^
                Changed amino acid!

Output Format: FASTA

>Patient_001_ENST00000357654_BRCA1_p.G1738R
MSLQSQLFKQRQYLSIKTKRSTKEVLDATLIHQSITGLYETRIDLSQLGGD...
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIE...

Each sequence shows:

Patient ID
Transcript ID
Gene name
Protein variant notation
Full mutated protein sequence

4. RNA2ProteinExpression: RNA to Protein Prediction

flowchart TD
    RNA_TPM[RNA TPM Values<br/>from Borzoi] --> Model[Deep Learning Model]

    subgraph Neural Network
        Input[Input Layer:<br/>RNA Expression]
        Hidden1[Hidden Layer 1:<br/>Tissue Context]
        Hidden2[Hidden Layer 2:<br/>Translation Efficiency]
        Output[Output Layer:<br/>Protein Abundance]

        Input --> Hidden1
        Hidden1 --> Hidden2
        Hidden2 --> Output
    end

    Model --> Protein_Expr[Protein Expression<br/>Log2 Scale]

    style Model fill:#FFE4E1
    style Protein_Expr fill:#90EE90

What is RNA2ProteinExpression?

This is a custom deep learning model that predicts protein abundance from RNA expression levels. It's trained on:

RNA-seq data (transcript levels)
Mass spectrometry data (protein levels)
Gene Ontology (GO) annotations

Why is this needed?

RNA levels don't perfectly correlate with protein levels because:

Translation efficiency varies between genes
Protein stability varies (some proteins are rapidly degraded)
Post-transcriptional regulation (microRNAs, RNA-binding proteins)

Typical RNA-protein correlation: r = 0.4-0.6 (not 1.0!)

Model Architecture:

The neural network learns:

Which genes have high vs. low translation efficiency
Tissue-specific effects on protein production
GO term enrichments that affect protein stability

Input:

Gene_ID, Tissue, RNA_TPM
ENSG00000012048, Brain_Cortex, 125.3
ENSG00000012048, Liver, 78.2

Output:

Gene_ID, Tissue, Protein_Expression_log2
ENSG00000012048, Brain_Cortex, 8.5
ENSG00000012048, Liver, 7.2

Log2 scale: Protein expression in log2 transformed units (easier to interpret fold-changes)

5. CORTO: Metabolome Prediction

flowchart TD
    RNA_TPM[RNA TPM Matrix] --> CORTO[CORTO Algorithm]
    Regulon[Regulon Data:<br/>TF-Gene Relationships] --> CORTO

    subgraph CORTO Algorithm
        Correlation[1. Calculate Correlations:<br/>TF <-> Metabolic Genes]
        DPI[2. Data Processing Inequality:<br/>Remove Indirect Edges]
        Bootstrap[3. Bootstrap:<br/>Assess Robustness]
        MRA[4. Master Regulator Analysis:<br/>Identify Key TFs]

        Correlation --> DPI
        DPI --> Bootstrap
        Bootstrap --> MRA
    end

    CORTO --> Metabolome[Metabolome Predictions:<br/>Metabolic Activity]

    style CORTO fill:#F0E68C
    style Metabolome fill:#90EE90

What is CORTO?

CORTO (Correlation Tool) is an R package that infers gene regulatory networks and identifies master regulators controlling metabolic activity. It predicts:

Activity of metabolic pathways
Transcription factors (TFs) controlling metabolism
Metabolite production levels

How it works:

Input Regulon: Pre-defined relationships between transcription factors (TFs) and their target genes (including metabolic enzymes)
Correlation Analysis: Calculate how TF expression correlates with target gene expression
Data Processing Inequality (DPI): Remove indirect relationships
- If TF1 → TF2 → Gene, remove direct TF1 → Gene edge
- Keeps only direct regulatory relationships
Bootstrap: Test robustness by resampling data
Master Regulator Analysis (MRA): Identify TFs whose target genes are significantly enriched in metabolic pathways

Example:

TF: PPARG (master regulator of fat metabolism)
Target Genes: FABP4, LPL, ADIPOQ, CD36, SCD (all involved in lipid metabolism)
Metabolome Prediction: High lipid synthesis activity

Output:

Pathway,Activity_Score,P_value
Glycolysis,-1.5,0.001
TCA_Cycle,2.3,0.0001  
Fatty_Acid_Synthesis,1.8,0.002

Activity Score: Positive = pathway activated, Negative = pathway suppressed
P-value: Statistical significance

6. CIBERSORTx: Immune Cell Deconvolution

flowchart TD
    RNA_TPM[Bulk RNA TPM<br/>Mixed Cell Types] --> Signature[Signature Matrix:<br/>Cell-Specific Genes]

    subgraph CIBERSORTx
        Fractions[Step 1: Fractions<br/>Estimate Cell Proportions]
        HiRes[Step 2: HiRes<br/>Cell-Specific Expression]
    end

    RNA_TPM --> Fractions
    Signature --> Fractions

    Fractions --> Proportions[Cell Type Proportions]

    Proportions --> HiRes
    RNA_TPM --> HiRes
    Signature --> HiRes

    HiRes --> Cell_Specific[Cell-Type-Specific<br/>Gene Expression]

    style CIBERSORTx fill:#DDA0DD
    style Proportions fill:#90EE90
    style Cell_Specific fill:#90EE90

What is CIBERSORTx?

CIBERSORTx is a computational tool for immune cell deconvolution. When you sequence RNA from a tissue sample, you get a mixture of RNA from all cells in that tissue. CIBERSORTx:

Estimates what proportion of cells are each immune cell type
Infers cell-type-specific gene expression profiles

Why is this important?

Immune cells play crucial roles in:

Fighting infections
Cancer immunotherapy response
Autoimmune diseases
Inflammation

Understanding immune composition helps interpret disease mechanisms.

How it works:

Step 1: Signature Matrix

A reference matrix showing genes specifically expressed in each cell type:

Gene       T_cells  B_cells  Macrophages  NK_cells
CD3D       HIGH     low      low          low
CD19       low      HIGH     low          low
CD68       low      low      HIGH         low
NKG7       low      low      low          HIGH

Step 2: CIBERSORTx Fractions

Uses Support Vector Regression (SVR) to solve:

Bulk_Expression = Σ (Proportion_i × Signature_i)

Where:

Bulk_Expression = measured RNA in tissue
Proportion_i = fraction of cell type i
Signature_i = expression pattern of cell type i

Step 3: CIBERSORTx HiRes

After knowing proportions, infer gene expression within each cell type by:

Modeling tissue expression as weighted sum of cell-type contributions
Deconvolving to separate cell-type-specific signals

Example Output:

Fractions:

Sample,CD8_T_cells,CD4_T_cells,B_cells,NK_cells,Monocytes
Patient_001_Brain,0.05,0.08,0.02,0.01,0.15
Patient_001_Liver,0.12,0.15,0.08,0.03,0.22

HiRes:

Tissue,Cell_Type,CD3D,CD19,CD68
Brain_Patient_001,CD8_T_cells,HIGH,low,low
Brain_Patient_001,B_cells,low,HIGH,low

Pipeline Implementation:

CONVERT_TO_TXT: Convert CSV to tab-delimited format (CIBERSORTx input format)
CIBERSORTx_FRACTIONS: Estimate cell proportions
CIBERSORTx_HIRES: Infer cell-specific expression
ADD_TISSUE_NAMES: Add tissue annotations to output

Workflow Execution

Nextflow: Workflow Orchestration

flowchart TD
    Config[nextflow.config:<br/>Configuration] --> NF[Nextflow Engine]
    Params[params.json:<br/>Parameters] --> NF

    subgraph Nextflow Engine
        Parse[Parse Workflow DSL]
        Schedule[Schedule Processes]
        Execute[Execute in Docker/Singularity]
        Monitor[Monitor & Checkpoint]

        Parse --> Schedule
        Schedule --> Execute
        Execute --> Monitor
    end

    NF --> Channels[Data Channels:<br/>Pass Files Between Processes]
    Channels --> Processes[Execute Processes]

    style NF fill:#87CEEB

What is Nextflow?

Nextflow is a workflow orchestration system specifically designed for data-intensive computational pipelines. It:

Manages dependencies between analysis steps
Handles parallel execution
Provides automatic checkpointing (resume failed runs)
Supports multiple execution platforms (local, HPC clusters, cloud)

Key Concepts:

Processes: Individual computational tasks (e.g., "PREDICT_EXPRESSION")
Channels: Data streams that connect processes
Operators: Manipulate channels (e.g., mix, flatten, collect)

Example Process Definition:

process PREDICT_EXPRESSION {
    container "${params.container_borzoi}"  // Docker image
    memory 4.GB                              // Memory requirement
    accelerator 1                            // GPU requirement

    input:
        path vcf_filtered                    // Input file
        path MANE                            // Reference data

    output:
        path "*_TPM.csv"                     // Output file pattern

    script:
    """
    #!/opt/conda/envs/borzoi/bin/python
    # Python script here
    """
}

Channel Example:

// Mix male and female patient VCFs
txt_ch = f_var.mix(m_var).flatten()

// This creates a channel with all VCF files:
// [Patient_001.vcf, Patient_002.vcf, ...]

Complete Workflow

flowchart TD
    Start([Start]) --> CheckDisease{Disease or Healthy?}

    CheckDisease -->|Disease| GetStats[get_disease_stats_no_patients:<br/>Analyze UK Biobank]
    CheckDisease -->|Healthy| LoadHealthy[Load Pre-computed<br/>Healthy Genomes]

    GetStats --> GenM[generate_m_variants_cudf:<br/>Male Patients]
    GetStats --> GenF[generate_f_variants_cudf:<br/>Female Patients]

    LoadHealthy --> LoadM[Load Male<br/>Reference]
    LoadHealthy --> LoadF[Load Female<br/>Reference]

    GenM --> MakeVCF[make_vcfs:<br/>Generate VCF Files]
    GenF --> MakeVCF
    LoadM --> MakeVCF
    LoadF --> MakeVCF

    MakeVCF --> FilterVCF[FILTER_VCF:<br/>Extract Coding Variants]
    MakeVCF --> VCF2Prot[VCF2PROT:<br/>Generate Mutated Proteins]

    FilterVCF --> PredictExpr[PREDICT_EXPRESSION:<br/>Borzoi RNA Prediction]

    PredictExpr --> RNA2Prot[RNA2PROTEXPRESSION:<br/>Protein Prediction]
    PredictExpr --> CORTO[CORTO:<br/>Metabolome Prediction]
    PredictExpr --> Convert[CONVERT_TO_TXT:<br/>Format Conversion]

    Convert --> CiberFrac[CIBERSORTx_FRACTIONS:<br/>Cell Proportions]
    CiberFrac --> CiberHires[CIBERSORTx_HIRES:<br/>Cell-Specific Expression]
    CiberHires --> AddTissue[ADD_TISSUE_NAMES_TO_CIBERSORTX:<br/>Annotate Results]

    RNA2Prot --> End([Complete<br/>Digital Patient])
    CORTO --> End
    AddTissue --> End
    VCF2Prot --> End

    style Start fill:#90EE90
    style End fill:#FFB6C1
    style PredictExpr fill:#87CEEB

Execution Example

1. Configuration (params.json)

{
  "disease": "schizophrenia",
  "n_pat": 10,
  "percent_male": 0.5,
  "container_borzoi": "harbor.cluster.omic.ai/omic/digital-patients/borzoi:latest"
}

2. Launch Pipeline

nextflow run test.nf -params-file params.json

3. Nextflow Execution

N E X T F L O W  ~  version 21.04.0
Launching `test.nf` [amazing_babbage] - revision: 1a2b3c4d

[Synthea] Submitted process > get_disease_stats_no_patients
[Synthea] Submitted process > generate_m_variants_cudf (1)
[Synthea] Submitted process > generate_f_variants_cudf (1)
[Stage] Completed process > make_vcfs (10 files)
[Borzoi] Submitted process > FILTER_VCF (10)
[Borzoi] Submitted process > PREDICT_EXPRESSION (10)
...
Pipeline completed successfully!

4. Directory Structure

/outdir/
├── vcf2prot/
│   ├── Patient_001_transcript_id_mutations.fasta
│   └── Patient_002_transcript_id_mutations.fasta
├── borzoi/
│   ├── Patient_001_TPM.csv
│   └── Patient_002_TPM.csv
├── rna2protein/
│   ├── Patient_001_Protein_Expression_log2.csv
│   └── Patient_002_Protein_Expression_log2.csv
├── corto/
│   ├── Patient_001_metabolome.csv
│   └── Patient_002_metabolome.csv
└── ecotyper/
    ├── fractions/
    │   └── Patient_001_CIBERSORTx_Results.txt
    └── hires/
        └── Patient_001_immune_cells.csv

Technical Architecture

Docker Containers

Each pipeline component runs in an isolated Docker container with specific dependencies:

flowchart LR
    subgraph Docker Images
        Synthea_Img[Synthea Container:<br/>- Java JDK<br/>- Python<br/>- BCFtools<br/>- GATK]

        Borzoi_Img[Borzoi Container:<br/>- TensorFlow<br/>- PyTorch<br/>- Baskerville<br/>- Python packages]

        VCF2Prot_Img[VCF2Prot Container:<br/>- BCFtools<br/>- vcf2prot binary<br/>- Reference genomes]

        RNA2Prot_Img[RNA2Protein Container:<br/>- PyTorch<br/>- Deep learning model<br/>- GO annotations]

        CORTO_Img[CORTO Container:<br/>- R<br/>- corto package<br/>- Regulon data]

        CIBERSORTx_Img[CIBERSORTx Container:<br/>- Python<br/>- R<br/>- CIBERSORTx binaries<br/>- Signature matrices]
    end

    Registry[Container Registry:<br/>harbor.cluster.omic.ai] --> Synthea_Img
    Registry --> Borzoi_Img
    Registry --> VCF2Prot_Img
    Registry --> RNA2Prot_Img
    Registry --> CORTO_Img
    Registry --> CIBERSORTx_Img

Why Docker?

Reproducibility: Same environment every run
Isolation: Avoid dependency conflicts
Portability: Run anywhere (laptop, cluster, cloud)

Example Dockerfile (Borzoi):

FROM tensorflow/tensorflow:2.12.0-gpu

# Install conda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

# Create borzoi environment
RUN conda create -n borzoi python=3.9
RUN conda install -n borzoi tensorflow-gpu pandas numpy pysam

# Install baskerville (Borzoi's framework)
RUN git clone https://github.com/calico/borzoi.git /home/omic/borzoi
RUN pip install -e /home/omic/borzoi

# Download pre-trained models
RUN mkdir -p /home/omic/borzoi/saved_models
RUN wget <model_url> -O /home/omic/borzoi/saved_models/f0/model0_best.h5

# Set entrypoint
CMD ["/bin/bash"]

Data Flow Architecture

flowchart TD
    subgraph Input Data
        UKBB[UK Biobank:<br/>Genetic Variants]
        MANE[MANE Transcripts:<br/>Reference Sequences]
        RefGenome[Reference Genome:<br/>GRCh38]
        Regulon[Regulon Database:<br/>TF-Gene Networks]
        LM22[LM22 Signature Matrix:<br/>Immune Cell Markers]
    end

    subgraph Processing
        Synthea --> |VCF| Filter
        Filter --> |Filtered VCF| Borzoi
        Filter --> |Filtered VCF| VCF2Prot
        Borzoi --> |TPM CSV| RNA2Prot
        Borzoi --> |TPM CSV| CORTO
        Borzoi --> |TPM CSV| CIBERSORTx
    end

    UKBB --> Synthea
    MANE --> Borzoi
    MANE --> VCF2Prot
    RefGenome --> VCF2Prot
    Regulon --> CORTO
    LM22 --> CIBERSORTx

    subgraph Output Data
        RNA2Prot --> |Protein Expression| Results
        CORTO --> |Metabolome| Results
        CIBERSORTx --> |Immune Cells| Results
        VCF2Prot --> |Mutated Proteins| Results
    end

Computational Requirements

Process	CPU	RAM	GPU	Time (per patient)
Synthea	2 cores	4 GB	No	~5 min
FILTER_VCF	4 cores	4 GB	No	~2 min
PREDICT_EXPRESSION	8 cores	4 GB	Yes (1x)	~30-60 min
VCF2PROT	2 cores	2 GB	No	~10 min
RNA2PROTEXPRESSION	4 cores	2 GB	Yes (1x)	~5 min
CORTO	2 cores	1 GB	No	~3 min
CIBERSORTx	4 cores	4 GB	No	~15 min

Total time per patient: ~70-90 minutes

Parallelization:

Nextflow automatically parallelizes patient processing:

10 patients with 4 GPUs → ~20-25 minutes total
Patients are processed independently

Outputs and Applications

Complete Digital Patient Profile

flowchart LR
    Patient[Digital Patient:<br/>Patient_001] --> Genome[Genomic Profile]
    Patient --> Transcriptome[Transcriptomic Profile]
    Patient --> Proteome[Proteomic Profile]
    Patient --> Metabolome[Metabolomic Profile]
    Patient --> Immune[Immune Profile]

    Genome --> G1[Genetic Variants:<br/>SNPs, Indels]
    Genome --> G2[Disease-Associated<br/>Mutations]

    Transcriptome --> T1[RNA Expression:<br/>20,000+ genes]
    Transcriptome --> T2[Tissue-Specific<br/>Expression]

    Proteome --> P1[Protein Abundance:<br/>10,000+ proteins]
    Proteome --> P2[Mutated Protein<br/>Sequences]

    Metabolome --> M1[Pathway Activity:<br/>Metabolism]
    Metabolome --> M2[Master Regulator<br/>TFs]

    Immune --> I1[Cell Composition:<br/>T-cells, B-cells, etc.]
    Immune --> I2[Cell-Specific<br/>Expression]

    style Patient fill:#FFD700

Example Application: Cancer Research

flowchart TD
    Question[Research Question:<br/>How do BRCA1 mutations<br/>affect breast cancer?]

    Question --> Generate[Generate 100 synthetic<br/>patients with BRCA1<br/>mutations]

    Generate --> Analyze{Analyze Digital Patients}

    Analyze --> RNA[RNA Analysis:<br/>Find genes<br/>co-expressed with BRCA1]
    Analyze --> Protein[Protein Analysis:<br/>Identify altered<br/>pathways]
    Analyze --> Metabolome[Metabolome Analysis:<br/>Detect metabolic<br/>shifts]
    Analyze --> Immune[Immune Analysis:<br/>Characterize immune<br/>infiltration]

    RNA --> Insight[Insights into<br/>Disease Mechanisms]
    Protein --> Insight
    Metabolome --> Insight
    Immune --> Insight

    Insight --> Drug[Drug Target<br/>Discovery]
    Insight --> Biomarker[Biomarker<br/>Identification]

    style Question fill:#FFB6C1
    style Insight fill:#90EE90

Output Files Summary

1. Genetic Variants (VCF)

File: Patient_001_variants.vcf
Content: All genetic variants for patient
Use: Understand genetic basis of disease

2. RNA Expression (TPM)

File: Patient_001_TPM.csv
Content: Gene expression across 89 tissues
Use: Identify dysregulated genes, tissue-specific effects

3. Protein Expression

File: Patient_001_Protein_Expression_log2.csv
Content: Predicted protein abundance
Use: Understand functional consequences of RNA changes

4. Mutated Proteins (FASTA)

File: Patient_001_transcript_id_mutations.fasta
Content: Protein sequences with mutations
Use: Study structural changes, predict drug binding

5. Metabolome

File: Patient_001_metabolome.csv
Content: Pathway activity scores
Use: Understand metabolic reprogramming

6. Immune Cells

File: Patient_001_immune_cells.csv
Content: Cell type proportions and expression
Use: Characterize immune microenvironment

Research Applications

Disease Mechanism Discovery
- Generate patients with specific mutations
- Compare to healthy controls
- Identify molecular changes caused by mutations
Drug Target Identification
- Find genes/proteins consistently altered across patients
- Prioritize targets for therapeutic intervention
Biomarker Discovery
- Identify molecular signatures distinguishing diseased from healthy
- Develop diagnostic tests
Precision Medicine
- Model individual patient molecular profiles
- Predict treatment response
- Personalize therapy
Clinical Trial Simulation
- Generate virtual patient cohorts
- Test hypotheses before expensive trials
- Power calculations and study design
Education and Training
- Teach students about multi-omics analysis
- No patient privacy concerns
- Unlimited data generation

Conclusion

This Digital Patient Pipeline represents a cutting-edge integration of:

Synthetic biology: Realistic patient simulation
Deep learning: RNA expression prediction from DNA
Bioinformatics: Multi-omic data integration
Workflow engineering: Scalable, reproducible pipelines

By combining these technologies, researchers can:

Study disease mechanisms without accessing sensitive patient data
Generate unlimited data for hypothesis testing
Model personalized molecular phenotypes
Accelerate drug discovery and precision medicine

The pipeline produces comprehensive molecular profiles spanning:

Genomics (DNA variants)
Transcriptomics (RNA expression)
Proteomics (protein abundance)
Metabolomics (metabolic activity)
Immunomics (immune cell composition)

This multi-omic integration provides unprecedented insight into how genetic variants cascade through biological systems to produce disease phenotypes.

Glossary of Terms

Bioinformatics Terms:

TPM: Transcripts Per Million - normalized measure of RNA abundance
VCF: Variant Call Format - standard format for genetic variants
FASTA: Text format for representing DNA/protein sequences
SNP: Single Nucleotide Polymorphism - single base DNA variant
Indel: Insertion or Deletion - adding or removing DNA bases
Exon: Protein-coding segment of gene
Intron: Non-coding segment removed during RNA processing
Transcription: DNA → RNA
Translation: RNA → Protein
Gene Expression: Process of making protein from gene

Machine Learning Terms:

Neural Network: Computer model inspired by brain structure
Convolutional Layer: Detects local patterns in sequences
Attention Layer: Learns long-range relationships
Training: Teaching model from example data
Prediction: Using model on new data

Pipeline Terms:

Process: Individual computational step
Channel: Data stream connecting processes
Container: Isolated environment with dependencies
Workflow: Series of connected processes
Nextflow: Workflow orchestration system

Biological Terms:

Genome: Complete set of DNA in organism
Transcriptome: Complete set of RNA molecules
Proteome: Complete set of proteins
Metabolome: Complete set of metabolites
Phenotype: Observable characteristics of organism

References

Pipeline Components

Nextflow: Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35:316-319.
Synthea: Walone et al. (2017). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. JAMIA, 25(3):230-238.
Borzoi: Linder et al. (2023). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nature Genetics, 56:164-173.
CIBERSORTx: Newman et al. (2019). Determining cell-type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology, 37:773-782.
CORTO: Mercatelli et al. (2020). corto: a lightweight R package for gene network inference and master regulator analysis. Bioinformatics, 36(12):3916-3917.

Biological Background

Gene Structure: Lim et al. (2018). The exon-intron gene structure upstream of the initiation codon. Nucleic Acids Research, 46(5):2232-2244.
TPM Normalization: Zhao et al. (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA, 26(8):903-909.
VCF Format: Danecek et al. (2011). The variant call format and VCFtools. Bioinformatics, 27(15):2156-2158.
MANE: Morales et al. (2022). A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature, 604:310-315.

This documentation was created to provide comprehensive understanding of a complex bioinformatics pipeline for researchers without extensive biological background. For questions or clarifications, please consult the cited references or contact the pipeline maintainers.

README.md Unescape Escape

Digital Patient Pipeline: Complete Documentation

Table of Contents

Introduction

Purpose

Pipeline Overview

Pipeline Flow Summary

Biological Background

The Central Dogma: DNA → RNA → Protein

1. DNA (Deoxyribonucleic Acid)

2. RNA (Ribonucleic Acid)

3. Proteins

Gene Structure

Genetic Variants and Their Effects

VCF Format: Storing Genetic Variants

Pipeline Components

1. Synthea: Synthetic Patient Generator

2. Borzoi: RNA-seq Prediction from DNA

3. VCF2Prot: DNA Variants to Protein Sequences

4. RNA2ProteinExpression: RNA to Protein Prediction

5. CORTO: Metabolome Prediction

6. CIBERSORTx: Immune Cell Deconvolution

Workflow Execution

Nextflow: Workflow Orchestration

Complete Workflow

Execution Example

Technical Architecture

Docker Containers

Data Flow Architecture

Computational Requirements

Outputs and Applications

Complete Digital Patient Profile

Example Application: Cancer Research

Output Files Summary

Research Applications

Conclusion

Glossary of Terms

References

Pipeline Components

Biological Background

README.md