digital-patients/documentaion/README.md

# Digital Patient Pipeline: Complete Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Pipeline Overview](#pipeline-overview)
3. [Biological Background](#biological-background)
4. [Pipeline Components](#pipeline-components)
5. [Workflow Execution](#workflow-execution)
6. [Technical Architecture](#technical-architecture)
7. [Outputs and Applications](#outputs-and-applications)

---

## Introduction

This document provides an exhaustive explanation of a **Digital Patient Pipeline** - a sophisticated bioinformatics workflow that generates synthetic patient data and predicts multiple layers of molecular biology from genomic variants. The pipeline is implemented using **Nextflow**, a workflow orchestration system, and integrates multiple cutting-edge computational biology tools to simulate how genetic mutations affect gene expression, protein production, immune cell composition, and metabolic activity.

### Purpose

The Digital Patient Pipeline serves several critical purposes:

- **Synthetic Patient Generation**: Creates realistic but synthetic patient profiles with genetic variants associated with specific diseases
- **Multi-Omic Prediction**: Predicts gene expression (RNA), protein abundance, immune cell composition, and metabolic activity from DNA sequence alone
- **Clinical Research**: Enables researchers to study disease mechanisms without accessing sensitive patient data
- **Personalized Medicine**: Models how individual genetic variants affect molecular phenotypes

---

## Pipeline Overview

```mermaid
flowchart TD
    Start([Start Pipeline]) --> Decision{Patient Type?}
    Decision -->|Disease| Synthea[Synthea: Generate Disease Patients]
    Decision -->|Healthy| Healthy[Synthea: Generate Healthy Patients]

    Synthea --> VCF[VCF Files: Genetic Variants]
    Healthy --> VCF

    VCF --> FilterVCF[Filter VCF: Extract Coding Variants]
    VCF --> VCF2Prot[VCF2Prot: Generate Mutated Proteins]

    FilterVCF --> Borzoi[Borzoi: Predict RNA Expression]

    Borzoi --> RNA2Prot[RNA2Protein: Predict Protein Expression]
    Borzoi --> CORTO[CORTO: Predict Metabolome]
    Borzoi --> CIBERSORTx[CIBERSORTx: Predict Immune Cells]

    RNA2Prot --> Output1[Protein Expression Profiles]
    CORTO --> Output2[Metabolic Activity Profiles]
    CIBERSORTx --> Output3[Immune Cell Composition]
    VCF2Prot --> Output4[Mutated Protein Sequences]

    Output1 --> End([Complete Digital Patient])
    Output2 --> End
    Output3 --> End
    Output4 --> End

    style Start fill:#90EE90
    style End fill:#FFB6C1
    style Borzoi fill:#87CEEB
    style Synthea fill:#DDA0DD
```

### Pipeline Flow Summary

1. **Patient Generation**: Synthea generates synthetic patients with realistic genetic variants
2. **Variant Processing**: VCF files containing genetic mutations are filtered and processed
3. **RNA Expression Prediction**: Borzoi predicts how mutations affect gene expression
4. **Downstream Analysis**: Multiple tools analyze predicted RNA to generate comprehensive molecular profiles
5. **Integration**: Results are combined to create a complete "digital patient"

---

## Biological Background

To understand this pipeline, we need to understand the central dogma of molecular biology and how genetic information flows through biological systems.

### The Central Dogma: DNA → RNA → Protein

```mermaid
flowchart LR
    DNA[DNA: Genetic Code] -->|Transcription| RNA[RNA: Message]
    RNA -->|Translation| Protein[Protein: Function]
    Protein --> Phenotype[Cellular Phenotype]

    style DNA fill:#FFE4B5
    style RNA fill:#E0FFFF
    style Protein fill:#FFE4E1
    style Phenotype fill:#F0E68C
```

#### 1. DNA (Deoxyribonucleic Acid)

**DNA** is the blueprint of life, containing the genetic instructions for all cellular functions. DNA consists of:

- **Four nucleotide bases**: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
- **Double helix structure**: Two complementary strands wound together
- **Genes**: Specific segments of DNA that encode instructions for proteins

**Example DNA Sequence**: `ATGCGATCCGTA`

#### 2. RNA (Ribonucleic Acid)

During **transcription**, DNA is copied into RNA:

- **RNA polymerase** enzyme reads DNA and creates a complementary RNA strand
- RNA uses **Uracil (U)** instead of Thymine (T)
- The RNA carries the genetic message from the nucleus to protein-making machinery

**Example RNA Sequence**: `AUGCGAUCCGUA` (from DNA above)

#### 3. Proteins

During **translation**, RNA is decoded to build proteins:

- **Ribosomes** read RNA in groups of three bases called **codons**
- Each codon specifies one **amino acid**
- Amino acids chain together to form proteins

**Example**: `AUG` → Methionine, `CGA` → Arginine, `UCC` → Serine, `GUA` → Valine

**Protein**: Methionine-Arginine-Serine-Valine

### Gene Structure

Genes in eukaryotes (organisms with nuclei, like humans) have a complex structure:

```mermaid
flowchart LR
    subgraph Gene Structure
        Promoter[Promoter: -1000bp] --> UTR5[5' UTR]
        UTR5 --> Exon1[Exon 1]
        Exon1 --> Intron1[Intron 1]
        Intron1 --> Exon2[Exon 2]
        Exon2 --> Intron2[Intron 2]
        Intron2 --> Exon3[Exon 3]
        Exon3 --> UTR3[3' UTR]
    end

    Exon1 -.->|Splicing| mRNA[Mature mRNA]
    Exon2 -.-> mRNA
    Exon3 -.-> mRNA

    style Exon1 fill:#90EE90
    style Exon2 fill:#90EE90
    style Exon3 fill:#90EE90
    style Intron1 fill:#FFB6C1
    style Intron2 fill:#FFB6C1
    style Promoter fill:#FFD700
```

**Key Components:**

- **Promoter**: Regulatory region upstream of gene (-1000 bp) that controls when the gene is turned on
- **5' UTR (Untranslated Region)**: Beginning of RNA that isn't translated into protein
- **Exons**: Segments that ARE kept in the final RNA and code for protein
- **Introns**: Segments that are REMOVED during RNA processing
- **3' UTR**: End region of RNA that isn't translated

**Why this matters**: The Borzoi model in this pipeline predicts which parts of genes will be transcribed into RNA, including both exons and introns, before they're processed.

### Genetic Variants and Their Effects

```mermaid
flowchart TD
    Variant[Genetic Variant] --> Type{Type?}

    Type -->|SNP| SNP[Single Nucleotide<br/>Polymorphism:<br/>A->G]
    Type -->|Insertion| INS[Insertion:<br/>ATCG->ATCGGG]
    Type -->|Deletion| DEL[Deletion:<br/>ATCG->A]

    SNP --> Effect1{Location?}
    INS --> Effect1
    DEL --> Effect1

    Effect1 -->|Promoter| E1[Changes expression level]
    Effect1 -->|Exon| E2[Changes protein sequence]
    Effect1 -->|Splice site| E3[Changes splicing pattern]
    Effect1 -->|Intron| E4[May affect regulation]

    style Variant fill:#FFB6C1
    style E1 fill:#87CEEB
    style E2 fill:#87CEEB
    style E3 fill:#87CEEB
    style E4 fill:#87CEEB


```

**Variants** are differences in DNA sequence between individuals:

- **SNP (Single Nucleotide Polymorphism)**: Single base change (e.g., A→G)
- **Insertion**: Extra bases added
- **Deletion**: Bases removed
- **Structural Variant**: Large-scale DNA rearrangements

These variants can affect:

- **Gene expression**: How much RNA is made
- **Protein sequence**: Which amino acids are in the protein
- **Splicing**: Which exons are included in mature RNA

### VCF Format: Storing Genetic Variants

The **VCF (Variant Call Format)** is a standardized text file format that stores genetic variants:

```
#CHROM  POS     ID      REF  ALT     QUAL  FILTER  INFO
chr1    12345   rs123   A    G       100   PASS    DP=50
chr2    67890   rs456   TC   T       95    PASS    DP=45
```

**Columns explained:**

- **CHROM**: Chromosome where variant is located (chr1, chr2, etc.)
- **POS**: Position on the chromosome (base pair number)
- **ID**: Database identifier (often from dbSNP database)
- **REF**: Reference base(s) at this position
- **ALT**: Alternate base(s) - the variant
- **QUAL**: Quality score for the variant call
- **FILTER**: Whether variant passed quality filters
- **INFO**: Additional information (e.g., read depth)

---

## Pipeline Components

### 1. Synthea: Synthetic Patient Generator

```mermaid
flowchart LR
    Input[Input Parameters] --> Synthea{Synthea Engine}

    subgraph Input Parameters
        Disease[Disease Type:<br/>schizophrenia, cancer, etc.]
        Demographics[Demographics:<br/>Age, Gender, Location]
        N[Number of Patients]
    end

    Synthea --> UKBB[UK Biobank<br/>Genetic Database]
    UKBB --> Disease_Variants[Disease-Associated<br/>Genetic Variants]

    Disease_Variants --> VCF_Out[VCF Files<br/>Patient_001.vcf<br/>Patient_002.vcf]

    style Synthea fill:#DDA0DD
    style VCF_Out fill:#90EE90
```

**What is Synthea?**

Synthea™ is an open-source synthetic patient generator that creates realistic but completely fake patient data. It models:

- Medical history
- Demographics (age, gender, location)
- Health conditions
- Medications and treatments
- Genetic variants

**How it works in this pipeline:**

1. **User specifies disease**: For example, "schizophrenia" or "healthy"
2. **Statistical analysis**: Synthea analyzes the UK Biobank database (a large collection of genetic data from ~500,000 individuals) to find variants statistically associated with the disease
3. **Probability-based sampling**: Variants are selected based on their frequency in diseased vs. healthy populations
4. **VCF generation**: Creates a VCF file for each synthetic patient containing their unique set of genetic variants

**Parameters:**

```nextflow
params.disease = 'schizophrenia'  // Disease to model
params.n_pat = 10                 // Number of patients to generate
params.percent_male = 0.5         // Gender distribution
```

**For Healthy Patients:**

If generating healthy controls, Synthea samples from pre-computed reference genomes representing the genetic diversity of healthy populations.

**Output Example:**

```
Patient_001_variants.vcf
Patient_002_variants.vcf
...
Patient_010_variants.vcf
```

### 2. Borzoi: RNA-seq Prediction from DNA

```mermaid
flowchart TD
    DNA[DNA Sequence<br/>524,288 bp] --> Borzoi[Borzoi Neural Network]

    subgraph Borzoi Architecture
        Conv1[Convolutional Layers:<br/>Learn local patterns]
        Conv1 --> Attention[Self-Attention Layers:<br/>Learn long-range interactions]
        Attention --> Upsample[Upsampling Layers:<br/>Increase resolution]
        Upsample --> Output_Layer[Output: RNA Coverage<br/>32 bp resolution]
    end

    Borzoi --> Tissues[Predictions for 89 Tissues/Cell Types]

    Tissues --> TPM[TPM Values:<br/>Transcripts Per Million]

    style Borzoi fill:#87CEEB
    style TPM fill:#90EE90
```

**What is Borzoi?**

Borzoi is a deep learning model developed at Calico Life Sciences that predicts **RNA-seq coverage** (how much RNA is produced from each part of the genome) directly from DNA sequence. This is revolutionary because it:

- Predicts gene expression without actually doing wet-lab RNA sequencing
- Accounts for multiple layers of regulation (transcription, splicing, polyadenylation)
- Provides tissue-specific predictions

**How does it work?**

1. **Input**: 524,288 base pairs of DNA sequence (524 kb)
2. **Neural Network Processing**:
   - **Convolutional layers**: Learn local DNA patterns (e.g., transcription factor binding sites)
   - **Self-attention layers**: Learn long-range interactions between regulatory elements
   - **Upsampling layers**: Increase resolution from 128 bp to 32 bp
3. **Output**: RNA coverage at 32 bp resolution for 89 different tissues/cell types

**Key Concept: RNA-seq Coverage**

RNA-seq coverage shows how many RNA molecules were sequenced at each position in the genome:

```
Position:    1000  1100  1200  1300  1400
Exon 1:      ████████████████
Intron:                       ░
Exon 2:                         ████████████
Coverage:    2.5   3.1   0.1   2.8   3.0
```

**TPM (Transcripts Per Million)**

Borzoi outputs are converted to **TPM** values:

```
TPM = (Number of reads mapped to transcript / Transcript length in kb)
      × (1,000,000 / Total reads in sample)
```

**Why TPM?**

- Normalizes for gene length (longer genes generate more reads)
- Normalizes for sequencing depth (accounts for total number of reads)
- Comparable across genes within a sample

**Pipeline Implementation:**

The pipeline has two Borzoi processes:

**Process 1: FILTER_VCF**

```python
# Extract variants in coding regions + 1000 bp upstream regulatory regions
# Create filtered VCF containing only protein-coding variants
```

**Process 2: PREDICT_EXPRESSION**

```python
# For each protein-coding gene with mutations:
#   1. Extract DNA sequence (reference + mutations)
#   2. Run Borzoi to predict RNA coverage
#   3. Calculate TPM by summing coverage over exons
#   4. Generate TPM matrix: Genes × Tissues
```

**Example Output:**

```csv
Gene,Adipose_Tissue,Brain_Cortex,Heart,Liver,Muscle
BRCA1,12.5,8.3,5.2,15.7,7.9
TP53,45.2,52.1,38.9,42.3,35.6
APOE,8.7,125.3,6.1,78.2,5.4
```

This table shows predicted RNA expression (TPM) for each gene in each tissue.

**MANE Dataset**

The pipeline uses the **MANE (Matched Annotation from NCBI and EMBL-EBI)** dataset:

- Contains reference transcript sequences for all human protein-coding genes
- Provides consensus between RefSeq and Ensembl/GENCODE annotations
- Includes exon/intron boundaries needed for TPM calculation

### 3. VCF2Prot: DNA Variants to Protein Sequences

```mermaid
flowchart TD
    VCF[VCF File:<br/>Genetic Variants] --> Annotate[BCFtools CSQ:<br/>Annotate Variants]
    Reference[Reference Genome<br/>GRCh38] --> Annotate
    GFF[Gene Annotations<br/>GFF3 Format] --> Annotate

    Annotate --> Annotated_VCF[Annotated VCF:<br/>Functional Consequences]

    Annotated_VCF --> VCF2Prot[VCF2Prot Tool]
    MANE_Ref[MANE Reference<br/>Transcripts] --> VCF2Prot

    VCF2Prot --> Mutated_Proteins[Mutated Protein<br/>Sequences FASTA]

    style VCF2Prot fill:#FFB6C1
    style Mutated_Proteins fill:#90EE90
```

**What is VCF2Prot?**

VCF2Prot is a tool that translates DNA variants into their effects on protein sequences. It:

- Takes variants from VCF files
- Maps them to gene transcripts
- Predicts how variants change the protein sequence
- Outputs mutated protein sequences

**Process Flow:**

1. **Variant Annotation (BCFtools CSQ)**

   - Maps variants to genes and transcripts
   - Determines functional consequence:
     - Missense: Changes one amino acid
     - Nonsense: Creates premature stop codon
     - Frameshift: Shifts reading frame
     - Synonymous: No change to amino acid

2. **Protein Sequence Prediction (VCF2Prot)**

   - Loads reference protein sequences from MANE
   - Applies variants to generate mutated sequences
   - Handles complex variants (insertions, deletions)

**Example:**

```
Reference DNA:  ATG GCT AAA TGC
Reference RNA:  AUG GCU AAA UGC
Reference Prot: Met-Ala-Lys-Cys

Variant: Position 5, G→T

Mutant DNA:     ATG TCT AAA TGC
Mutant RNA:     AUG UCU AAA UGC
Mutant Prot:    Met-Ser-Lys-Cys
                    ^^^
                Changed amino acid!
```

**Output Format: FASTA**

```
>Patient_001_ENST00000357654_BRCA1_p.G1738R
MSLQSQLFKQRQYLSIKTKRSTKEVLDATLIHQSITGLYETRIDLSQLGGD...
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIE...
```

Each sequence shows:

- Patient ID
- Transcript ID
- Gene name
- Protein variant notation
- Full mutated protein sequence

### 4. RNA2ProteinExpression: RNA to Protein Prediction

```mermaid
flowchart TD
    RNA_TPM[RNA TPM Values<br/>from Borzoi] --> Model[Deep Learning Model]

    subgraph Neural Network
        Input[Input Layer:<br/>RNA Expression]
        Hidden1[Hidden Layer 1:<br/>Tissue Context]
        Hidden2[Hidden Layer 2:<br/>Translation Efficiency]
        Output[Output Layer:<br/>Protein Abundance]

        Input --> Hidden1
        Hidden1 --> Hidden2
        Hidden2 --> Output
    end

    Model --> Protein_Expr[Protein Expression<br/>Log2 Scale]

    style Model fill:#FFE4E1
    style Protein_Expr fill:#90EE90
```

**What is RNA2ProteinExpression?**

This is a custom deep learning model that predicts protein abundance from RNA expression levels. It's trained on:

- RNA-seq data (transcript levels)
- Mass spectrometry data (protein levels)
- Gene Ontology (GO) annotations

**Why is this needed?**

RNA levels don't perfectly correlate with protein levels because:

- **Translation efficiency** varies between genes
- **Protein stability** varies (some proteins are rapidly degraded)
- **Post-transcriptional regulation** (microRNAs, RNA-binding proteins)

Typical RNA-protein correlation: **r = 0.4-0.6** (not 1.0!)

**Model Architecture:**

The neural network learns:

- Which genes have high vs. low translation efficiency
- Tissue-specific effects on protein production
- GO term enrichments that affect protein stability

**Input:**

```
Gene_ID, Tissue, RNA_TPM
ENSG00000012048, Brain_Cortex, 125.3
ENSG00000012048, Liver, 78.2
```

**Output:**

```
Gene_ID, Tissue, Protein_Expression_log2
ENSG00000012048, Brain_Cortex, 8.5
ENSG00000012048, Liver, 7.2
```

**Log2 scale**: Protein expression in log2 transformed units (easier to interpret fold-changes)

### 5. CORTO: Metabolome Prediction

```mermaid
flowchart TD
    RNA_TPM[RNA TPM Matrix] --> CORTO[CORTO Algorithm]
    Regulon[Regulon Data:<br/>TF-Gene Relationships] --> CORTO

    subgraph CORTO Algorithm
        Correlation[1. Calculate Correlations:<br/>TF <-> Metabolic Genes]
        DPI[2. Data Processing Inequality:<br/>Remove Indirect Edges]
        Bootstrap[3. Bootstrap:<br/>Assess Robustness]
        MRA[4. Master Regulator Analysis:<br/>Identify Key TFs]

        Correlation --> DPI
        DPI --> Bootstrap
        Bootstrap --> MRA
    end

    CORTO --> Metabolome[Metabolome Predictions:<br/>Metabolic Activity]

    style CORTO fill:#F0E68C
    style Metabolome fill:#90EE90
```

**What is CORTO?**

CORTO (Correlation Tool) is an R package that infers gene regulatory networks and identifies master regulators controlling metabolic activity. It predicts:

- Activity of metabolic pathways
- Transcription factors (TFs) controlling metabolism
- Metabolite production levels

**How it works:**

1. **Input Regulon**: Pre-defined relationships between transcription factors (TFs) and their target genes (including metabolic enzymes)

2. **Correlation Analysis**: Calculate how TF expression correlates with target gene expression

3. **Data Processing Inequality (DPI)**: Remove indirect relationships

   - If TF1 → TF2 → Gene, remove direct TF1 → Gene edge
   - Keeps only direct regulatory relationships

4. **Bootstrap**: Test robustness by resampling data

5. **Master Regulator Analysis (MRA)**: Identify TFs whose target genes are significantly enriched in metabolic pathways

**Example:**

```
TF: PPARG (master regulator of fat metabolism)
Target Genes: FABP4, LPL, ADIPOQ, CD36, SCD (all involved in lipid metabolism)
Metabolome Prediction: High lipid synthesis activity
```

**Output:**

```csv
Pathway,Activity_Score,P_value
Glycolysis,-1.5,0.001
TCA_Cycle,2.3,0.0001
Fatty_Acid_Synthesis,1.8,0.002
```

- **Activity Score**: Positive = pathway activated, Negative = pathway suppressed
- **P-value**: Statistical significance

### 6. CIBERSORTx: Immune Cell Deconvolution

```mermaid
flowchart TD
    RNA_TPM[Bulk RNA TPM<br/>Mixed Cell Types] --> Signature[Signature Matrix:<br/>Cell-Specific Genes]

    subgraph CIBERSORTx
        Fractions[Step 1: Fractions<br/>Estimate Cell Proportions]
        HiRes[Step 2: HiRes<br/>Cell-Specific Expression]
    end

    RNA_TPM --> Fractions
    Signature --> Fractions

    Fractions --> Proportions[Cell Type Proportions]

    Proportions --> HiRes
    RNA_TPM --> HiRes
    Signature --> HiRes

    HiRes --> Cell_Specific[Cell-Type-Specific<br/>Gene Expression]

    style CIBERSORTx fill:#DDA0DD
    style Proportions fill:#90EE90
    style Cell_Specific fill:#90EE90
```

**What is CIBERSORTx?**

CIBERSORTx is a computational tool for **immune cell deconvolution**. When you sequence RNA from a tissue sample, you get a mixture of RNA from all cells in that tissue. CIBERSORTx:

- Estimates what proportion of cells are each immune cell type
- Infers cell-type-specific gene expression profiles

**Why is this important?**

Immune cells play crucial roles in:

- Fighting infections
- Cancer immunotherapy response
- Autoimmune diseases
- Inflammation

Understanding immune composition helps interpret disease mechanisms.

**How it works:**

**Step 1: Signature Matrix**

A reference matrix showing genes specifically expressed in each cell type:

```
Gene       T_cells  B_cells  Macrophages  NK_cells
CD3D       HIGH     low      low          low
CD19       low      HIGH     low          low
CD68       low      low      HIGH         low
NKG7       low      low      low          HIGH
```

**Step 2: CIBERSORTx Fractions**

Uses **Support Vector Regression (SVR)** to solve:

```
Bulk_Expression = Σ (Proportion_i × Signature_i)
```

Where:

- Bulk_Expression = measured RNA in tissue
- Proportion_i = fraction of cell type i
- Signature_i = expression pattern of cell type i

**Step 3: CIBERSORTx HiRes**

After knowing proportions, infer gene expression within each cell type by:

- Modeling tissue expression as weighted sum of cell-type contributions
- Deconvolving to separate cell-type-specific signals

**Example Output:**

**Fractions:**

```csv
Sample,CD8_T_cells,CD4_T_cells,B_cells,NK_cells,Monocytes
Patient_001_Brain,0.05,0.08,0.02,0.01,0.15
Patient_001_Liver,0.12,0.15,0.08,0.03,0.22
```

**HiRes:**

```csv
Tissue,Cell_Type,CD3D,CD19,CD68
Brain_Patient_001,CD8_T_cells,HIGH,low,low
Brain_Patient_001,B_cells,low,HIGH,low
```

**Pipeline Implementation:**

1. **CONVERT_TO_TXT**: Convert CSV to tab-delimited format (CIBERSORTx input format)

2. **CIBERSORTx_FRACTIONS**: Estimate cell proportions

3. **CIBERSORTx_HIRES**: Infer cell-specific expression

4. **ADD_TISSUE_NAMES**: Add tissue annotations to output

---

## Workflow Execution

### Nextflow: Workflow Orchestration

```mermaid
flowchart TD
    Config[nextflow.config:<br/>Configuration] --> NF[Nextflow Engine]
    Params[params.json:<br/>Parameters] --> NF

    subgraph Nextflow Engine
        Parse[Parse Workflow DSL]
        Schedule[Schedule Processes]
        Execute[Execute in Docker/Singularity]
        Monitor[Monitor & Checkpoint]

        Parse --> Schedule
        Schedule --> Execute
        Execute --> Monitor
    end

    NF --> Channels[Data Channels:<br/>Pass Files Between Processes]
    Channels --> Processes[Execute Processes]

    style NF fill:#87CEEB
```

**What is Nextflow?**

Nextflow is a workflow orchestration system specifically designed for data-intensive computational pipelines. It:

- Manages dependencies between analysis steps
- Handles parallel execution
- Provides automatic checkpointing (resume failed runs)
- Supports multiple execution platforms (local, HPC clusters, cloud)

**Key Concepts:**

1. **Processes**: Individual computational tasks (e.g., "PREDICT_EXPRESSION")
2. **Channels**: Data streams that connect processes
3. **Operators**: Manipulate channels (e.g., `mix`, `flatten`, `collect`)

**Example Process Definition:**

```nextflow
process PREDICT_EXPRESSION {
    container "${params.container_borzoi}"  // Docker image
    memory 4.GB                              // Memory requirement
    accelerator 1                            // GPU requirement

    input:
        path vcf_filtered                    // Input file
        path MANE                            // Reference data

    output:
        path "*_TPM.csv"                     // Output file pattern

    script:
    """
    #!/opt/conda/envs/borzoi/bin/python
    # Python script here
    """
}
```

**Channel Example:**

```nextflow
// Mix male and female patient VCFs
txt_ch = f_var.mix(m_var).flatten()

// This creates a channel with all VCF files:
// [Patient_001.vcf, Patient_002.vcf, ...]
```

### Complete Workflow

```mermaid
flowchart TD
    Start([Start]) --> CheckDisease{Disease or Healthy?}

    CheckDisease -->|Disease| GetStats[get_disease_stats_no_patients:<br/>Analyze UK Biobank]
    CheckDisease -->|Healthy| LoadHealthy[Load Pre-computed<br/>Healthy Genomes]

    GetStats --> GenM[generate_m_variants_cudf:<br/>Male Patients]
    GetStats --> GenF[generate_f_variants_cudf:<br/>Female Patients]

    LoadHealthy --> LoadM[Load Male<br/>Reference]
    LoadHealthy --> LoadF[Load Female<br/>Reference]

    GenM --> MakeVCF[make_vcfs:<br/>Generate VCF Files]
    GenF --> MakeVCF
    LoadM --> MakeVCF
    LoadF --> MakeVCF

    MakeVCF --> FilterVCF[FILTER_VCF:<br/>Extract Coding Variants]
    MakeVCF --> VCF2Prot[VCF2PROT:<br/>Generate Mutated Proteins]

    FilterVCF --> PredictExpr[PREDICT_EXPRESSION:<br/>Borzoi RNA Prediction]

    PredictExpr --> RNA2Prot[RNA2PROTEXPRESSION:<br/>Protein Prediction]
    PredictExpr --> CORTO[CORTO:<br/>Metabolome Prediction]
    PredictExpr --> Convert[CONVERT_TO_TXT:<br/>Format Conversion]

    Convert --> CiberFrac[CIBERSORTx_FRACTIONS:<br/>Cell Proportions]
    CiberFrac --> CiberHires[CIBERSORTx_HIRES:<br/>Cell-Specific Expression]
    CiberHires --> AddTissue[ADD_TISSUE_NAMES_TO_CIBERSORTX:<br/>Annotate Results]

    RNA2Prot --> End([Complete<br/>Digital Patient])
    CORTO --> End
    AddTissue --> End
    VCF2Prot --> End

    style Start fill:#90EE90
    style End fill:#FFB6C1
    style PredictExpr fill:#87CEEB
```

### Execution Example

**1. Configuration (params.json)**

```json
{
  "disease": "schizophrenia",
  "n_pat": 10,
  "percent_male": 0.5,
  "container_borzoi": "harbor.cluster.omic.ai/omic/digital-patients/borzoi:latest"
}
```

**2. Launch Pipeline**

```bash
nextflow run test.nf -params-file params.json
```

**3. Nextflow Execution**

```
N E X T F L O W  ~  version 21.04.0
Launching `test.nf` [amazing_babbage] - revision: 1a2b3c4d

[Synthea] Submitted process > get_disease_stats_no_patients
[Synthea] Submitted process > generate_m_variants_cudf (1)
[Synthea] Submitted process > generate_f_variants_cudf (1)
[Stage] Completed process > make_vcfs (10 files)
[Borzoi] Submitted process > FILTER_VCF (10)
[Borzoi] Submitted process > PREDICT_EXPRESSION (10)
...
Pipeline completed successfully!
```

**4. Directory Structure**

```
/outdir/
├── vcf2prot/
│   ├── Patient_001_transcript_id_mutations.fasta
│   └── Patient_002_transcript_id_mutations.fasta
├── borzoi/
│   ├── Patient_001_TPM.csv
│   └── Patient_002_TPM.csv
├── rna2protein/
│   ├── Patient_001_Protein_Expression_log2.csv
│   └── Patient_002_Protein_Expression_log2.csv
├── corto/
│   ├── Patient_001_metabolome.csv
│   └── Patient_002_metabolome.csv
└── ecotyper/
    ├── fractions/
    │   └── Patient_001_CIBERSORTx_Results.txt
    └── hires/
        └── Patient_001_immune_cells.csv
```

---

## Technical Architecture

### Docker Containers

Each pipeline component runs in an isolated Docker container with specific dependencies:

```mermaid
flowchart LR
    subgraph Docker Images
        Synthea_Img[Synthea Container:<br/>- Java JDK<br/>- Python<br/>- BCFtools<br/>- GATK]

        Borzoi_Img[Borzoi Container:<br/>- TensorFlow<br/>- PyTorch<br/>- Baskerville<br/>- Python packages]

        VCF2Prot_Img[VCF2Prot Container:<br/>- BCFtools<br/>- vcf2prot binary<br/>- Reference genomes]

        RNA2Prot_Img[RNA2Protein Container:<br/>- PyTorch<br/>- Deep learning model<br/>- GO annotations]

        CORTO_Img[CORTO Container:<br/>- R<br/>- corto package<br/>- Regulon data]

        CIBERSORTx_Img[CIBERSORTx Container:<br/>- Python<br/>- R<br/>- CIBERSORTx binaries<br/>- Signature matrices]
    end

    Registry[Container Registry:<br/>harbor.cluster.omic.ai] --> Synthea_Img
    Registry --> Borzoi_Img
    Registry --> VCF2Prot_Img
    Registry --> RNA2Prot_Img
    Registry --> CORTO_Img
    Registry --> CIBERSORTx_Img
```

**Why Docker?**

- **Reproducibility**: Same environment every run
- **Isolation**: Avoid dependency conflicts
- **Portability**: Run anywhere (laptop, cluster, cloud)

**Example Dockerfile (Borzoi):**

```dockerfile
FROM tensorflow/tensorflow:2.12.0-gpu

# Install conda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

# Create borzoi environment
RUN conda create -n borzoi python=3.9
RUN conda install -n borzoi tensorflow-gpu pandas numpy pysam

# Install baskerville (Borzoi's framework)
RUN git clone https://github.com/calico/borzoi.git /home/omic/borzoi
RUN pip install -e /home/omic/borzoi

# Download pre-trained models
RUN mkdir -p /home/omic/borzoi/saved_models
RUN wget <model_url> -O /home/omic/borzoi/saved_models/f0/model0_best.h5

# Set entrypoint
CMD ["/bin/bash"]
```

### Data Flow Architecture

```mermaid
flowchart TD
    subgraph Input Data
        UKBB[UK Biobank:<br/>Genetic Variants]
        MANE[MANE Transcripts:<br/>Reference Sequences]
        RefGenome[Reference Genome:<br/>GRCh38]
        Regulon[Regulon Database:<br/>TF-Gene Networks]
        LM22[LM22 Signature Matrix:<br/>Immune Cell Markers]
    end

    subgraph Processing
        Synthea --> |VCF| Filter
        Filter --> |Filtered VCF| Borzoi
        Filter --> |Filtered VCF| VCF2Prot
        Borzoi --> |TPM CSV| RNA2Prot
        Borzoi --> |TPM CSV| CORTO
        Borzoi --> |TPM CSV| CIBERSORTx
    end

    UKBB --> Synthea
    MANE --> Borzoi
    MANE --> VCF2Prot
    RefGenome --> VCF2Prot
    Regulon --> CORTO
    LM22 --> CIBERSORTx

    subgraph Output Data
        RNA2Prot --> |Protein Expression| Results
        CORTO --> |Metabolome| Results
        CIBERSORTx --> |Immune Cells| Results
        VCF2Prot --> |Mutated Proteins| Results
    end
```

### Computational Requirements

| Process            | CPU     | RAM  | GPU      | Time (per patient) |
| ------------------ | ------- | ---- | -------- | ------------------ |
| Synthea            | 2 cores | 4 GB | No       | ~5 min             |
| FILTER_VCF         | 4 cores | 4 GB | No       | ~2 min             |
| PREDICT_EXPRESSION | 8 cores | 4 GB | Yes (1x) | ~30-60 min         |
| VCF2PROT           | 2 cores | 2 GB | No       | ~10 min            |
| RNA2PROTEXPRESSION | 4 cores | 2 GB | Yes (1x) | ~5 min             |
| CORTO              | 2 cores | 1 GB | No       | ~3 min             |
| CIBERSORTx         | 4 cores | 4 GB | No       | ~15 min            |

**Total time per patient: ~70-90 minutes**

**Parallelization:**

Nextflow automatically parallelizes patient processing:

- 10 patients with 4 GPUs → ~20-25 minutes total
- Patients are processed independently

---

## Outputs and Applications

### Complete Digital Patient Profile

```mermaid
flowchart LR
    Patient[Digital Patient:<br/>Patient_001] --> Genome[Genomic Profile]
    Patient --> Transcriptome[Transcriptomic Profile]
    Patient --> Proteome[Proteomic Profile]
    Patient --> Metabolome[Metabolomic Profile]
    Patient --> Immune[Immune Profile]

    Genome --> G1[Genetic Variants:<br/>SNPs, Indels]
    Genome --> G2[Disease-Associated<br/>Mutations]

    Transcriptome --> T1[RNA Expression:<br/>20,000+ genes]
    Transcriptome --> T2[Tissue-Specific<br/>Expression]

    Proteome --> P1[Protein Abundance:<br/>10,000+ proteins]
    Proteome --> P2[Mutated Protein<br/>Sequences]

    Metabolome --> M1[Pathway Activity:<br/>Metabolism]
    Metabolome --> M2[Master Regulator<br/>TFs]

    Immune --> I1[Cell Composition:<br/>T-cells, B-cells, etc.]
    Immune --> I2[Cell-Specific<br/>Expression]

    style Patient fill:#FFD700
```

### Example Application: Cancer Research

```mermaid
flowchart TD
    Question[Research Question:<br/>How do BRCA1 mutations<br/>affect breast cancer?]

    Question --> Generate[Generate 100 synthetic<br/>patients with BRCA1<br/>mutations]

    Generate --> Analyze{Analyze Digital Patients}

    Analyze --> RNA[RNA Analysis:<br/>Find genes<br/>co-expressed with BRCA1]
    Analyze --> Protein[Protein Analysis:<br/>Identify altered<br/>pathways]
    Analyze --> Metabolome[Metabolome Analysis:<br/>Detect metabolic<br/>shifts]
    Analyze --> Immune[Immune Analysis:<br/>Characterize immune<br/>infiltration]

    RNA --> Insight[Insights into<br/>Disease Mechanisms]
    Protein --> Insight
    Metabolome --> Insight
    Immune --> Insight

    Insight --> Drug[Drug Target<br/>Discovery]
    Insight --> Biomarker[Biomarker<br/>Identification]

    style Question fill:#FFB6C1
    style Insight fill:#90EE90
```

### Output Files Summary

**1. Genetic Variants (VCF)**

- **File**: `Patient_001_variants.vcf`
- **Content**: All genetic variants for patient
- **Use**: Understand genetic basis of disease

**2. RNA Expression (TPM)**

- **File**: `Patient_001_TPM.csv`
- **Content**: Gene expression across 89 tissues
- **Use**: Identify dysregulated genes, tissue-specific effects

**3. Protein Expression**

- **File**: `Patient_001_Protein_Expression_log2.csv`
- **Content**: Predicted protein abundance
- **Use**: Understand functional consequences of RNA changes

**4. Mutated Proteins (FASTA)**

- **File**: `Patient_001_transcript_id_mutations.fasta`
- **Content**: Protein sequences with mutations
- **Use**: Study structural changes, predict drug binding

**5. Metabolome**

- **File**: `Patient_001_metabolome.csv`
- **Content**: Pathway activity scores
- **Use**: Understand metabolic reprogramming

**6. Immune Cells**

- **File**: `Patient_001_immune_cells.csv`
- **Content**: Cell type proportions and expression
- **Use**: Characterize immune microenvironment

### Research Applications

1. **Disease Mechanism Discovery**

   - Generate patients with specific mutations
   - Compare to healthy controls
   - Identify molecular changes caused by mutations

2. **Drug Target Identification**

   - Find genes/proteins consistently altered across patients
   - Prioritize targets for therapeutic intervention

3. **Biomarker Discovery**

   - Identify molecular signatures distinguishing diseased from healthy
   - Develop diagnostic tests

4. **Precision Medicine**

   - Model individual patient molecular profiles
   - Predict treatment response
   - Personalize therapy

5. **Clinical Trial Simulation**

   - Generate virtual patient cohorts
   - Test hypotheses before expensive trials
   - Power calculations and study design

6. **Education and Training**

   - Teach students about multi-omics analysis
   - No patient privacy concerns
   - Unlimited data generation

---

## Conclusion

This Digital Patient Pipeline represents a cutting-edge integration of:

- **Synthetic biology**: Realistic patient simulation
- **Deep learning**: RNA expression prediction from DNA
- **Bioinformatics**: Multi-omic data integration
- **Workflow engineering**: Scalable, reproducible pipelines

By combining these technologies, researchers can:

- Study disease mechanisms without accessing sensitive patient data
- Generate unlimited data for hypothesis testing
- Model personalized molecular phenotypes
- Accelerate drug discovery and precision medicine

The pipeline produces comprehensive molecular profiles spanning:

- **Genomics** (DNA variants)
- **Transcriptomics** (RNA expression)
- **Proteomics** (protein abundance)
- **Metabolomics** (metabolic activity)
- **Immunomics** (immune cell composition)

This multi-omic integration provides unprecedented insight into how genetic variants cascade through biological systems to produce disease phenotypes.

---

## Glossary of Terms

**Bioinformatics Terms:**

- **TPM**: Transcripts Per Million - normalized measure of RNA abundance
- **VCF**: Variant Call Format - standard format for genetic variants
- **FASTA**: Text format for representing DNA/protein sequences
- **SNP**: Single Nucleotide Polymorphism - single base DNA variant
- **Indel**: Insertion or Deletion - adding or removing DNA bases
- **Exon**: Protein-coding segment of gene
- **Intron**: Non-coding segment removed during RNA processing
- **Transcription**: DNA → RNA
- **Translation**: RNA → Protein
- **Gene Expression**: Process of making protein from gene

**Machine Learning Terms:**

- **Neural Network**: Computer model inspired by brain structure
- **Convolutional Layer**: Detects local patterns in sequences
- **Attention Layer**: Learns long-range relationships
- **Training**: Teaching model from example data
- **Prediction**: Using model on new data

**Pipeline Terms:**

- **Process**: Individual computational step
- **Channel**: Data stream connecting processes
- **Container**: Isolated environment with dependencies
- **Workflow**: Series of connected processes
- **Nextflow**: Workflow orchestration system

**Biological Terms:**

- **Genome**: Complete set of DNA in organism
- **Transcriptome**: Complete set of RNA molecules
- **Proteome**: Complete set of proteins
- **Metabolome**: Complete set of metabolites
- **Phenotype**: Observable characteristics of organism

---

## References

### Pipeline Components

1. **Nextflow**: Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. *Nature Biotechnology*, 35:316-319.

2. **Synthea**: Walone et al. (2017). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. *JAMIA*, 25(3):230-238.

3. **Borzoi**: Linder et al. (2023). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. *Nature Genetics*, 56:164-173.

4. **CIBERSORTx**: Newman et al. (2019). Determining cell-type abundance and expression from bulk tissues with digital cytometry. *Nature Biotechnology*, 37:773-782.

5. **CORTO**: Mercatelli et al. (2020). corto: a lightweight R package for gene network inference and master regulator analysis. *Bioinformatics*, 36(12):3916-3917.

### Biological Background

6. **Gene Structure**: Lim et al. (2018). The exon-intron gene structure upstream of the initiation codon. *Nucleic Acids Research*, 46(5):2232-2244.

7. **TPM Normalization**: Zhao et al. (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. *RNA*, 26(8):903-909.

8. **VCF Format**: Danecek et al. (2011). The variant call format and VCFtools. *Bioinformatics*, 27(15):2156-2158.

9. **MANE**: Morales et al. (2022). A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. *Nature*, 604:310-315.

---

*This documentation was created to provide comprehensive understanding of a complex bioinformatics pipeline for researchers without extensive biological background. For questions or clarifications, please consult the cited references or contact the pipeline maintainers.*