# Digital Patient Pipeline: Complete Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Pipeline Overview](#pipeline-overview)
3. [Biological Background](#biological-background)
4. [Pipeline Components](#pipeline-components)
5. [Workflow Execution](#workflow-execution)
6. [Technical Architecture](#technical-architecture)
7. [Outputs and Applications](#outputs-and-applications)
8. [Conclusion](#conclusion)
9. [Glossary of Terms](#glossary-of-terms)
10. [References](#references)

---

## Introduction

This document explains the **Digital Patient Pipeline**, a bioinformatics workflow that generates synthetic patient data and predicts multiple layers of molecular biology from genomic variants. The pipeline is implemented in **Nextflow**, a workflow orchestration system, and integrates several computational biology tools to simulate how genetic mutations affect gene expression, protein production, immune cell composition, and metabolic activity.

### Purpose

The Digital Patient Pipeline serves several purposes:

- **Synthetic Patient Generation**: Creates realistic but synthetic patient profiles with genetic variants associated with specific diseases
- **Multi-Omic Prediction**: Predicts gene expression (RNA), protein abundance, immune cell composition, and metabolic activity from DNA sequence alone
- **Clinical Research**: Enables researchers to study disease mechanisms without accessing sensitive patient data
- **Personalized Medicine**: Models how individual genetic variants affect molecular phenotypes

---

## Pipeline Overview

```mermaid
flowchart TD
    Start([Start Pipeline]) --> Decision{Patient Type?}
    Decision -->|Disease| Synthea[Synthea: Generate Disease Patients]
    Decision -->|Healthy| Healthy[Synthea: Generate Healthy Patients]
    Synthea --> VCF[VCF Files: Genetic Variants]
    Healthy --> VCF
    VCF --> FilterVCF[Filter VCF: Extract Coding Variants]
    VCF --> VCF2Prot[VCF2Prot: Generate Mutated Proteins]
    FilterVCF --> Borzoi[Borzoi: Predict RNA Expression]
    Borzoi --> RNA2Prot[RNA2Protein: Predict Protein Expression]
    Borzoi --> CORTO[CORTO: Predict Metabolome]
    Borzoi --> CIBERSORTx[CIBERSORTx: Predict Immune Cells]
    RNA2Prot --> Output1[Protein Expression Profiles]
    CORTO --> Output2[Metabolic Activity Profiles]
    CIBERSORTx --> Output3[Immune Cell Composition]
    VCF2Prot --> Output4[Mutated Protein Sequences]
    Output1 --> End([Complete Digital Patient])
    Output2 --> End
    Output3 --> End
    Output4 --> End
    style Start fill:#90EE90
    style End fill:#FFB6C1
    style Borzoi fill:#87CEEB
    style Synthea fill:#DDA0DD
```

### Pipeline Flow Summary

1. **Patient Generation**: Synthea generates synthetic patients with realistic genetic variants
2. **Variant Processing**: VCF files containing genetic mutations are filtered and processed
3. **RNA Expression Prediction**: Borzoi predicts how mutations affect gene expression
4. **Downstream Analysis**: Multiple tools analyze the predicted RNA to generate molecular profiles
5. **Integration**: Results are combined to create a complete "digital patient"

---

## Biological Background

To understand this pipeline, we need the central dogma of molecular biology and a picture of how genetic information flows through biological systems.

### The Central Dogma: DNA → RNA → Protein

```mermaid
flowchart LR
    DNA[DNA: Genetic Code] -->|Transcription| RNA[RNA: Message]
    RNA -->|Translation| Protein[Protein: Function]
    Protein --> Phenotype[Cellular Phenotype]
    style DNA fill:#FFE4B5
    style RNA fill:#E0FFFF
    style Protein fill:#FFE4E1
    style Phenotype fill:#F0E68C
```

#### 1. DNA (Deoxyribonucleic Acid)

**DNA** is the blueprint of life, containing the genetic instructions for all cellular functions. DNA consists of:

- **Four nucleotide bases**: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
- **Double helix structure**: Two complementary strands wound together
- **Genes**: Specific segments of DNA that encode instructions for proteins

**Example DNA Sequence**: `ATGCGATCCGTA`

#### 2. RNA (Ribonucleic Acid)

During **transcription**, DNA is copied into RNA:

- The **RNA polymerase** enzyme reads the DNA template strand and builds a complementary RNA strand
- RNA uses **Uracil (U)** instead of Thymine (T)
- The RNA carries the genetic message from the nucleus to the protein-making machinery

**Example RNA Sequence**: `AUGCGAUCCGUA` (the DNA coding strand above with T replaced by U)

#### 3. Proteins

During **translation**, RNA is decoded to build proteins:

- **Ribosomes** read RNA in groups of three bases called **codons**
- Each codon specifies one **amino acid**
- Amino acids chain together to form proteins

**Example**: `AUG` → Methionine, `CGA` → Arginine, `UCC` → Serine, `GUA` → Valine

**Protein**: Methionine-Arginine-Serine-Valine

### Gene Structure

Genes in eukaryotes (organisms with nuclei, like humans) have a complex structure:

```mermaid
flowchart LR
    subgraph Gene Structure
        Promoter[Promoter: -1000bp] --> UTR5[5' UTR]
        UTR5 --> Exon1[Exon 1]
        Exon1 --> Intron1[Intron 1]
        Intron1 --> Exon2[Exon 2]
        Exon2 --> Intron2[Intron 2]
        Intron2 --> Exon3[Exon 3]
        Exon3 --> UTR3[3' UTR]
    end
    Exon1 -.->|Splicing| mRNA[Mature mRNA]
    Exon2 -.-> mRNA
    Exon3 -.-> mRNA
    style Exon1 fill:#90EE90
    style Exon2 fill:#90EE90
    style Exon3 fill:#90EE90
    style Intron1 fill:#FFB6C1
    style Intron2 fill:#FFB6C1
    style Promoter fill:#FFD700
```

**Key Components:**

- **Promoter**: Regulatory region upstream of the gene (here, the 1000 bp before it) that controls when the gene is turned on
- **5' UTR (Untranslated Region)**: Beginning of the RNA that isn't translated into protein
- **Exons**: Segments that ARE kept in the final RNA and code for protein
- **Introns**: Segments that are REMOVED during RNA processing
- **3' UTR**: End region of the RNA that isn't translated

**Why this matters**: The Borzoi model in this pipeline predicts which parts of genes will be transcribed into RNA, including both exons and introns, before they're processed.
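
The transcription, splicing, and translation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the pipeline: the function names, the two-exon toy gene, and the 6-base intron are assumptions made for the example, and the codon table covers only the codons that appear in it.

```python
# Minimal sketch of transcription, splicing, and translation (illustrative only).

# Partial codon table: just the codons used in the example, plus stop codons.
CODON_TABLE = {
    "AUG": "Met", "CGA": "Arg", "UCC": "Ser", "GUA": "Val",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_dna: str) -> str:
    """Return the RNA equivalent of a DNA coding strand (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) and drop everything else."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> list[str]:
    """Read codons from position 0 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


# Hypothetical toy gene: two exons separated by a made-up 6-base intron.
gene = "ATGCGA" + "GTAAAG" + "TCCGTA"
exons = [(0, 6), (12, 18)]

mature_mrna = splice(transcribe(gene), exons)
protein = translate(mature_mrna)
print(mature_mrna)        # AUGCGAUCCGUA
print("-".join(protein))  # Met-Arg-Ser-Val
```

Removing the intron reproduces the example RNA sequence from the previous section, which then translates to Methionine-Arginine-Serine-Valine. A real implementation would use a full 64-codon table and exon coordinates taken from a gene annotation.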
### Genetic Variants and Their Effects

```mermaid
flowchart TD
    Variant[Genetic Variant] --> Type{Type?}
    Type -->|SNP| SNP[Single Nucleotide<br/>Polymorphism:<br/>A->G]
    Type -->|Insertion| INS[Insertion:<br/>ATCG->ATCGGG]
    Type -->|Deletion| DEL[Deletion:<br/>ATCG->A]
    SNP --> Effect1{Location?}
    INS --> Effect1
    DEL --> Effect1
    Effect1 -->|Promoter| E1[Changes expression level]
    Effect1 -->|Exon| E2[Changes protein sequence]
    Effect1 -->|Splice site| E3[Changes splicing pattern]
    Effect1 -->|Intron| E4[May affect regulation]
    style Variant fill:#FFB6C1
    style E1 fill:#87CEEB
    style E2 fill:#87CEEB
    style E3 fill:#87CEEB
    style E4 fill:#87CEEB
```

**Variants** are differences in DNA sequence between individuals:

- **SNP (Single Nucleotide Polymorphism)**: Single base change (e.g., A→G)
- **Insertion**: Extra bases added
- **Deletion**: Bases removed
- **Structural Variant**: Large-scale DNA rearrangements

These variants can affect:

- **Gene expression**: How much RNA is made
- **Protein sequence**: Which amino acids are in the protein
- **Splicing**: Which exons are included in mature RNA

### VCF Format: Storing Genetic Variants

The **VCF (Variant Call Format)** is a standardized, tab-delimited text format that stores genetic variants:

```
#CHROM  POS    ID     REF  ALT  QUAL  FILTER  INFO
chr1    12345  rs123  A    G    100   PASS    DP=50
chr2    67890  rs456  TC   T    95    PASS    DP=45
```

**Columns explained:**

- **CHROM**: Chromosome where the variant is located (chr1, chr2, etc.)
- **POS**: Position on the chromosome (base pair number)
- **ID**: Database identifier (often from the dbSNP database)
- **REF**: Reference base(s) at this position
- **ALT**: Alternate base(s), i.e. the variant
- **QUAL**: Quality score for the variant call
- **FILTER**: Whether the variant passed quality filters
- **INFO**: Additional information (e.g., read depth)

---

## Pipeline Components

### 1. Synthea: Synthetic Patient Generator

```mermaid
flowchart LR
    Input[Input Parameters] --> Synthea{Synthea Engine}
    subgraph Input Parameters
        Disease[Disease Type:<br/>schizophrenia, cancer, etc.]
        Demographics[Demographics:<br/>Age, Gender, Location]
        N[Number of Patients]
    end
    Synthea --> UKBB[UK Biobank<br/>Genetic Database]
    UKBB --> Disease_Variants[Disease-Associated<br/>Genetic Variants]
    Disease_Variants --> VCF_Out[VCF Files<br/>Patient_001.vcf<br/>Patient_002.vcf]
    style Synthea fill:#DDA0DD
    style VCF_Out fill:#90EE90
```

**What is Synthea?**

Synthea™ is an open-source synthetic patient generator that creates realistic but completely fake patient data. In this pipeline it models:

- Medical history
- Demographics (age, gender, location)
- Health conditions
- Medications and treatments
- Genetic variants

**How it works in this pipeline:**

1. **User specifies disease**: For example, "schizophrenia" or "healthy"
2. **Statistical analysis**: The Synthea stage analyzes the UK Biobank database (a large collection of genetic data from ~500,000 individuals) to find variants statistically associated with the disease
3. **Probability-based sampling**: Variants are selected based on their frequency in diseased vs. healthy populations
4. **VCF generation**: A VCF file is created for each synthetic patient, containing that patient's unique set of genetic variants

**Parameters:**

```nextflow
params.disease      = 'schizophrenia'  // Disease to model
params.n_pat        = 10               // Number of patients to generate
params.percent_male = 0.5              // Gender distribution
```

**For Healthy Patients:**

If generating healthy controls, Synthea samples from pre-computed reference genomes representing the genetic diversity of healthy populations.

**Output Example:**

```
Patient_001_variants.vcf
Patient_002_variants.vcf
...
Patient_010_variants.vcf
```

### 2. Borzoi: RNA-seq Prediction from DNA

```mermaid
flowchart TD
    DNA[DNA Sequence<br/>524,288 bp] --> Borzoi[Borzoi Neural Network]
    subgraph Borzoi Architecture
        Conv1[Convolutional Layers:<br/>Learn local patterns]
        Conv1 --> Attention[Self-Attention Layers:<br/>Learn long-range interactions]
        Attention --> Upsample[Upsampling Layers:<br/>Increase resolution]
        Upsample --> Output_Layer[Output: RNA Coverage<br/>32 bp resolution]
    end
    Borzoi --> Tissues[Predictions for 89 Tissues/Cell Types]
    Tissues --> TPM[TPM Values:<br/>Transcripts Per Million]
    style Borzoi fill:#87CEEB
    style TPM fill:#90EE90
```

**What is Borzoi?**

Borzoi is a deep learning model developed at Calico Life Sciences that predicts **RNA-seq coverage** (how much RNA is produced from each part of the genome) directly from DNA sequence. This matters because it:

- Predicts gene expression without wet-lab RNA sequencing
- Accounts for multiple layers of regulation (transcription, splicing, polyadenylation)
- Provides tissue-specific predictions

**How does it work?**

1. **Input**: 524,288 base pairs of DNA sequence (524 kb)
2. **Neural Network Processing**:
   - **Convolutional layers**: Learn local DNA patterns (e.g., transcription factor binding sites)
   - **Self-attention layers**: Learn long-range interactions between regulatory elements
   - **Upsampling layers**: Increase resolution from 128 bp to 32 bp
3. **Output**: RNA coverage at 32 bp resolution for 89 different tissues/cell types

**Key Concept: RNA-seq Coverage**

RNA-seq coverage shows how many RNA molecules were sequenced at each position in the genome:

```
Position:  1000    1100    1200    1300    1400
Exon 1:    ████████████████
Intron:                    ░
Exon 2:                        ████████████
Coverage:  2.5     3.1     0.1     2.8     3.0
```

**TPM (Transcripts Per Million)**

Borzoi outputs are converted to **TPM** values. TPM first normalizes each transcript's read count by its length, then rescales so that the values in a sample sum to one million:

```
rate_i = Number of reads mapped to transcript i / Length of transcript i in kb
TPM_i  = (rate_i / Sum of rate_j over all transcripts j) × 1,000,000
```

**Why TPM?**

- Normalizes for gene length (longer genes generate more reads)
- Normalizes for sequencing depth (every sample sums to one million)
- Comparable across genes within a sample

**Pipeline Implementation:**

The pipeline has two Borzoi processes:

**Process 1: FILTER_VCF**

```python
# Extract variants in coding regions + 1000 bp upstream regulatory regions
# Create filtered VCF containing only variants in or near protein-coding genes
```

**Process 2: PREDICT_EXPRESSION**

```python
# For each protein-coding gene with mutations:
#   1. Extract DNA sequence (reference + mutations)
#   2. Run Borzoi to predict RNA coverage
#   3. Calculate TPM by summing coverage over exons
#   4. Generate TPM matrix: Genes × Tissues
```

**Example Output:**

```csv
Gene,Adipose_Tissue,Brain_Cortex,Heart,Liver,Muscle
BRCA1,12.5,8.3,5.2,15.7,7.9
TP53,45.2,52.1,38.9,42.3,35.6
APOE,8.7,125.3,6.1,78.2,5.4
```

This table shows predicted RNA expression (TPM) for each gene in each tissue.

**MANE Dataset**

The pipeline uses the **MANE (Matched Annotation from NCBI and EMBL-EBI)** dataset:

- Contains reference transcript sequences for all human protein-coding genes
- Provides consensus between RefSeq and Ensembl/GENCODE annotations
- Includes exon/intron boundaries needed for TPM calculation

### 3. VCF2Prot: DNA Variants to Protein Sequences

```mermaid
flowchart TD
    VCF[VCF File:<br/>Genetic Variants] --> Annotate[BCFtools CSQ:<br/>Annotate Variants]
    Reference[Reference Genome<br/>GRCh38] --> Annotate
    GFF[Gene Annotations<br/>GFF3 Format] --> Annotate
    Annotate --> Annotated_VCF[Annotated VCF:<br/>Functional Consequences]
    Annotated_VCF --> VCF2Prot[VCF2Prot Tool]
    MANE_Ref[MANE Reference<br/>Transcripts] --> VCF2Prot
    VCF2Prot --> Mutated_Proteins[Mutated Protein<br/>Sequences FASTA]
    style VCF2Prot fill:#FFB6C1
    style Mutated_Proteins fill:#90EE90
```

**What is VCF2Prot?**

VCF2Prot is a tool that translates DNA variants into their effects on protein sequences. It:

- Takes variants from VCF files
- Maps them to gene transcripts
- Predicts how variants change the protein sequence
- Outputs mutated protein sequences

**Process Flow:**

1. **Variant Annotation (BCFtools CSQ)**
   - Maps variants to genes and transcripts
   - Determines functional consequence:
     - Missense: Changes one amino acid
     - Nonsense: Creates premature stop codon
     - Frameshift: Shifts reading frame
     - Synonymous: No change to amino acid
2. **Protein Sequence Prediction (VCF2Prot)**
   - Loads reference protein sequences from MANE
   - Applies variants to generate mutated sequences
   - Handles complex variants (insertions, deletions)

**Example:**

```
Reference DNA:  ATG GCT AAA TGC
Reference RNA:  AUG GCU AAA UGC
Reference Prot: Met-Ala-Lys-Cys

Variant: Position 4, G→T

Mutant DNA:     ATG TCT AAA TGC
Mutant RNA:     AUG UCU AAA UGC
Mutant Prot:    Met-Ser-Lys-Cys
                    ^^^ Changed amino acid!
```

**Output Format: FASTA**

```
>Patient_001_ENST00000357654_BRCA1_p.G1738R
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKL...
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIE...
```

Each record contains:

- Patient ID
- Transcript ID
- Gene name
- Protein variant notation
- Full mutated protein sequence (truncated here)

### 4. RNA2ProteinExpression: RNA to Protein Prediction

```mermaid
flowchart TD
    RNA_TPM[RNA TPM Values<br/>from Borzoi] --> Model[Deep Learning Model]
    subgraph Neural Network
        Input[Input Layer:<br/>RNA Expression]
        Hidden1[Hidden Layer 1:<br/>Tissue Context]
        Hidden2[Hidden Layer 2:<br/>Translation Efficiency]
        Output[Output Layer:<br/>Protein Abundance]
        Input --> Hidden1
        Hidden1 --> Hidden2
        Hidden2 --> Output
    end
    Model --> Protein_Expr[Protein Expression<br/>Log2 Scale]
    style Model fill:#FFE4E1
    style Protein_Expr fill:#90EE90
```

**What is RNA2ProteinExpression?**

This is a custom deep learning model that predicts protein abundance from RNA expression levels. It's trained on:

- RNA-seq data (transcript levels)
- Mass spectrometry data (protein levels)
- Gene Ontology (GO) annotations

**Why is this needed?**

RNA levels don't perfectly correlate with protein levels because:

- **Translation efficiency** varies between genes
- **Protein stability** varies (some proteins are rapidly degraded)
- **Post-transcriptional regulation** (microRNAs, RNA-binding proteins)

Typical RNA-protein correlation: **r = 0.4-0.6** (not 1.0!)

**Model Architecture:**

The neural network learns:

- Which genes have high vs. low translation efficiency
- Tissue-specific effects on protein production
- GO term enrichments that affect protein stability

**Input** (APOE, using the TPM values from the Borzoi example above):

```
Gene_ID, Tissue, RNA_TPM
ENSG00000130203, Brain_Cortex, 125.3
ENSG00000130203, Liver, 78.2
```

**Output:**

```
Gene_ID, Tissue, Protein_Expression_log2
ENSG00000130203, Brain_Cortex, 8.5
ENSG00000130203, Liver, 7.2
```

**Log2 scale**: Protein expression is reported in log2-transformed units, so a difference of 1 corresponds to a two-fold change.

### 5. CORTO: Metabolome Prediction

```mermaid
flowchart TD
    RNA_TPM[RNA TPM Matrix] --> CORTO[CORTO Algorithm]
    Regulon[Regulon Data:<br/>TF-Gene Relationships] --> CORTO
    subgraph CORTO Algorithm
        Correlation[1. Calculate Correlations:<br/>TF <-> Metabolic Genes]
        DPI[2. Data Processing Inequality:<br/>Remove Indirect Edges]
        Bootstrap[3. Bootstrap:<br/>Assess Robustness]
        MRA[4. Master Regulator Analysis:<br/>Identify Key TFs]
        Correlation --> DPI
        DPI --> Bootstrap
        Bootstrap --> MRA
    end
    CORTO --> Metabolome[Metabolome Predictions:<br/>Metabolic Activity]
    style CORTO fill:#F0E68C
    style Metabolome fill:#90EE90
```

**What is CORTO?**

CORTO (Correlation Tool) is an R package that infers gene regulatory networks and identifies master regulators. In this pipeline it is used to predict:

- Activity of metabolic pathways
- Transcription factors (TFs) controlling metabolism
- Metabolite production levels

**How it works:**

1. **Input Regulon**: Pre-defined relationships between transcription factors (TFs) and their target genes (including metabolic enzymes)
2. **Correlation Analysis**: Calculate how TF expression correlates with target gene expression
3. **Data Processing Inequality (DPI)**: Remove indirect relationships
   - If TF1 → TF2 → Gene, remove the direct TF1 → Gene edge
   - Keeps only direct regulatory relationships
4. **Bootstrap**: Test robustness by resampling data
5. **Master Regulator Analysis (MRA)**: Identify TFs whose target genes are significantly enriched in metabolic pathways

**Example:**

```
TF: PPARG (master regulator of fat metabolism)
Target Genes: FABP4, LPL, ADIPOQ, CD36, SCD (all involved in lipid metabolism)
Metabolome Prediction: High lipid synthesis activity
```

**Output:**

```csv
Pathway,Activity_Score,P_value
Glycolysis,-1.5,0.001
TCA_Cycle,2.3,0.0001
Fatty_Acid_Synthesis,1.8,0.002
```

- **Activity Score**: Positive = pathway activated, Negative = pathway suppressed
- **P-value**: Statistical significance

### 6. CIBERSORTx: Immune Cell Deconvolution

```mermaid
flowchart TD
    RNA_TPM[Bulk RNA TPM<br/>Mixed Cell Types] --> Signature[Signature Matrix:<br/>Cell-Specific Genes]
    subgraph CIBERSORTx
        Fractions[Step 1: Fractions<br/>Estimate Cell Proportions]
        HiRes[Step 2: HiRes<br/>Cell-Specific Expression]
    end
    RNA_TPM --> Fractions
    Signature --> Fractions
    Fractions --> Proportions[Cell Type Proportions]
    Proportions --> HiRes
    RNA_TPM --> HiRes
    Signature --> HiRes
    HiRes --> Cell_Specific[Cell-Type-Specific<br/>Gene Expression]
    style CIBERSORTx fill:#DDA0DD
    style Proportions fill:#90EE90
    style Cell_Specific fill:#90EE90
```

**What is CIBERSORTx?**

CIBERSORTx is a computational tool for **immune cell deconvolution**. When you sequence RNA from a tissue sample, you get a mixture of RNA from all cells in that tissue. CIBERSORTx:

- Estimates what proportion of cells belong to each immune cell type
- Infers cell-type-specific gene expression profiles

**Why is this important?**

Immune cells play crucial roles in:

- Fighting infections
- Cancer immunotherapy response
- Autoimmune diseases
- Inflammation

Understanding immune composition helps interpret disease mechanisms.

**How it works:**

**Step 1: Signature Matrix**

A reference matrix showing genes specifically expressed in each cell type:

```
Gene    T_cells  B_cells  Macrophages  NK_cells
CD3D    HIGH     low      low          low
CD19    low      HIGH     low          low
CD68    low      low      HIGH         low
NKG7    low      low      low          HIGH
```

**Step 2: CIBERSORTx Fractions**

Uses **Support Vector Regression (SVR)** to solve:

```
Bulk_Expression = Σ (Proportion_i × Signature_i)
```

Where:

- Bulk_Expression = measured RNA in tissue
- Proportion_i = fraction of cell type i
- Signature_i = expression pattern of cell type i

**Step 3: CIBERSORTx HiRes**

Once the proportions are known, infer gene expression within each cell type by:

- Modeling tissue expression as a weighted sum of cell-type contributions
- Deconvolving to separate cell-type-specific signals

**Example Output:**

**Fractions:**

```csv
Sample,CD8_T_cells,CD4_T_cells,B_cells,NK_cells,Monocytes
Patient_001_Brain,0.05,0.08,0.02,0.01,0.15
Patient_001_Liver,0.12,0.15,0.08,0.03,0.22
```

**HiRes:**

```csv
Tissue,Cell_Type,CD3D,CD19,CD68
Brain_Patient_001,CD8_T_cells,HIGH,low,low
Brain_Patient_001,B_cells,low,HIGH,low
```

**Pipeline Implementation:**

1. **CONVERT_TO_TXT**: Convert CSV to tab-delimited format (CIBERSORTx input format)
2. **CIBERSORTx_FRACTIONS**: Estimate cell proportions
3. **CIBERSORTx_HIRES**: Infer cell-specific expression
4.
   **ADD_TISSUE_NAMES**: Add tissue annotations to output

---

## Workflow Execution

### Nextflow: Workflow Orchestration

```mermaid
flowchart TD
    Config[nextflow.config:<br/>Configuration] --> NF[Nextflow Engine]
    Params[params.json:<br/>Parameters] --> NF
    subgraph Nextflow Engine
        Parse[Parse Workflow DSL]
        Schedule[Schedule Processes]
        Execute[Execute in Docker/Singularity]
        Monitor[Monitor & Checkpoint]
        Parse --> Schedule
        Schedule --> Execute
        Execute --> Monitor
    end
    NF --> Channels[Data Channels:<br/>Pass Files Between Processes]
    Channels --> Processes[Execute Processes]
    style NF fill:#87CEEB
```

**What is Nextflow?**

Nextflow is a workflow orchestration system designed for data-intensive computational pipelines. It:

- Manages dependencies between analysis steps
- Handles parallel execution
- Provides automatic checkpointing (resume failed runs)
- Supports multiple execution platforms (local, HPC clusters, cloud)

**Key Concepts:**

1. **Processes**: Individual computational tasks (e.g., "PREDICT_EXPRESSION")
2. **Channels**: Data streams that connect processes
3. **Operators**: Manipulate channels (e.g., `mix`, `flatten`, `collect`)

**Example Process Definition:**

```nextflow
process PREDICT_EXPRESSION {
    container "${params.container_borzoi}"  // Docker image
    memory 4.GB                             // Memory requirement
    accelerator 1                           // GPU requirement

    input:
    path vcf_filtered                       // Input file
    path MANE                               // Reference data

    output:
    path "*_TPM.csv"                        // Output file pattern

    script:
    """
    #!/opt/conda/envs/borzoi/bin/python
    # Python script here
    """
}
```

**Channel Example:**

```nextflow
// Mix male and female patient VCFs
txt_ch = f_var.mix(m_var).flatten()

// This creates a channel with all VCF files:
// [Patient_001.vcf, Patient_002.vcf, ...]
```

### Complete Workflow

```mermaid
flowchart TD
    Start([Start]) --> CheckDisease{Disease or Healthy?}
    CheckDisease -->|Disease| GetStats[get_disease_stats_no_patients:<br/>Analyze UK Biobank]
    CheckDisease -->|Healthy| LoadHealthy[Load Pre-computed<br/>Healthy Genomes]
    GetStats --> GenM[generate_m_variants_cudf:<br/>Male Patients]
    GetStats --> GenF[generate_f_variants_cudf:<br/>Female Patients]
    LoadHealthy --> LoadM[Load Male<br/>Reference]
    LoadHealthy --> LoadF[Load Female<br/>Reference]
    GenM --> MakeVCF[make_vcfs:<br/>Generate VCF Files]
    GenF --> MakeVCF
    LoadM --> MakeVCF
    LoadF --> MakeVCF
    MakeVCF --> FilterVCF[FILTER_VCF:<br/>Extract Coding Variants]
    MakeVCF --> VCF2Prot[VCF2PROT:<br/>Generate Mutated Proteins]
    FilterVCF --> PredictExpr[PREDICT_EXPRESSION:<br/>Borzoi RNA Prediction]
    PredictExpr --> RNA2Prot[RNA2PROTEXPRESSION:<br/>Protein Prediction]
    PredictExpr --> CORTO[CORTO:<br/>Metabolome Prediction]
    PredictExpr --> Convert[CONVERT_TO_TXT:<br/>Format Conversion]
    Convert --> CiberFrac[CIBERSORTx_FRACTIONS:<br/>Cell Proportions]
    CiberFrac --> CiberHires[CIBERSORTx_HIRES:<br/>Cell-Specific Expression]
    CiberHires --> AddTissue[ADD_TISSUE_NAMES_TO_CIBERSORTX:<br/>Annotate Results]
    RNA2Prot --> End([Complete<br/>Digital Patient])
    CORTO --> End
    AddTissue --> End
    VCF2Prot --> End
    style Start fill:#90EE90
    style End fill:#FFB6C1
    style PredictExpr fill:#87CEEB
```

### Execution Example

**1. Configuration (params.json)**

```json
{
  "disease": "schizophrenia",
  "n_pat": 10,
  "percent_male": 0.5,
  "container_borzoi": "harbor.cluster.omic.ai/omic/digital-patients/borzoi:latest"
}
```

**2. Launch Pipeline**

```bash
nextflow run test.nf -params-file params.json
```

**3. Nextflow Execution**

```
N E X T F L O W  ~  version 21.04.0
Launching `test.nf` [amazing_babbage] - revision: 1a2b3c4d

[Synthea] Submitted process > get_disease_stats_no_patients
[Synthea] Submitted process > generate_m_variants_cudf (1)
[Synthea] Submitted process > generate_f_variants_cudf (1)
[Stage] Completed process > make_vcfs (10 files)
[Borzoi] Submitted process > FILTER_VCF (10)
[Borzoi] Submitted process > PREDICT_EXPRESSION (10)
...
Pipeline completed successfully!
```

**4. Directory Structure**

```
/outdir/
├── vcf2prot/
│   ├── Patient_001_transcript_id_mutations.fasta
│   └── Patient_002_transcript_id_mutations.fasta
├── borzoi/
│   ├── Patient_001_TPM.csv
│   └── Patient_002_TPM.csv
├── rna2protein/
│   ├── Patient_001_Protein_Expression_log2.csv
│   └── Patient_002_Protein_Expression_log2.csv
├── corto/
│   ├── Patient_001_metabolome.csv
│   └── Patient_002_metabolome.csv
└── ecotyper/
    ├── fractions/
    │   └── Patient_001_CIBERSORTx_Results.txt
    └── hires/
        └── Patient_001_immune_cells.csv
```

---

## Technical Architecture

### Docker Containers

Each pipeline component runs in an isolated Docker container with its own dependencies:

```mermaid
flowchart LR
    subgraph Docker Images
        Synthea_Img[Synthea Container:<br/>- Java JDK<br/>- Python<br/>- BCFtools<br/>- GATK]
        Borzoi_Img[Borzoi Container:<br/>- TensorFlow<br/>- PyTorch<br/>- Baskerville<br/>- Python packages]
        VCF2Prot_Img[VCF2Prot Container:<br/>- BCFtools<br/>- vcf2prot binary<br/>- Reference genomes]
        RNA2Prot_Img[RNA2Protein Container:<br/>- PyTorch<br/>- Deep learning model<br/>- GO annotations]
        CORTO_Img[CORTO Container:<br/>- R<br/>- corto package<br/>- Regulon data]
        CIBERSORTx_Img[CIBERSORTx Container:<br/>- Python<br/>- R<br/>- CIBERSORTx binaries<br/>- Signature matrices]
    end
    Registry[Container Registry:<br/>harbor.cluster.omic.ai] --> Synthea_Img
    Registry --> Borzoi_Img
    Registry --> VCF2Prot_Img
    Registry --> RNA2Prot_Img
    Registry --> CORTO_Img
    Registry --> CIBERSORTx_Img
```

**Why Docker?**

- **Reproducibility**: Same environment every run
- **Isolation**: Avoid dependency conflicts
- **Portability**: Run anywhere (laptop, cluster, cloud)

**Example Dockerfile (Borzoi):**

```dockerfile
FROM tensorflow/tensorflow:2.12.0-gpu

# Install conda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

# Create borzoi environment
RUN conda create -n borzoi python=3.9
RUN conda install -n borzoi tensorflow-gpu pandas numpy pysam

# Install Borzoi (built on Calico's baskerville framework)
RUN git clone https://github.com/calico/borzoi.git /home/omic/borzoi
RUN pip install -e /home/omic/borzoi

# Download pre-trained models (the download URL is not given in this example)
RUN mkdir -p /home/omic/borzoi/saved_models/f0
RUN wget -O /home/omic/borzoi/saved_models/f0/model0_best.h5

# Default to an interactive shell
CMD ["/bin/bash"]
```

### Data Flow Architecture

```mermaid
flowchart TD
    subgraph Input Data
        UKBB[UK Biobank:<br/>Genetic Variants]
        MANE[MANE Transcripts:<br/>Reference Sequences]
        RefGenome[Reference Genome:<br/>GRCh38]
        Regulon[Regulon Database:<br/>TF-Gene Networks]
        LM22[LM22 Signature Matrix:<br/>Immune Cell Markers]
    end
    subgraph Processing
        Synthea --> |VCF| Filter
        Synthea --> |VCF| VCF2Prot
        Filter --> |Filtered VCF| Borzoi
        Borzoi --> |TPM CSV| RNA2Prot
        Borzoi --> |TPM CSV| CORTO
        Borzoi --> |TPM CSV| CIBERSORTx
    end
    UKBB --> Synthea
    MANE --> Borzoi
    MANE --> VCF2Prot
    RefGenome --> VCF2Prot
    Regulon --> CORTO
    LM22 --> CIBERSORTx
    subgraph Output Data
        RNA2Prot --> |Protein Expression| Results
        CORTO --> |Metabolome| Results
        CIBERSORTx --> |Immune Cells| Results
        VCF2Prot --> |Mutated Proteins| Results
    end
```

### Computational Requirements

| Process            | CPU     | RAM  | GPU      | Time (per patient) |
| ------------------ | ------- | ---- | -------- | ------------------ |
| Synthea            | 2 cores | 4 GB | No       | ~5 min             |
| FILTER_VCF         | 4 cores | 4 GB | No       | ~2 min             |
| PREDICT_EXPRESSION | 8 cores | 4 GB | Yes (1x) | ~30-60 min         |
| VCF2PROT           | 2 cores | 2 GB | No       | ~10 min            |
| RNA2PROTEXPRESSION | 4 cores | 2 GB | Yes (1x) | ~5 min             |
| CORTO              | 2 cores | 1 GB | No       | ~3 min             |
| CIBERSORTx         | 4 cores | 4 GB | No       | ~15 min            |

**Total time per patient: ~70-100 minutes** when the steps run back to back (the sum of the per-process times above; PREDICT_EXPRESSION dominates).

**Parallelization:**

Nextflow automatically parallelizes patient processing:

- Patients are processed independently, so wall-clock time depends on how many GPUs are available
- As a rough estimate from the table above, 10 patients on 4 GPUs need three waves of PREDICT_EXPRESSION (4 + 4 + 2), or about 1.5-3 hours for that step, compared with 5-10 hours when patients run one at a time

---

## Outputs and Applications

### Complete Digital Patient Profile

```mermaid
flowchart LR
    Patient[Digital Patient:<br/>Patient_001] --> Genome[Genomic Profile]
    Patient --> Transcriptome[Transcriptomic Profile]
    Patient --> Proteome[Proteomic Profile]
    Patient --> Metabolome[Metabolomic Profile]
    Patient --> Immune[Immune Profile]
    Genome --> G1[Genetic Variants:<br/>SNPs, Indels]
    Genome --> G2[Disease-Associated<br/>Mutations]
    Transcriptome --> T1[RNA Expression:<br/>20,000+ genes]
    Transcriptome --> T2[Tissue-Specific<br/>Expression]
    Proteome --> P1[Protein Abundance:<br/>10,000+ proteins]
    Proteome --> P2[Mutated Protein<br/>Sequences]
    Metabolome --> M1[Pathway Activity:<br/>Metabolism]
    Metabolome --> M2[Master Regulator<br/>TFs]
    Immune --> I1[Cell Composition:<br/>T-cells, B-cells, etc.]
    Immune --> I2[Cell-Specific<br/>Expression]
    style Patient fill:#FFD700
```

### Example Application: Cancer Research

```mermaid
flowchart TD
    Question[Research Question:<br/>How do BRCA1 mutations<br/>affect breast cancer?]
    Question --> Generate[Generate 100 synthetic<br/>patients with BRCA1<br/>mutations]
    Generate --> Analyze{Analyze Digital Patients}
    Analyze --> RNA[RNA Analysis:<br/>Find genes<br/>co-expressed with BRCA1]
    Analyze --> Protein[Protein Analysis:<br/>Identify altered<br/>pathways]
    Analyze --> Metabolome[Metabolome Analysis:<br/>Detect metabolic<br/>shifts]
    Analyze --> Immune[Immune Analysis:<br/>Characterize immune<br/>infiltration]
    RNA --> Insight[Insights into<br/>Disease Mechanisms]
    Protein --> Insight
    Metabolome --> Insight
    Immune --> Insight
    Insight --> Drug[Drug Target<br/>Discovery]
    Insight --> Biomarker[Biomarker<br/>Identification]
    style Question fill:#FFB6C1
    style Insight fill:#90EE90
```

### Output Files Summary

**1. Genetic Variants (VCF)**

- **File**: `Patient_001_variants.vcf`
- **Content**: All genetic variants for the patient
- **Use**: Understand the genetic basis of disease

**2. RNA Expression (TPM)**

- **File**: `Patient_001_TPM.csv`
- **Content**: Gene expression across 89 tissues
- **Use**: Identify dysregulated genes, tissue-specific effects

**3. Protein Expression**

- **File**: `Patient_001_Protein_Expression_log2.csv`
- **Content**: Predicted protein abundance
- **Use**: Understand functional consequences of RNA changes

**4. Mutated Proteins (FASTA)**

- **File**: `Patient_001_transcript_id_mutations.fasta`
- **Content**: Protein sequences with mutations
- **Use**: Study structural changes, predict drug binding

**5. Metabolome**

- **File**: `Patient_001_metabolome.csv`
- **Content**: Pathway activity scores
- **Use**: Understand metabolic reprogramming

**6. Immune Cells**

- **File**: `Patient_001_immune_cells.csv`
- **Content**: Cell type proportions and expression
- **Use**: Characterize the immune microenvironment

### Research Applications

1. **Disease Mechanism Discovery**
   - Generate patients with specific mutations
   - Compare to healthy controls
   - Identify molecular changes caused by mutations
2. **Drug Target Identification**
   - Find genes/proteins consistently altered across patients
   - Prioritize targets for therapeutic intervention
3. **Biomarker Discovery**
   - Identify molecular signatures distinguishing diseased from healthy
   - Develop diagnostic tests
4. **Precision Medicine**
   - Model individual patient molecular profiles
   - Predict treatment response
   - Personalize therapy
5. **Clinical Trial Simulation**
   - Generate virtual patient cohorts
   - Test hypotheses before expensive trials
   - Power calculations and study design
6.
   **Education and Training**
   - Teach students about multi-omics analysis
   - No patient privacy concerns
   - Unlimited data generation

---

## Conclusion

This Digital Patient Pipeline integrates:

- **Synthetic data generation**: Realistic patient simulation
- **Deep learning**: RNA expression prediction from DNA
- **Bioinformatics**: Multi-omic data integration
- **Workflow engineering**: Scalable, reproducible pipelines

By combining these technologies, researchers can:

- Study disease mechanisms without accessing sensitive patient data
- Generate unlimited data for hypothesis testing
- Model personalized molecular phenotypes
- Accelerate drug discovery and precision medicine

The pipeline produces predicted molecular profiles spanning:

- **Genomics** (DNA variants)
- **Transcriptomics** (RNA expression)
- **Proteomics** (protein abundance)
- **Metabolomics** (metabolic activity)
- **Immunomics** (immune cell composition)

Together, these layers give a model-based view of how genetic variants may cascade through biological systems to produce disease phenotypes.

---

## Glossary of Terms

**Bioinformatics Terms:**

- **TPM**: Transcripts Per Million - normalized measure of RNA abundance
- **VCF**: Variant Call Format - standard format for genetic variants
- **FASTA**: Text format for representing DNA/protein sequences
- **SNP**: Single Nucleotide Polymorphism - single base DNA variant
- **Indel**: Insertion or Deletion - adding or removing DNA bases
- **Exon**: Segment of a gene retained in mature RNA (includes the protein-coding sequence)
- **Intron**: Non-coding segment removed during RNA processing
- **Transcription**: DNA → RNA
- **Translation**: RNA → Protein
- **Gene Expression**: Process by which a gene's information is used to make a functional product (RNA or protein)

**Machine Learning Terms:**

- **Neural Network**: Computer model inspired by brain structure
- **Convolutional Layer**: Detects local patterns in sequences
- **Attention Layer**: Learns long-range relationships
- **Training**: Teaching a model from example data
- **Prediction**: Using a model on new data

**Pipeline Terms:**

- **Process**: Individual computational step
- **Channel**: Data stream connecting processes
- **Container**: Isolated environment with dependencies
- **Workflow**: Series of connected processes
- **Nextflow**: Workflow orchestration system

**Biological Terms:**

- **Genome**: Complete set of DNA in an organism
- **Transcriptome**: Complete set of RNA molecules
- **Proteome**: Complete set of proteins
- **Metabolome**: Complete set of metabolites
- **Phenotype**: Observable characteristics of an organism

---

## References

### Pipeline Components

1. **Nextflow**: Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. *Nature Biotechnology*, 35:316-319.
2. **Synthea**: Walonoski et al. (2018). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. *JAMIA*, 25(3):230-238.
3. **Borzoi**: Linder et al. (2025). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. *Nature Genetics*, 57:949-961.
4. **CIBERSORTx**: Newman et al. (2019). Determining cell-type abundance and expression from bulk tissues with digital cytometry. *Nature Biotechnology*, 37:773-782.
5. **CORTO**: Mercatelli et al. (2020). corto: a lightweight R package for gene network inference and master regulator analysis. *Bioinformatics*, 36(12):3916-3917.

### Biological Background

6. **Gene Structure**: Lim et al. (2018). The exon-intron gene structure upstream of the initiation codon. *Nucleic Acids Research*, 46(5):2232-2244.
7. **TPM Normalization**: Zhao et al. (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. *RNA*, 26(8):903-909.
8. **VCF Format**: Danecek et al. (2011). The variant call format and VCFtools. *Bioinformatics*, 27(15):2156-2158.
9. **MANE**: Morales et al. (2022). A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. *Nature*, 604:310-315.

---

*This documentation is intended to give researchers without an extensive biological background a working understanding of a complex bioinformatics pipeline. For questions or clarifications, please consult the cited references or contact the pipeline maintainers.*