Large reference/model files excluded from repo - to be staged to S3 or baked into Docker images.
# Digital Patient Pipeline: Complete Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Pipeline Overview](#pipeline-overview)
3. [Biological Background](#biological-background)
4. [Pipeline Components](#pipeline-components)
5. [Workflow Execution](#workflow-execution)
6. [Technical Architecture](#technical-architecture)
7. [Outputs and Applications](#outputs-and-applications)

---

## Introduction

This document explains the **Digital Patient Pipeline**, a bioinformatics workflow that generates synthetic patient data and predicts multiple layers of molecular biology from genomic variants. The pipeline is implemented in **Nextflow**, a workflow orchestration system, and integrates several computational biology tools to simulate how genetic mutations affect gene expression, protein production, immune cell composition, and metabolic activity.

### Purpose

The Digital Patient Pipeline serves several purposes:

- **Synthetic Patient Generation**: Creates realistic but synthetic patient profiles with genetic variants associated with specific diseases
- **Multi-Omic Prediction**: Predicts gene expression (RNA), protein abundance, immune cell composition, and metabolic activity from DNA sequence alone
- **Clinical Research**: Enables researchers to study disease mechanisms without accessing sensitive patient data
- **Personalized Medicine**: Models how individual genetic variants affect molecular phenotypes

---

## Pipeline Overview

```mermaid
flowchart TD
    Start([Start Pipeline]) --> Decision{Patient Type?}
    Decision -->|Disease| Synthea[Synthea: Generate Disease Patients]
    Decision -->|Healthy| Healthy[Synthea: Generate Healthy Patients]

    Synthea --> VCF[VCF Files: Genetic Variants]
    Healthy --> VCF

    VCF --> FilterVCF[Filter VCF: Extract Coding Variants]
    VCF --> VCF2Prot[VCF2Prot: Generate Mutated Proteins]

    FilterVCF --> Borzoi[Borzoi: Predict RNA Expression]

    Borzoi --> RNA2Prot[RNA2Protein: Predict Protein Expression]
    Borzoi --> CORTO[CORTO: Predict Metabolome]
    Borzoi --> CIBERSORTx[CIBERSORTx: Predict Immune Cells]

    RNA2Prot --> Output1[Protein Expression Profiles]
    CORTO --> Output2[Metabolic Activity Profiles]
    CIBERSORTx --> Output3[Immune Cell Composition]
    VCF2Prot --> Output4[Mutated Protein Sequences]

    Output1 --> End([Complete Digital Patient])
    Output2 --> End
    Output3 --> End
    Output4 --> End

    style Start fill:#90EE90
    style End fill:#FFB6C1
    style Borzoi fill:#87CEEB
    style Synthea fill:#DDA0DD
```

### Pipeline Flow Summary

1. **Patient Generation**: Synthea generates synthetic patients with realistic genetic variants
2. **Variant Processing**: VCF files containing genetic mutations are filtered and processed
3. **RNA Expression Prediction**: Borzoi predicts how mutations affect gene expression
4. **Downstream Analysis**: Multiple tools analyze predicted RNA to generate comprehensive molecular profiles
5. **Integration**: Results are combined to create a complete "digital patient"

---

## Biological Background

To understand this pipeline, it helps to understand the central dogma of molecular biology and how genetic information flows through biological systems.

### The Central Dogma: DNA → RNA → Protein

```mermaid
flowchart LR
    DNA[DNA: Genetic Code] -->|Transcription| RNA[RNA: Message]
    RNA -->|Translation| Protein[Protein: Function]
    Protein --> Phenotype[Cellular Phenotype]

    style DNA fill:#FFE4B5
    style RNA fill:#E0FFFF
    style Protein fill:#FFE4E1
    style Phenotype fill:#F0E68C
```

#### 1. DNA (Deoxyribonucleic Acid)

**DNA** is the blueprint of life, containing the genetic instructions for all cellular functions. DNA consists of:

- **Four nucleotide bases**: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
- **Double helix structure**: Two complementary strands wound together
- **Genes**: Specific segments of DNA that encode instructions for proteins

**Example DNA Sequence**: `ATGCGATCCGTA`

#### 2. RNA (Ribonucleic Acid)

During **transcription**, DNA is copied into RNA:

- **RNA polymerase** enzyme reads the DNA template strand and creates a complementary RNA strand
- RNA uses **Uracil (U)** instead of Thymine (T)
- The RNA carries the genetic message from the nucleus to protein-making machinery

**Example RNA Sequence**: `AUGCGAUCCGUA` (the DNA sequence above, read as the coding strand, with U in place of T)

#### 3. Proteins

During **translation**, RNA is decoded to build proteins:

- **Ribosomes** read RNA in groups of three bases called **codons**
- Each codon specifies one **amino acid**
- Amino acids chain together to form proteins

**Example**: `AUG` → Methionine, `CGA` → Arginine, `UCC` → Serine, `GUA` → Valine

**Protein**: Methionine-Arginine-Serine-Valine
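
The transcription and translation steps above can be sketched in a few lines of Python. This is only an illustration of the standard genetic code applied to the example sequence; the function names `transcribe` and `translate` are made up for this sketch and are not part of the pipeline.

```python
from itertools import product

# Standard genetic code, keyed by DNA codon (T in place of U).
# Codons are enumerated in TCAG order, the conventional compact encoding.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGCGATCCGTA")
print(mrna)             # AUGCGAUCCGUA
print(translate(mrna))  # MRSV (Met-Arg-Ser-Val)
```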

### Gene Structure

Genes in eukaryotes (organisms with nuclei, like humans) have a complex structure:

```mermaid
flowchart LR
    subgraph Gene Structure
        Promoter[Promoter: -1000bp] --> UTR5[5' UTR]
        UTR5 --> Exon1[Exon 1]
        Exon1 --> Intron1[Intron 1]
        Intron1 --> Exon2[Exon 2]
        Exon2 --> Intron2[Intron 2]
        Intron2 --> Exon3[Exon 3]
        Exon3 --> UTR3[3' UTR]
    end

    Exon1 -.->|Splicing| mRNA[Mature mRNA]
    Exon2 -.-> mRNA
    Exon3 -.-> mRNA

    style Exon1 fill:#90EE90
    style Exon2 fill:#90EE90
    style Exon3 fill:#90EE90
    style Intron1 fill:#FFB6C1
    style Intron2 fill:#FFB6C1
    style Promoter fill:#FFD700
```

**Key Components:**

- **Promoter**: Regulatory region upstream of gene (-1000 bp) that controls when the gene is turned on
- **5' UTR (Untranslated Region)**: Beginning of RNA that isn't translated into protein
- **Exons**: Segments that ARE kept in the final RNA; their coding portions are translated into protein
- **Introns**: Segments that are REMOVED during RNA processing
- **3' UTR**: End region of RNA that isn't translated

**Why this matters**: The Borzoi model in this pipeline predicts which parts of genes will be transcribed into RNA, including both exons and introns, before they're processed.

### Genetic Variants and Their Effects

```mermaid
flowchart TD
    Variant[Genetic Variant] --> Type{Type?}

    Type -->|SNP| SNP[Single Nucleotide<br/>Polymorphism:<br/>A->G]
    Type -->|Insertion| INS[Insertion:<br/>ATCG->ATCGGG]
    Type -->|Deletion| DEL[Deletion:<br/>ATCG->A]

    SNP --> Effect1{Location?}
    INS --> Effect1
    DEL --> Effect1

    Effect1 -->|Promoter| E1[Changes expression level]
    Effect1 -->|Exon| E2[Changes protein sequence]
    Effect1 -->|Splice site| E3[Changes splicing pattern]
    Effect1 -->|Intron| E4[May affect regulation]

    style Variant fill:#FFB6C1
    style E1 fill:#87CEEB
    style E2 fill:#87CEEB
    style E3 fill:#87CEEB
    style E4 fill:#87CEEB
```

**Variants** are differences in DNA sequence between individuals:

- **SNP (Single Nucleotide Polymorphism)**: Single base change (e.g., A→G)
- **Insertion**: Extra bases added
- **Deletion**: Bases removed
- **Structural Variant**: Large-scale DNA rearrangements

These variants can affect:

- **Gene expression**: How much RNA is made
- **Protein sequence**: Which amino acids are in the protein
- **Splicing**: Which exons are included in mature RNA

### VCF Format: Storing Genetic Variants

The **VCF (Variant Call Format)** is a standardized, tab-delimited text format that stores genetic variants (columns are space-aligned below for readability):

```
#CHROM  POS    ID     REF  ALT  QUAL  FILTER  INFO
chr1    12345  rs123  A    G    100   PASS    DP=50
chr2    67890  rs456  TC   T    95    PASS    DP=45
```

**Columns explained:**

- **CHROM**: Chromosome where variant is located (chr1, chr2, etc.)
- **POS**: Position on the chromosome (base pair number)
- **ID**: Database identifier (often from dbSNP database)
- **REF**: Reference base(s) at this position
- **ALT**: Alternate base(s) - the variant
- **QUAL**: Quality score for the variant call
- **FILTER**: Whether variant passed quality filters
- **INFO**: Additional information (e.g., read depth)
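
To make the column layout concrete, here is a minimal Python sketch that parses the two example records and labels each as a SNP, insertion, or deletion by comparing REF and ALT lengths. It assumes one ALT allele per record and ignores header lines; `Variant`, `parse_vcf_line`, and `variant_type` are hypothetical names for illustration, not pipeline functions.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    var_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Variant:
    """Parse the first five tab-separated columns of a VCF data line."""
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return Variant(chrom, int(pos), var_id, ref, alt)


def variant_type(variant: Variant) -> str:
    """Classify a variant by comparing REF and ALT lengths (single ALT assumed)."""
    if len(variant.ref) == 1 and len(variant.alt) == 1:
        return "SNP"
    if len(variant.alt) > len(variant.ref):
        return "insertion"
    if len(variant.alt) < len(variant.ref):
        return "deletion"
    return "substitution"  # multi-base change of equal length


records = [
    "chr1\t12345\trs123\tA\tG\t100\tPASS\tDP=50",
    "chr2\t67890\trs456\tTC\tT\t95\tPASS\tDP=45",
]
for record in records:
    variant = parse_vcf_line(record)
    print(variant.chrom, variant.pos, variant_type(variant))
```

For real files a dedicated parser such as pysam or bcftools is the safer choice, since VCF allows multiple ALT alleles, symbolic alleles, and per-sample genotype columns.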

---

## Pipeline Components

### 1. Synthea: Synthetic Patient Generator

```mermaid
flowchart LR
    Input[Input Parameters] --> Synthea{Synthea Engine}

    subgraph Input Parameters
        Disease[Disease Type:<br/>schizophrenia, cancer, etc.]
        Demographics[Demographics:<br/>Age, Gender, Location]
        N[Number of Patients]
    end

    Synthea --> UKBB[UK Biobank<br/>Genetic Database]
    UKBB --> Disease_Variants[Disease-Associated<br/>Genetic Variants]

    Disease_Variants --> VCF_Out[VCF Files<br/>Patient_001.vcf<br/>Patient_002.vcf]

    style Synthea fill:#DDA0DD
    style VCF_Out fill:#90EE90
```

**What is Synthea?**

Synthea™ is an open-source synthetic patient generator that creates realistic but completely fake patient data. It models:

- Medical history
- Demographics (age, gender, location)
- Health conditions
- Medications and treatments
- Genetic variants

**How it works in this pipeline:**

1. **User specifies disease**: For example, "schizophrenia" or "healthy"
2. **Statistical analysis**: Synthea analyzes the UK Biobank database (a large collection of genetic data from ~500,000 individuals) to find variants statistically associated with the disease
3. **Probability-based sampling**: Variants are selected based on their frequency in diseased vs. healthy populations
4. **VCF generation**: Creates a VCF file for each synthetic patient containing their unique set of genetic variants
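
The probability-based sampling in step 3 can be pictured with the sketch below. It is a deliberately simplified assumption, not the pipeline's actual sampler: each variant is drawn independently with a probability equal to its frequency in the disease cohort, and the frequencies and the `sample_variants` function are invented for illustration. A real sampler would also need to handle genotypes and correlations between nearby variants.

```python
import random


def sample_variants(frequencies, n_patients, seed=0):
    """Draw a variant set per patient; each variant is included with its own probability."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    patients = []
    for _ in range(n_patients):
        carried = [vid for vid, freq in frequencies.items() if rng.random() < freq]
        patients.append(carried)
    return patients


# Made-up carrier frequencies in a hypothetical disease cohort
disease_frequencies = {"rs123": 0.30, "rs456": 0.05, "rs789": 1.0, "rs000": 0.0}

patients = sample_variants(disease_frequencies, n_patients=3)
for index, variants in enumerate(patients, start=1):
    print(f"Patient_{index:03d}", variants)
```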

**Parameters:**

```nextflow
params.disease = 'schizophrenia'  // Disease to model
params.n_pat = 10                 // Number of patients to generate
params.percent_male = 0.5         // Fraction of patients that are male
```

**For Healthy Patients:**

If generating healthy controls, Synthea samples from pre-computed reference genomes representing the genetic diversity of healthy populations.

**Output Example:**

```
Patient_001_variants.vcf
Patient_002_variants.vcf
...
Patient_010_variants.vcf
```

### 2. Borzoi: RNA-seq Prediction from DNA

```mermaid
flowchart TD
    DNA[DNA Sequence<br/>524,288 bp] --> Borzoi[Borzoi Neural Network]

    subgraph Borzoi Architecture
        Conv1[Convolutional Layers:<br/>Learn local patterns]
        Conv1 --> Attention[Self-Attention Layers:<br/>Learn long-range interactions]
        Attention --> Upsample[Upsampling Layers:<br/>Increase resolution]
        Upsample --> Output_Layer[Output: RNA Coverage<br/>32 bp resolution]
    end

    Borzoi --> Tissues[Predictions for 89 Tissues/Cell Types]

    Tissues --> TPM[TPM Values:<br/>Transcripts Per Million]

    style Borzoi fill:#87CEEB
    style TPM fill:#90EE90
```

**What is Borzoi?**

Borzoi is a deep learning model developed at Calico Life Sciences that predicts **RNA-seq coverage** (how much RNA is produced from each part of the genome) directly from DNA sequence. This is useful because it:

- Predicts gene expression without actually doing wet-lab RNA sequencing
- Accounts for multiple layers of regulation (transcription, splicing, polyadenylation)
- Provides tissue-specific predictions

**How does it work?**

1. **Input**: 524,288 base pairs of DNA sequence (524 kb)
2. **Neural Network Processing**:
   - **Convolutional layers**: Learn local DNA patterns (e.g., transcription factor binding sites)
   - **Self-attention layers**: Learn long-range interactions between regulatory elements
   - **Upsampling layers**: Increase resolution from 128 bp to 32 bp
3. **Output**: RNA coverage at 32 bp resolution for 89 different tissues/cell types

**Key Concept: RNA-seq Coverage**

RNA-seq coverage shows how many RNA molecules were sequenced at each position in the genome:

```
Position:  1000    1100    1200    1300    1400
Exon 1:    ████████████████
Intron:                    ░
Exon 2:                        ████████████
Coverage:  2.5     3.1     0.1     2.8     3.0
```

**TPM (Transcripts Per Million)**

Borzoi outputs are converted to **TPM** values:

```
rate_i = Number of reads mapped to transcript i / Length of transcript i in kb

TPM_i  = rate_i / (sum of rate_j over all transcripts j) × 1,000,000
```

**Why TPM?**

- Normalizes for gene length (longer genes generate more reads)
- Normalizes for sequencing depth (TPM values in every sample sum to one million)
- Comparable across genes within a sample
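
As a small worked example of TPM normalization, the sketch below first divides each gene's read count by its length in kilobases and then rescales the resulting rates so they sum to one million. The gene names, counts, and lengths are invented, and `tpm` is a hypothetical helper rather than the pipeline's own code.

```python
def tpm(read_counts, lengths_bp):
    """Convert per-gene read counts to TPM given gene lengths in base pairs."""
    # Step 1: length-normalize (reads per kilobase)
    rates = {gene: read_counts[gene] / (lengths_bp[gene] / 1000) for gene in read_counts}
    # Step 2: rescale so the values sum to one million
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1_000_000 for gene, rate in rates.items()}


counts = {"geneA": 100, "geneB": 100, "geneC": 300}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

values = tpm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value))
```

Here geneA and geneC end up with the same TPM even though geneC has three times as many reads, because geneC is also three times as long.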

**Pipeline Implementation:**

The pipeline has two Borzoi processes:

**Process 1: FILTER_VCF**

```python
# Keep variants in protein-coding genes and in the 1000 bp regulatory
# region upstream of each gene
# Write a filtered VCF containing only those variants
```

**Process 2: PREDICT_EXPRESSION**

```python
# For each protein-coding gene with mutations:
#   1. Extract DNA sequence (reference + mutations)
#   2. Run Borzoi to predict RNA coverage
#   3. Calculate TPM by summing coverage over exons
#   4. Generate TPM matrix: Genes × Tissues
```
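
Steps 1 and 3 of PREDICT_EXPRESSION can be sketched without the model itself. The example below applies a single-base variant to a reference window and sums a per-bin coverage track over the bins that overlap a gene's exons. It is a rough illustration under stated assumptions: coordinates are 1-based and inclusive, whole 32 bp bins are counted even when an exon only partly overlaps them, and `apply_snv`, `exon_bins`, and `sum_exon_coverage` are hypothetical helpers. The Borzoi call that would produce `coverage` is omitted.

```python
BIN_SIZE = 32  # output resolution in bp, as described above


def apply_snv(sequence, window_start, pos, ref, alt):
    """Return the window sequence with one base substituted at genomic position pos."""
    index = pos - window_start
    if sequence[index] != ref:
        raise ValueError(f"reference mismatch at {pos}: expected {ref}, found {sequence[index]}")
    return sequence[:index] + alt + sequence[index + 1:]


def exon_bins(window_start, exons):
    """Collect indices of bins overlapping any exon (a set avoids double counting)."""
    bins = set()
    for start, end in exons:
        first = (start - window_start) // BIN_SIZE
        last = (end - window_start) // BIN_SIZE
        bins.update(range(first, last + 1))
    return bins


def sum_exon_coverage(coverage, window_start, exons):
    """Sum predicted coverage over exon bins that fall inside the window."""
    return sum(coverage[b] for b in exon_bins(window_start, exons) if 0 <= b < len(coverage))


# Toy window starting at genomic position 1001 with six 32 bp bins
coverage = [2.0, 3.0, 0.0, 0.0, 4.0, 1.0]
exons = [(1001, 1064), (1129, 1192)]

print(apply_snv("ATGGCT", 1001, 1004, "G", "T"))  # ATGTCT
print(sum_exon_coverage(coverage, 1001, exons))   # 10.0
```

The summed coverage is a raw expression score per gene and tissue; turning it into TPM still requires normalizing across genes.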

**Example Output:**

```csv
Gene,Adipose_Tissue,Brain_Cortex,Heart,Liver,Muscle
BRCA1,12.5,8.3,5.2,15.7,7.9
TP53,45.2,52.1,38.9,42.3,35.6
APOE,8.7,125.3,6.1,78.2,5.4
```

This table shows predicted RNA expression (TPM) for each gene in each tissue.

**MANE Dataset**

The pipeline uses the **MANE (Matched Annotation from NCBI and EMBL-EBI)** dataset:

- Contains reference transcript sequences for all human protein-coding genes
- Provides consensus between RefSeq and Ensembl/GENCODE annotations
- Includes exon/intron boundaries needed for TPM calculation

### 3. VCF2Prot: DNA Variants to Protein Sequences

```mermaid
flowchart TD
    VCF[VCF File:<br/>Genetic Variants] --> Annotate[BCFtools CSQ:<br/>Annotate Variants]
    Reference[Reference Genome<br/>GRCh38] --> Annotate
    GFF[Gene Annotations<br/>GFF3 Format] --> Annotate

    Annotate --> Annotated_VCF[Annotated VCF:<br/>Functional Consequences]

    Annotated_VCF --> VCF2Prot[VCF2Prot Tool]
    MANE_Ref[MANE Reference<br/>Transcripts] --> VCF2Prot

    VCF2Prot --> Mutated_Proteins[Mutated Protein<br/>Sequences FASTA]

    style VCF2Prot fill:#FFB6C1
    style Mutated_Proteins fill:#90EE90
```

**What is VCF2Prot?**

VCF2Prot is a tool that translates DNA variants into their effects on protein sequences. It:

- Takes variants from VCF files
- Maps them to gene transcripts
- Predicts how variants change the protein sequence
- Outputs mutated protein sequences

**Process Flow:**

1. **Variant Annotation (BCFtools CSQ)**

   - Maps variants to genes and transcripts
   - Determines functional consequence:
     - Missense: Changes one amino acid
     - Nonsense: Creates premature stop codon
     - Frameshift: Shifts reading frame
     - Synonymous: No change to amino acid

2. **Protein Sequence Prediction (VCF2Prot)**

   - Loads reference protein sequences from MANE
   - Applies variants to generate mutated sequences
   - Handles complex variants (insertions, deletions)

**Example:**

```
Reference DNA:  ATG GCT AAA TGC
Reference RNA:  AUG GCU AAA UGC
Reference Prot: Met-Ala-Lys-Cys

Variant: Position 4, G→T

Mutant DNA:     ATG TCT AAA TGC
Mutant RNA:     AUG UCU AAA UGC
Mutant Prot:    Met-Ser-Lys-Cys
                    ^^^
                    Changed amino acid!
```
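
The consequence types listed above can be illustrated with a short Python sketch that classifies a single-base substitution in a coding sequence as synonymous, missense, or nonsense. This is a teaching example only, not how BCFtools CSQ or VCF2Prot are implemented; `classify_snv` is a hypothetical function, positions are 1-based within the coding sequence, and frameshifts and stop-loss variants are not handled.

```python
from itertools import product

# Standard genetic code keyed by DNA codon, enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds, pos, alt):
    """Classify a substitution at 1-based CDS position pos; returns (kind, ref_aa, alt_aa)."""
    index = pos - 1
    codon_start = index - index % 3
    offset = index - codon_start
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return kind, ref_aa, alt_aa


cds = "ATGGCTAAATGC"  # Met-Ala-Lys-Cys, the reference from the example above

print(classify_snv(cds, 4, "T"))  # GCT -> TCT: ('missense', 'A', 'S')
print(classify_snv(cds, 6, "C"))  # GCT -> GCC: ('synonymous', 'A', 'A')
print(classify_snv(cds, 7, "T"))  # AAA -> TAA: ('nonsense', 'K', '*')
```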

**Output Format: FASTA**

```
>Patient_001_ENST00000357654_BRCA1_p.G1738R
MSLQSQLFKQRQYLSIKTKRSTKEVLDATLIHQSITGLYETRIDLSQLGGD...
>Patient_001_ENST00000269305_TP53_p.R273H
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIE...
```

Each sequence shows:

- Patient ID
- Transcript ID
- Gene name
- Protein variant notation
- Full mutated protein sequence

### 4. RNA2ProteinExpression: RNA to Protein Prediction

```mermaid
flowchart TD
    RNA_TPM[RNA TPM Values<br/>from Borzoi] --> Model[Deep Learning Model]

    subgraph Neural Network
        Input[Input Layer:<br/>RNA Expression]
        Hidden1[Hidden Layer 1:<br/>Tissue Context]
        Hidden2[Hidden Layer 2:<br/>Translation Efficiency]
        Output[Output Layer:<br/>Protein Abundance]

        Input --> Hidden1
        Hidden1 --> Hidden2
        Hidden2 --> Output
    end

    Model --> Protein_Expr[Protein Expression<br/>Log2 Scale]

    style Model fill:#FFE4E1
    style Protein_Expr fill:#90EE90
```

**What is RNA2ProteinExpression?**

This is a custom deep learning model that predicts protein abundance from RNA expression levels. It's trained on:

- RNA-seq data (transcript levels)
- Mass spectrometry data (protein levels)
- Gene Ontology (GO) annotations

**Why is this needed?**

RNA levels don't perfectly correlate with protein levels because:

- **Translation efficiency** varies between genes
- **Protein stability** varies (some proteins are rapidly degraded)
- **Post-transcriptional regulation** (microRNAs, RNA-binding proteins)

Typical RNA-protein correlation: **r = 0.4-0.6** (not 1.0!)

**Model Architecture:**

The neural network learns:

- Which genes have high vs. low translation efficiency
- Tissue-specific effects on protein production
- GO term enrichments that affect protein stability

**Input:**

```
Gene_ID, Tissue, RNA_TPM
ENSG00000012048, Brain_Cortex, 125.3
ENSG00000012048, Liver, 78.2
```

**Output:**

```
Gene_ID, Tissue, Protein_Expression_log2
ENSG00000012048, Brain_Cortex, 8.5
ENSG00000012048, Liver, 7.2
```

**Log2 scale**: Protein expression in log2 transformed units (easier to interpret fold-changes)
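
To see why the log2 scale makes fold-changes easy to read, note that a difference of 1 unit corresponds to a doubling. The short sketch below applies this to the two example values above; `fold_change` is a hypothetical helper, not part of the model's code.

```python
import math


def fold_change(log2_a, log2_b):
    """Linear-scale ratio of two abundances given on the log2 scale."""
    return 2 ** (log2_a - log2_b)


brain_cortex, liver = 8.5, 7.2  # log2 values from the example output
ratio = fold_change(brain_cortex, liver)
print(f"Brain cortex vs. liver: {ratio:.2f}-fold")  # about 2.46-fold

# Going the other way: a linear abundance of 1024 is 10 on the log2 scale
print(math.log2(1024))  # 10.0
```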

### 5. CORTO: Metabolome Prediction

```mermaid
flowchart TD
    RNA_TPM[RNA TPM Matrix] --> CORTO[CORTO Algorithm]
    Regulon[Regulon Data:<br/>TF-Gene Relationships] --> CORTO

    subgraph CORTO Algorithm
        Correlation[1. Calculate Correlations:<br/>TF <-> Metabolic Genes]
        DPI[2. Data Processing Inequality:<br/>Remove Indirect Edges]
        Bootstrap[3. Bootstrap:<br/>Assess Robustness]
        MRA[4. Master Regulator Analysis:<br/>Identify Key TFs]

        Correlation --> DPI
        DPI --> Bootstrap
        Bootstrap --> MRA
    end

    CORTO --> Metabolome[Metabolome Predictions:<br/>Metabolic Activity]

    style CORTO fill:#F0E68C
    style Metabolome fill:#90EE90
```

**What is CORTO?**

CORTO (Correlation Tool) is an R package that infers gene regulatory networks and identifies master regulators controlling metabolic activity. It predicts:

- Activity of metabolic pathways
- Transcription factors (TFs) controlling metabolism
- Metabolite production levels

**How it works:**

1. **Input Regulon**: Pre-defined relationships between transcription factors (TFs) and their target genes (including metabolic enzymes)

2. **Correlation Analysis**: Calculate how TF expression correlates with target gene expression

3. **Data Processing Inequality (DPI)**: Remove indirect relationships

   - In each fully connected triplet, the weakest edge is treated as indirect and dropped
   - For example, if TF1 → TF2 → Gene, the weaker TF1 → Gene edge is removed
   - Keeps only direct regulatory relationships

4. **Bootstrap**: Test robustness by resampling data

5. **Master Regulator Analysis (MRA)**: Identify TFs whose target genes are significantly enriched in metabolic pathways
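
The DPI step is the least intuitive of these, so here is a minimal Python sketch of the idea: for every triangle of connected nodes, the edge with the lowest absolute correlation is treated as indirect and removed. This is a simplified illustration with invented correlation values, not the corto package's implementation, and `apply_dpi` is a hypothetical name.

```python
from itertools import combinations


def apply_dpi(edges):
    """Drop the weakest edge of every fully connected triplet.

    edges maps frozenset({node_a, node_b}) to an absolute correlation.
    """
    nodes = sorted({node for edge in edges for node in edge})
    indirect = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in edges for edge in triangle):
            indirect.add(min(triangle, key=lambda edge: edges[edge]))
    return {edge: weight for edge, weight in edges.items() if edge not in indirect}


# TF1 regulates TF2, TF2 regulates GeneX; TF1-GeneX is only an indirect echo
edges = {
    frozenset(("TF1", "TF2")): 0.9,
    frozenset(("TF2", "GeneX")): 0.8,
    frozenset(("TF1", "GeneX")): 0.5,
}

pruned = apply_dpi(edges)
for edge, weight in pruned.items():
    print(sorted(edge), weight)
```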

**Example:**

```
TF: PPARG (master regulator of fat metabolism)
Target Genes: FABP4, LPL, ADIPOQ, CD36, SCD (all involved in lipid metabolism)
Metabolome Prediction: High lipid synthesis activity
```

**Output:**

```csv
Pathway,Activity_Score,P_value
Glycolysis,-1.5,0.001
TCA_Cycle,2.3,0.0001
Fatty_Acid_Synthesis,1.8,0.002
```

- **Activity Score**: Positive = pathway activated, Negative = pathway suppressed
- **P-value**: Statistical significance

### 6. CIBERSORTx: Immune Cell Deconvolution

```mermaid
flowchart TD
    RNA_TPM[Bulk RNA TPM<br/>Mixed Cell Types]
    Signature[Signature Matrix:<br/>Cell-Specific Genes]

    subgraph CIBERSORTx
        Fractions[Step 1: Fractions<br/>Estimate Cell Proportions]
        HiRes[Step 2: HiRes<br/>Cell-Specific Expression]
    end

    RNA_TPM --> Fractions
    Signature --> Fractions

    Fractions --> Proportions[Cell Type Proportions]

    Proportions --> HiRes
    RNA_TPM --> HiRes
    Signature --> HiRes

    HiRes --> Cell_Specific[Cell-Type-Specific<br/>Gene Expression]

    style CIBERSORTx fill:#DDA0DD
    style Proportions fill:#90EE90
    style Cell_Specific fill:#90EE90
```

**What is CIBERSORTx?**

CIBERSORTx is a computational tool for **immune cell deconvolution**. When you sequence RNA from a tissue sample, you get a mixture of RNA from all cells in that tissue. CIBERSORTx:

- Estimates what proportion of cells are each immune cell type
- Infers cell-type-specific gene expression profiles

**Why is this important?**

Immune cells play crucial roles in:

- Fighting infections
- Cancer immunotherapy response
- Autoimmune diseases
- Inflammation

Understanding immune composition helps interpret disease mechanisms.

**How it works:**

**Step 1: Signature Matrix**

A reference matrix showing genes specifically expressed in each cell type:

```
Gene    T_cells  B_cells  Macrophages  NK_cells
CD3D    HIGH     low      low          low
CD19    low      HIGH     low          low
CD68    low      low      HIGH         low
NKG7    low      low      low          HIGH
```

**Step 2: CIBERSORTx Fractions**

Uses **ν-Support Vector Regression (ν-SVR)** to solve:

```
Bulk_Expression = Σ (Proportion_i × Signature_i)
```

Where:

- Bulk_Expression = measured RNA in tissue
- Proportion_i = fraction of cell type i
- Signature_i = expression pattern of cell type i
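
The mixture equation can be made concrete with a toy example. The sketch below builds a bulk profile from known proportions and then recovers approximate proportions with a simple marker-gene heuristic. To be clear, this heuristic is a stand-in chosen for brevity and is not the ν-SVR procedure CIBERSORTx uses; the signature values and the functions `mix` and `estimate_from_markers` are invented for illustration.

```python
def mix(signature, proportions):
    """Forward model: bulk expression per gene = sum_i proportion_i * signature_i."""
    return {
        gene: sum(p * s for p, s in zip(proportions, profile))
        for gene, profile in signature.items()
    }


def estimate_from_markers(bulk, signature, markers):
    """Crude estimate: scale each marker by its own cell type's level, then normalize."""
    raw = [bulk[gene] / signature[gene][i] for i, gene in enumerate(markers)]
    total = sum(raw)
    return [value / total for value in raw]


cell_types = ["T_cells", "B_cells", "Macrophages"]
# Rows: genes; columns follow cell_types (HIGH = 100, low = 1)
signature = {
    "CD3D": [100.0, 1.0, 1.0],
    "CD19": [1.0, 100.0, 1.0],
    "CD68": [1.0, 1.0, 100.0],
}

true_proportions = [0.5, 0.3, 0.2]
bulk = mix(signature, true_proportions)
estimated = estimate_from_markers(bulk, signature, ["CD3D", "CD19", "CD68"])

for name, value in zip(cell_types, estimated):
    print(name, round(value, 3))
```

The estimates land close to, but not exactly on, the true proportions because the marker genes are not perfectly specific. Regression over many genes at once, as CIBERSORTx does, is what handles that overlap properly.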

**Step 3: CIBERSORTx HiRes**

After knowing proportions, infer gene expression within each cell type by:

- Modeling tissue expression as weighted sum of cell-type contributions
- Deconvolving to separate cell-type-specific signals

**Example Output:**

**Fractions:**

```csv
Sample,CD8_T_cells,CD4_T_cells,B_cells,NK_cells,Monocytes
Patient_001_Brain,0.05,0.08,0.02,0.01,0.15
Patient_001_Liver,0.12,0.15,0.08,0.03,0.22
```

**HiRes:**

```csv
Tissue,Cell_Type,CD3D,CD19,CD68
Brain_Patient_001,CD8_T_cells,HIGH,low,low
Brain_Patient_001,B_cells,low,HIGH,low
```

**Pipeline Implementation:**

1. **CONVERT_TO_TXT**: Convert CSV to tab-delimited format (CIBERSORTx input format)

2. **CIBERSORTx_FRACTIONS**: Estimate cell proportions

3. **CIBERSORTx_HIRES**: Infer cell-specific expression

4. **ADD_TISSUE_NAMES**: Add tissue annotations to output
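
The CONVERT_TO_TXT step amounts to a delimiter change. A minimal sketch with Python's standard `csv` module is shown below; the function name `csv_to_tsv` and the sample rows are assumptions for illustration, and the pipeline's actual conversion script may differ.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy rows from a comma-separated stream to a tab-separated stream."""
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# In-memory streams stand in for the TPM CSV and the CIBERSORTx input file
source = io.StringIO("Gene,Brain_Cortex,Liver\nTP53,52.1,42.3\nAPOE,125.3,78.2\n")
target = io.StringIO()
csv_to_tsv(source, target)
print(target.getvalue())
```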

---

## Workflow Execution

### Nextflow: Workflow Orchestration

```mermaid
flowchart TD
    Config[nextflow.config:<br/>Configuration] --> NF[Nextflow Engine]
    Params[params.json:<br/>Parameters] --> NF

    subgraph Nextflow Engine
        Parse[Parse Workflow DSL]
        Schedule[Schedule Processes]
        Execute[Execute in Docker/Singularity]
        Monitor[Monitor & Checkpoint]

        Parse --> Schedule
        Schedule --> Execute
        Execute --> Monitor
    end

    NF --> Channels[Data Channels:<br/>Pass Files Between Processes]
    Channels --> Processes[Execute Processes]

    style NF fill:#87CEEB
```

**What is Nextflow?**

Nextflow is a workflow orchestration system specifically designed for data-intensive computational pipelines. It:

- Manages dependencies between analysis steps
- Handles parallel execution
- Provides automatic checkpointing (resume failed runs)
- Supports multiple execution platforms (local, HPC clusters, cloud)

**Key Concepts:**

1. **Processes**: Individual computational tasks (e.g., "PREDICT_EXPRESSION")
2. **Channels**: Data streams that connect processes
3. **Operators**: Manipulate channels (e.g., `mix`, `flatten`, `collect`)

**Example Process Definition:**

```nextflow
process PREDICT_EXPRESSION {
    container "${params.container_borzoi}"  // Docker image
    memory 4.GB                             // Memory requirement
    accelerator 1                           // GPU requirement

    input:
    path vcf_filtered  // Input file
    path MANE          // Reference data

    output:
    path "*_TPM.csv"   // Output file pattern

    script:
    """
    #!/opt/conda/envs/borzoi/bin/python
    # Python script here
    """
}
```

**Channel Example:**

```nextflow
// Mix male and female patient VCFs
txt_ch = f_var.mix(m_var).flatten()

// This creates a channel with all VCF files:
// [Patient_001.vcf, Patient_002.vcf, ...]
```

### Complete Workflow

```mermaid
flowchart TD
    Start([Start]) --> CheckDisease{Disease or Healthy?}

    CheckDisease -->|Disease| GetStats[get_disease_stats_no_patients:<br/>Analyze UK Biobank]
    CheckDisease -->|Healthy| LoadHealthy[Load Pre-computed<br/>Healthy Genomes]

    GetStats --> GenM[generate_m_variants_cudf:<br/>Male Patients]
    GetStats --> GenF[generate_f_variants_cudf:<br/>Female Patients]

    LoadHealthy --> LoadM[Load Male<br/>Reference]
    LoadHealthy --> LoadF[Load Female<br/>Reference]

    GenM --> MakeVCF[make_vcfs:<br/>Generate VCF Files]
    GenF --> MakeVCF
    LoadM --> MakeVCF
    LoadF --> MakeVCF

    MakeVCF --> FilterVCF[FILTER_VCF:<br/>Extract Coding Variants]
    MakeVCF --> VCF2Prot[VCF2PROT:<br/>Generate Mutated Proteins]

    FilterVCF --> PredictExpr[PREDICT_EXPRESSION:<br/>Borzoi RNA Prediction]

    PredictExpr --> RNA2Prot[RNA2PROTEXPRESSION:<br/>Protein Prediction]
    PredictExpr --> CORTO[CORTO:<br/>Metabolome Prediction]
    PredictExpr --> Convert[CONVERT_TO_TXT:<br/>Format Conversion]

    Convert --> CiberFrac[CIBERSORTx_FRACTIONS:<br/>Cell Proportions]
    CiberFrac --> CiberHires[CIBERSORTx_HIRES:<br/>Cell-Specific Expression]
    CiberHires --> AddTissue[ADD_TISSUE_NAMES_TO_CIBERSORTX:<br/>Annotate Results]

    RNA2Prot --> End([Complete<br/>Digital Patient])
    CORTO --> End
    AddTissue --> End
    VCF2Prot --> End

    style Start fill:#90EE90
    style End fill:#FFB6C1
    style PredictExpr fill:#87CEEB
```

### Execution Example

**1. Configuration (params.json)**

```json
{
  "disease": "schizophrenia",
  "n_pat": 10,
  "percent_male": 0.5,
  "container_borzoi": "harbor.cluster.omic.ai/omic/digital-patients/borzoi:latest"
}
```

**2. Launch Pipeline**

```bash
nextflow run test.nf -params-file params.json
```

**3. Nextflow Execution**

```
N E X T F L O W  ~  version 21.04.0
Launching `test.nf` [amazing_babbage] - revision: 1a2b3c4d

[Synthea] Submitted process > get_disease_stats_no_patients
[Synthea] Submitted process > generate_m_variants_cudf (1)
[Synthea] Submitted process > generate_f_variants_cudf (1)
[Stage] Completed process > make_vcfs (10 files)
[Borzoi] Submitted process > FILTER_VCF (10)
[Borzoi] Submitted process > PREDICT_EXPRESSION (10)
...
Pipeline completed successfully!
```

**4. Directory Structure**

```
/outdir/
├── vcf2prot/
│   ├── Patient_001_transcript_id_mutations.fasta
│   └── Patient_002_transcript_id_mutations.fasta
├── borzoi/
│   ├── Patient_001_TPM.csv
│   └── Patient_002_TPM.csv
├── rna2protein/
│   ├── Patient_001_Protein_Expression_log2.csv
│   └── Patient_002_Protein_Expression_log2.csv
├── corto/
│   ├── Patient_001_metabolome.csv
│   └── Patient_002_metabolome.csv
└── ecotyper/
    ├── fractions/
    │   └── Patient_001_CIBERSORTx_Results.txt
    └── hires/
        └── Patient_001_immune_cells.csv
```
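
Because every output file name starts with the patient ID, gathering one patient's results from this layout is a matter of a recursive glob. The sketch below shows one way to do it; `collect_patient_outputs` is a hypothetical helper, and the demo builds a throwaway directory instead of reading a real `/outdir/`.

```python
import tempfile
from pathlib import Path


def collect_patient_outputs(outdir, patient_id):
    """Return sorted relative paths of all files whose name starts with '<patient_id>_'."""
    root = Path(outdir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob(f"{patient_id}_*")
        if path.is_file()
    )


# Demo on a temporary directory that mimics part of the layout above
with tempfile.TemporaryDirectory() as tmp:
    for relative in (
        "borzoi/Patient_001_TPM.csv",
        "corto/Patient_001_metabolome.csv",
        "borzoi/Patient_002_TPM.csv",
    ):
        file_path = Path(tmp, relative)
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()
    found = collect_patient_outputs(tmp, "Patient_001")

print(found)
```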

---

## Technical Architecture

### Docker Containers

Each pipeline component runs in an isolated Docker container with specific dependencies:

```mermaid
flowchart LR
    subgraph Docker Images
        Synthea_Img[Synthea Container:<br/>- Java JDK<br/>- Python<br/>- BCFtools<br/>- GATK]

        Borzoi_Img[Borzoi Container:<br/>- TensorFlow<br/>- PyTorch<br/>- Baskerville<br/>- Python packages]

        VCF2Prot_Img[VCF2Prot Container:<br/>- BCFtools<br/>- vcf2prot binary<br/>- Reference genomes]

        RNA2Prot_Img[RNA2Protein Container:<br/>- PyTorch<br/>- Deep learning model<br/>- GO annotations]

        CORTO_Img[CORTO Container:<br/>- R<br/>- corto package<br/>- Regulon data]

        CIBERSORTx_Img[CIBERSORTx Container:<br/>- Python<br/>- R<br/>- CIBERSORTx binaries<br/>- Signature matrices]
    end

    Registry[Container Registry:<br/>harbor.cluster.omic.ai] --> Synthea_Img
    Registry --> Borzoi_Img
    Registry --> VCF2Prot_Img
    Registry --> RNA2Prot_Img
    Registry --> CORTO_Img
    Registry --> CIBERSORTx_Img
```

**Why Docker?**

- **Reproducibility**: Same environment every run
- **Isolation**: Avoid dependency conflicts
- **Portability**: Run anywhere (laptop, cluster, cloud)

**Example Dockerfile (Borzoi):**

```dockerfile
FROM tensorflow/tensorflow:2.12.0-gpu

# Tools needed by the build steps below
RUN apt-get update && apt-get install -y --no-install-recommends wget git \
    && rm -rf /var/lib/apt/lists/*

# Install conda and put it on PATH
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda \
    && rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/opt/conda/bin:$PATH

# Create borzoi environment (-y skips the interactive prompt)
RUN conda create -y -n borzoi python=3.9
RUN conda install -y -n borzoi tensorflow-gpu pandas numpy pysam

# Install Borzoi (built on the baskerville framework) into the borzoi environment
RUN git clone https://github.com/calico/borzoi.git /home/omic/borzoi
RUN /opt/conda/envs/borzoi/bin/pip install -e /home/omic/borzoi

# Download pre-trained models (<model_url> is a placeholder)
RUN mkdir -p /home/omic/borzoi/saved_models/f0
RUN wget "<model_url>" -O /home/omic/borzoi/saved_models/f0/model0_best.h5

# Default command
CMD ["/bin/bash"]
```

### Data Flow Architecture
|
||
|
||
```mermaid
|
||
flowchart TD
|
||
subgraph Input Data
|
||
UKBB[UK Biobank:<br/>Genetic Variants]
|
||
MANE[MANE Transcripts:<br/>Reference Sequences]
|
||
RefGenome[Reference Genome:<br/>GRCh38]
|
||
Regulon[Regulon Database:<br/>TF-Gene Networks]
|
||
LM22[LM22 Signature Matrix:<br/>Immune Cell Markers]
|
||
end
|
||
|
||
subgraph Processing
|
||
Synthea --> |VCF| Filter
|
||
Filter --> |Filtered VCF| Borzoi
|
||
Filter --> |Filtered VCF| VCF2Prot
|
||
Borzoi --> |TPM CSV| RNA2Prot
|
||
Borzoi --> |TPM CSV| CORTO
|
||
Borzoi --> |TPM CSV| CIBERSORTx
|
||
end
|
||
|
||
UKBB --> Synthea
|
||
MANE --> Borzoi
|
||
MANE --> VCF2Prot
|
||
RefGenome --> VCF2Prot
|
||
Regulon --> CORTO
|
||
LM22 --> CIBERSORTx
|
||
|
||
subgraph Output Data
|
||
RNA2Prot --> |Protein Expression| Results
|
||
CORTO --> |Metabolome| Results
|
||
CIBERSORTx --> |Immune Cells| Results
|
||
VCF2Prot --> |Mutated Proteins| Results
|
||
end
|
||
```

### Computational Requirements

| Process            | CPU     | RAM  | GPU      | Time (per patient) |
| ------------------ | ------- | ---- | -------- | ------------------ |
| Synthea            | 2 cores | 4 GB | No       | ~5 min             |
| FILTER_VCF         | 4 cores | 4 GB | No       | ~2 min             |
| PREDICT_EXPRESSION | 8 cores | 4 GB | Yes (1x) | ~30-60 min         |
| VCF2PROT           | 2 cores | 2 GB | No       | ~10 min            |
| RNA2PROTEXPRESSION | 4 cores | 2 GB | Yes (1x) | ~5 min             |
| CORTO              | 2 cores | 1 GB | No       | ~3 min             |
| CIBERSORTx         | 4 cores | 4 GB | No       | ~15 min            |

**Total time per patient: ~70-100 minutes if every step runs one after another (less when independent steps run in parallel)**
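
The per-patient total can be checked against the table with a short calculation. The sketch below is illustrative only: the step times are copied from the table, while the dependency map is an assumption read off the data flow diagram (with `Filter`, `Borzoi`, and `RNA2Prot` taken to correspond to `FILTER_VCF`, `PREDICT_EXPRESSION`, and `RNA2PROTEXPRESSION`). It compares running every step back to back with the longest dependency chain, which is the lower bound when independent steps run concurrently.

```python
# Per-step time estimates in minutes as (low, high), copied from the table.
STEP_MINUTES = {
    "Synthea": (5, 5),
    "FILTER_VCF": (2, 2),
    "PREDICT_EXPRESSION": (30, 60),
    "VCF2PROT": (10, 10),
    "RNA2PROTEXPRESSION": (5, 5),
    "CORTO": (3, 3),
    "CIBERSORTx": (15, 15),
}

# Assumed upstream dependencies, following the data flow diagram.
DEPENDS_ON = {
    "Synthea": [],
    "FILTER_VCF": ["Synthea"],
    "PREDICT_EXPRESSION": ["FILTER_VCF"],
    "VCF2PROT": ["FILTER_VCF"],
    "RNA2PROTEXPRESSION": ["PREDICT_EXPRESSION"],
    "CORTO": ["PREDICT_EXPRESSION"],
    "CIBERSORTx": ["PREDICT_EXPRESSION"],
}


def sequential_minutes(bound):
    """Total time if all steps run one after another (bound: 0=low, 1=high)."""
    return sum(times[bound] for times in STEP_MINUTES.values())


def critical_path_minutes(bound):
    """Time of the longest dependency chain, i.e. with ideal step parallelism."""
    finish = {}

    def finish_time(step):
        if step not in finish:
            start = max((finish_time(dep) for dep in DEPENDS_ON[step]), default=0)
            finish[step] = start + STEP_MINUTES[step][bound]
        return finish[step]

    return max(finish_time(step) for step in STEP_MINUTES)


if __name__ == "__main__":
    print("sequential:", sequential_minutes(0), "-", sequential_minutes(1), "min")
    print("critical path:", critical_path_minutes(0), "-", critical_path_minutes(1), "min")
```

Under these assumptions the sequential sum is 70-100 minutes and the critical path (Synthea, FILTER_VCF, PREDICT_EXPRESSION, CIBERSORTx) is 52-82 minutes. Actual times depend on hardware and scheduling.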

**Parallelization:**

Nextflow automatically parallelizes patient processing:

- 10 patients with 4 GPUs → GPU-bound steps run in about 3 waves (4 + 4 + 2), so total wall-clock time is roughly 3x the per-patient time rather than 10x
- Patients are processed independently
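
How GPU count limits throughput can be estimated with a small helper. This is a back-of-the-envelope sketch, not part of the pipeline: it assumes each GPU task occupies exactly one GPU, that tasks take about the same time, and it ignores CPU-only steps and scheduler overhead. The function names are hypothetical.

```python
import math


def gpu_waves(n_patients, n_gpus):
    """Number of sequential batches needed when each task holds one GPU."""
    if n_patients < 0 or n_gpus < 1:
        raise ValueError("need n_patients >= 0 and n_gpus >= 1")
    return math.ceil(n_patients / n_gpus)


def gpu_stage_minutes(n_patients, n_gpus, minutes_per_task):
    """Rough wall-clock time for one GPU-bound step across all patients."""
    return gpu_waves(n_patients, n_gpus) * minutes_per_task


if __name__ == "__main__":
    # Example: 10 patients on 4 GPUs need ceil(10 / 4) batches.
    print(gpu_waves(10, 4))
```

Adding GPUs helps only until there is one per patient; beyond that, the per-patient time of the slowest step sets the floor.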

---

## Outputs and Applications

### Complete Digital Patient Profile

```mermaid
flowchart LR
Patient[Digital Patient:<br/>Patient_001] --> Genome[Genomic Profile]
Patient --> Transcriptome[Transcriptomic Profile]
Patient --> Proteome[Proteomic Profile]
Patient --> Metabolome[Metabolomic Profile]
Patient --> Immune[Immune Profile]

Genome --> G1[Genetic Variants:<br/>SNPs, Indels]
Genome --> G2[Disease-Associated<br/>Mutations]

Transcriptome --> T1[RNA Expression:<br/>20,000+ genes]
Transcriptome --> T2[Tissue-Specific<br/>Expression]

Proteome --> P1[Protein Abundance:<br/>10,000+ proteins]
Proteome --> P2[Mutated Protein<br/>Sequences]

Metabolome --> M1[Pathway Activity:<br/>Metabolism]
Metabolome --> M2[Master Regulator<br/>TFs]

Immune --> I1[Cell Composition:<br/>T-cells, B-cells, etc.]
Immune --> I2[Cell-Specific<br/>Expression]

style Patient fill:#FFD700
```

### Example Application: Cancer Research

```mermaid
flowchart TD
Question[Research Question:<br/>How do BRCA1 mutations<br/>affect breast cancer?]

Question --> Generate[Generate 100 synthetic<br/>patients with BRCA1<br/>mutations]

Generate --> Analyze{Analyze Digital Patients}

Analyze --> RNA[RNA Analysis:<br/>Find genes<br/>co-expressed with BRCA1]
Analyze --> Protein[Protein Analysis:<br/>Identify altered<br/>pathways]
Analyze --> Metabolome[Metabolome Analysis:<br/>Detect metabolic<br/>shifts]
Analyze --> Immune[Immune Analysis:<br/>Characterize immune<br/>infiltration]

RNA --> Insight[Insights into<br/>Disease Mechanisms]
Protein --> Insight
Metabolome --> Insight
Immune --> Insight

Insight --> Drug[Drug Target<br/>Discovery]
Insight --> Biomarker[Biomarker<br/>Identification]

style Question fill:#FFB6C1
style Insight fill:#90EE90
```

### Output Files Summary

**1. Genetic Variants (VCF)**

- **File**: `Patient_001_variants.vcf`
- **Content**: All genetic variants for patient
- **Use**: Understand genetic basis of disease

**2. RNA Expression (TPM)**

- **File**: `Patient_001_TPM.csv`
- **Content**: Gene expression across 89 tissues
- **Use**: Identify dysregulated genes, tissue-specific effects

**3. Protein Expression**

- **File**: `Patient_001_Protein_Expression_log2.csv`
- **Content**: Predicted protein abundance
- **Use**: Understand functional consequences of RNA changes

**4. Mutated Proteins (FASTA)**

- **File**: `Patient_001_transcript_id_mutations.fasta`
- **Content**: Protein sequences with mutations
- **Use**: Study structural changes, predict drug binding

**5. Metabolome**

- **File**: `Patient_001_metabolome.csv`
- **Content**: Pathway activity scores
- **Use**: Understand metabolic reprogramming

**6. Immune Cells**

- **File**: `Patient_001_immune_cells.csv`
- **Content**: Cell type proportions and expression
- **Use**: Characterize immune microenvironment
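
A quick completeness check for one patient's results can be built from the file names listed above. The sketch below is a hypothetical helper, not part of the pipeline: it assumes the six files follow the `<patient_id><suffix>` pattern suggested by the `Patient_001` examples and that they sit together in a single results directory, which may not match the actual output layout.

```python
from pathlib import Path

# File-name suffixes taken from the Patient_001 examples in this section.
OUTPUT_SUFFIXES = {
    "variants": "_variants.vcf",
    "rna_tpm": "_TPM.csv",
    "protein_expression": "_Protein_Expression_log2.csv",
    "mutated_proteins": "_transcript_id_mutations.fasta",
    "metabolome": "_metabolome.csv",
    "immune_cells": "_immune_cells.csv",
}


def expected_outputs(patient_id, results_dir):
    """Map each output type to the path where it is expected (assumed flat layout)."""
    base = Path(results_dir)
    return {
        kind: base / f"{patient_id}{suffix}"
        for kind, suffix in OUTPUT_SUFFIXES.items()
    }


def missing_outputs(patient_id, results_dir):
    """Return the output types whose files are not present, sorted by name."""
    paths = expected_outputs(patient_id, results_dir)
    return sorted(kind for kind, path in paths.items() if not path.is_file())


if __name__ == "__main__":
    # "results" is a placeholder directory name.
    for kind in missing_outputs("Patient_001", "results"):
        print("missing:", kind)
```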

### Research Applications

1. **Disease Mechanism Discovery**

   - Generate patients with specific mutations
   - Compare to healthy controls
   - Identify molecular changes caused by mutations

2. **Drug Target Identification**

   - Find genes/proteins consistently altered across patients
   - Prioritize targets for therapeutic intervention

3. **Biomarker Discovery**

   - Identify molecular signatures distinguishing diseased from healthy
   - Develop diagnostic tests

4. **Precision Medicine**

   - Model individual patient molecular profiles
   - Predict treatment response
   - Personalize therapy

5. **Clinical Trial Simulation**

   - Generate virtual patient cohorts
   - Test hypotheses before expensive trials
   - Power calculations and study design

6. **Education and Training**

   - Teach students about multi-omics analysis
   - No patient privacy concerns
   - Unlimited data generation
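
The "compare to healthy controls" step in the first application can be illustrated with a simple log2 fold-change calculation. This is a minimal sketch with toy values, not the pipeline's own analysis: the input layout (gene name mapped to per-patient TPM values), the pseudocount, and the threshold are all illustrative assumptions, and a real comparison would also need a statistical test.

```python
import math
from statistics import mean


def log2_fold_changes(disease, control, pseudocount=1.0):
    """Log2 ratio of mean disease TPM to mean control TPM for shared genes.

    disease / control: dict mapping gene name -> list of TPM values,
    one value per patient. The pseudocount avoids division by zero.
    """
    shared = disease.keys() & control.keys()
    return {
        gene: math.log2(
            (mean(disease[gene]) + pseudocount) / (mean(control[gene]) + pseudocount)
        )
        for gene in shared
    }


def altered_genes(fold_changes, threshold=1.0):
    """Genes whose absolute log2 fold change meets the threshold."""
    return sorted(g for g, lfc in fold_changes.items() if abs(lfc) >= threshold)


if __name__ == "__main__":
    # Toy values for two placeholder genes, two patients per group.
    disease = {"GENE_A": [7.0, 7.0], "GENE_B": [3.0, 3.0]}
    control = {"GENE_A": [1.0, 1.0], "GENE_B": [3.0, 3.0]}
    print(altered_genes(log2_fold_changes(disease, control)))
```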

---

## Conclusion

This Digital Patient Pipeline integrates:

- **Synthetic data generation**: Realistic patient simulation
- **Deep learning**: RNA expression prediction from DNA
- **Bioinformatics**: Multi-omic data integration
- **Workflow engineering**: Scalable, reproducible pipelines

By combining these technologies, researchers can:

- Study disease mechanisms without accessing sensitive patient data
- Generate unlimited data for hypothesis testing
- Model personalized molecular phenotypes
- Accelerate drug discovery and precision medicine

The pipeline produces comprehensive molecular profiles spanning:

- **Genomics** (DNA variants)
- **Transcriptomics** (RNA expression)
- **Proteomics** (protein abundance)
- **Metabolomics** (metabolic activity)
- **Immunomics** (immune cell composition)

This multi-omic integration lets researchers trace how genetic variants are predicted to cascade through biological systems to produce disease phenotypes.

---

## Glossary of Terms

**Bioinformatics Terms:**

- **TPM**: Transcripts Per Million - normalized measure of RNA abundance
- **VCF**: Variant Call Format - standard format for genetic variants
- **FASTA**: Text format for representing DNA/protein sequences
- **SNP**: Single Nucleotide Polymorphism - single-base DNA variant
- **Indel**: Insertion or Deletion - adding or removing DNA bases
- **Exon**: Segment of a gene retained in the mature RNA (includes the protein-coding sequence)
- **Intron**: Non-coding segment removed during RNA processing
- **Transcription**: DNA → RNA
- **Translation**: RNA → Protein
- **Gene Expression**: Process of producing RNA, and usually protein, from a gene
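
The TPM definition can be made concrete with a small calculation. The sketch below implements the standard textbook formula from read counts and transcript lengths; it only illustrates the unit and is not how this pipeline derives TPM from Borzoi predictions. The function name and the toy inputs are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Compute Transcripts Per Million from read counts and lengths.

    counts: dict mapping gene -> read count
    lengths_kb: dict mapping gene -> transcript length in kilobases
    Each gene's count is divided by its length, then the rates are
    scaled so that they sum to one million.
    """
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all length-normalized rates are zero")
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


if __name__ == "__main__":
    # Toy example: GENE_B has twice the reads but is twice as long,
    # so both genes receive the same TPM.
    values = tpm({"GENE_A": 10, "GENE_B": 20}, {"GENE_A": 1.0, "GENE_B": 2.0})
    print(values)
```

Because the values always sum to one million within a sample, TPM expresses relative rather than absolute abundance.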

**Machine Learning Terms:**

- **Neural Network**: Computer model inspired by brain structure
- **Convolutional Layer**: Detects local patterns in sequences
- **Attention Layer**: Learns long-range relationships
- **Training**: Teaching model from example data
- **Prediction**: Using model on new data

**Pipeline Terms:**

- **Process**: Individual computational step
- **Channel**: Data stream connecting processes
- **Container**: Isolated environment with dependencies
- **Workflow**: Series of connected processes
- **Nextflow**: Workflow orchestration system

**Biological Terms:**

- **Genome**: Complete set of DNA in organism
- **Transcriptome**: Complete set of RNA molecules
- **Proteome**: Complete set of proteins
- **Metabolome**: Complete set of metabolites
- **Phenotype**: Observable characteristics of organism

---

## References

### Pipeline Components

1. **Nextflow**: Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. *Nature Biotechnology*, 35:316-319.

2. **Synthea**: Walonoski et al. (2018). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. *JAMIA*, 25(3):230-238.

3. **Borzoi**: Linder et al. (2025). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. *Nature Genetics*, 57:949-961.

4. **CIBERSORTx**: Newman et al. (2019). Determining cell-type abundance and expression from bulk tissues with digital cytometry. *Nature Biotechnology*, 37:773-782.

5. **CORTO**: Mercatelli et al. (2020). corto: a lightweight R package for gene network inference and master regulator analysis. *Bioinformatics*, 36(12):3916-3917.

### Biological Background

6. **Gene Structure**: Lim et al. (2018). The exon-intron gene structure upstream of the initiation codon. *Nucleic Acids Research*, 46(5):2232-2244.

7. **TPM Normalization**: Zhao et al. (2020). Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. *RNA*, 26(8):903-909.

8. **VCF Format**: Danecek et al. (2011). The variant call format and VCFtools. *Bioinformatics*, 27(15):2156-2158.

9. **MANE**: Morales et al. (2022). A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. *Nature*, 604:310-315.

---

*This documentation was created to give researchers without an extensive biological background a comprehensive understanding of a complex bioinformatics pipeline. For questions or clarifications, please consult the cited references or contact the pipeline maintainers.*