A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.

Background

The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:

Translating core functionality to Python
Adding support for metabolomics data
Implementing memory-efficient processing for large datasets
Adding parallel processing capabilities
Providing a robust command-line interface

Code Translation Overview

Original R Components:

The project translates code from several R source files:

corto.R: Core algorithm implementation
functions.R: Utility functions and statistical analysis
mra.R: Master Regulator Analysis functionality
gsea.R: Gene Set Enrichment Analysis components

Python Implementation:

The functionality has been reorganized into two main Python scripts:

corto-data-prep-final.py:

Data loading and validation
Preprocessing pipeline
CNV correction
Quality control metrics

corto-matrix-combination-final.py:

Network analysis implementation
Correlation calculations
Bootstrap analysis
Results generation

Installation

# Clone the repository and enter it
git clone https://github.com/yourusername/corto-metabolomics.git
cd corto-metabolomics

# Install required packages
pip install -r requirements.txt

Usage

Data Preparation

python corto-data-prep-final.py \
    --metabolomics_file data/metabolomics.csv \
    --expression_file data/expression.txt \
    --cnv_file data/cnv.csv \
    --normalization standard \
    --outlier_detection zscore \
    --imputation knn
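
The `--normalization standard` and `--outlier_detection zscore` options are not described in detail here. The sketch below shows what steps with these names commonly do, assuming a feature-by-sample matrix (features in rows). The function names and the cutoff of 3.0 are illustrative assumptions, not the script's actual code.

```python
import numpy as np


def standardize_rows(matrix):
    """Scale each feature (row) to zero mean and unit variance, ignoring NaNs."""
    mean = np.nanmean(matrix, axis=1, keepdims=True)
    std = np.nanstd(matrix, axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # constant features are centered but not rescaled
    return (matrix - mean) / std


def mask_zscore_outliers(matrix, threshold=3.0):
    """Return a float copy in which values with |z| > threshold become NaN.

    The threshold of 3.0 is an assumed default, not taken from the script.
    """
    z = standardize_rows(matrix)
    cleaned = matrix.astype(float).copy()
    cleaned[np.abs(z) > threshold] = np.nan
    return cleaned
```

Values masked this way would then be candidates for the imputation step (for example a k-nearest-neighbours imputer), which is not shown here.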

Network Analysis

python corto-matrix-combination-final.py \
    --mode corto \
    --expression_file prepared_expression.csv \
    --metabolomics_file prepared_metabolomics.csv \
    --p_threshold 1e-30 \
    --nbootstraps 100 \
    --nthreads 4 \
    --verbose
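
A value such as `--p_threshold 1e-30` is easier to reason about as a correlation cutoff. The sketch below assumes the threshold is a two-sided p-value on a correlation coefficient and converts it with the Fisher z-transformation. This is a standard normal approximation used here for illustration, not necessarily the conversion the script performs, and `correlation_cutoff` is a hypothetical helper name.

```python
from math import sqrt, tanh
from statistics import NormalDist


def correlation_cutoff(p_threshold, n_samples):
    """Approximate the |r| needed for a two-sided p-value below p_threshold.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as
    standard normal under the null hypothesis of no correlation.
    """
    if n_samples <= 3:
        raise ValueError("the approximation needs more than 3 samples")
    z_crit = -NormalDist().inv_cdf(p_threshold / 2.0)
    return tanh(z_crit / sqrt(n_samples - 3))


if __name__ == "__main__":
    # The cutoff rises as the p-value shrinks and falls as samples are added.
    for n in (50, 100, 500):
        print(n, round(correlation_cutoff(1e-30, n), 3))
```

Because the cutoff depends on the sample count, the same p-value threshold can be very strict for a small cohort and permissive for a large one.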

Key Features

Data Preprocessing

Zero-variance feature removal
CNV correction
Outlier detection
Missing value imputation
Sample alignment
Quality control metrics
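
Three of the steps above can be sketched with NumPy, assuming feature-by-sample matrices and sample IDs held in plain Python lists. All function names are hypothetical. The CNV step uses per-gene linear-regression residuals, which is one common approach and is assumed here rather than taken from the script.

```python
import numpy as np


def drop_zero_variance(matrix, feature_ids):
    """Remove features (rows) whose values are constant across samples."""
    keep = np.nanvar(matrix, axis=1) > 0
    return matrix[keep], [f for f, k in zip(feature_ids, keep) if k]


def align_samples(matrix_a, samples_a, matrix_b, samples_b):
    """Restrict two matrices to their shared samples, in the same order."""
    shared = sorted(set(samples_a) & set(samples_b))
    idx_a = [samples_a.index(s) for s in shared]
    idx_b = [samples_b.index(s) for s in shared]
    return matrix_a[:, idx_a], matrix_b[:, idx_b], shared


def regress_out_cnv(expression, cnv):
    """Replace each gene's expression by the residuals of a linear fit on its CNV.

    Assumes both matrices have the same genes (rows) and samples (columns).
    Genes with constant CNV are left unchanged.
    """
    corrected = np.array(expression, dtype=float)
    for i in range(corrected.shape[0]):
        if np.var(cnv[i]) == 0:
            continue  # nothing to regress against
        slope, intercept = np.polyfit(cnv[i], corrected[i], 1)
        corrected[i] = corrected[i] - (slope * cnv[i] + intercept)
    return corrected
```

Aligning samples before any correlation step matters because a column-order mismatch between the two matrices silently pairs measurements from different samples.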

Network Analysis

Two analysis modes:
- 'corto': Original approach keeping matrices separate
- 'combined': Matrix combination approach for higher-order relationships
Parallel processing for bootstraps
Memory-efficient chunked processing
Comprehensive result reporting
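
The chunked and parallel bootstrap processing listed above could look roughly like the sketch below. It assumes Pearson correlation between gene and metabolite rows, sample-aligned matrices with zero-variance rows already removed, and hypothetical function names. The real scripts may organize this differently.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def _unit_rows(matrix):
    """Center each row and scale it to unit length, so dot products equal Pearson r."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Constant rows give NaN here and never pass the cutoff below.
        return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def chunked_edges(expression, metabolites, r_cutoff, chunk_size=1000):
    """Yield (gene_idx, metabolite_idx, r) for pairs with |r| >= r_cutoff.

    Genes are processed chunk_size rows at a time, so the full
    gene-by-metabolite correlation matrix is never held in memory.
    """
    met_unit = _unit_rows(metabolites)
    for start in range(0, expression.shape[0], chunk_size):
        corr = _unit_rows(expression[start:start + chunk_size]) @ met_unit.T
        rows, cols = np.nonzero(np.abs(corr) >= r_cutoff)
        for i, j in zip(rows, cols):
            yield int(start + i), int(j), float(corr[i, j])


def bootstrap_support(expression, metabolites, r_cutoff,
                      n_bootstraps=100, n_threads=4, seed=0):
    """Count how often each edge is recovered when samples are resampled."""
    n_samples = expression.shape[1]
    # Independent child seeds keep the rounds reproducible across threads.
    child_seeds = np.random.SeedSequence(seed).spawn(n_bootstraps)

    def one_round(child_seed):
        rng = np.random.default_rng(child_seed)
        cols = rng.integers(0, n_samples, size=n_samples)  # with replacement
        return {(g, m) for g, m, _ in
                chunked_edges(expression[:, cols], metabolites[:, cols], r_cutoff)}

    counts = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for edges in pool.map(one_round, child_seeds):
            for edge in edges:
                counts[edge] = counts.get(edge, 0) + 1
    return counts
```

Threads are used here because NumPy releases the interpreter lock during large matrix products. A process pool is an alternative when the per-round work is dominated by Python-level loops.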

Output Files

The pipeline generates several output files:

Preprocessed Data:

prepared_metabolomics.csv
prepared_expression.csv
prepared_metrics.txt

Network Analysis:

corto_network_{mode}.csv: Network edges and statistics
corto_regulon_{mode}.txt: Regulon object with relationship details