Corto Metabolomics Analysis Pipeline

A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.

Background

The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:

  1. Translating core functionality to Python
  2. Adding support for metabolomics data
  3. Implementing memory-efficient processing for large datasets
  4. Adding parallel processing capabilities
  5. Providing a robust command-line interface

Code Translation Overview

Detailed Code Translation Mapping

corto-data-prep-final.py

This script primarily implements functionality from corto.R:

  1. Data Loading and Validation
  • Initial data loading logic from corto() function
  • Input validation checks in validate_ccle_format()
  • Initial data preprocessing steps in preprocess_ccle_data()
  2. Zero-Variance Feature Handling
  • Translates the zero-variance removal logic:
# From corto.R
if(sum(is.na(inmat))>0){
    stop("Input matrix contains NA fields")
}
allvars<-apply(inmat,1,var)
keep<-names(allvars)[allvars>0]
inmat<-inmat[keep,]
  3. CNV Correction
  • Implements CNV correction logic from corto.R:
# From corto.R
if(!is.null(cnvmat)){
    commonrows<-intersect(rownames(cnvmat),rownames(inmat))
    commoncols<-intersect(colnames(cnvmat),colnames(inmat))
    cnvmat<-cnvmat[commonrows,commoncols]
    inmat<-inmat[commonrows,commoncols]
    # ... (the remainder of the CNV correction is omitted from this excerpt)
}
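
The two R excerpts above map naturally onto pandas operations. The following is a minimal sketch, not the code from corto-data-prep-final.py: the function names `drop_zero_variance` and `align_to_cnv` are hypothetical, and it assumes features are rows and samples are columns, as in the R matrices.

```python
import pandas as pd


def drop_zero_variance(inmat: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows (features) whose variance across samples is positive."""
    if inmat.isna().any().any():
        raise ValueError("Input matrix contains NA fields")
    # DataFrame.var uses ddof=1 by default, matching R's var()
    allvars = inmat.var(axis=1)
    return inmat.loc[allvars > 0]


def align_to_cnv(inmat: pd.DataFrame, cnvmat: pd.DataFrame):
    """Restrict both matrices to the rows and columns they share."""
    common_rows = cnvmat.index.intersection(inmat.index)
    common_cols = cnvmat.columns.intersection(inmat.columns)
    return inmat.loc[common_rows, common_cols], cnvmat.loc[common_rows, common_cols]
```

Both returned frames are indexed with the same labels in the same order, so later element-wise operations between them line up without further reindexing.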

corto-matrix-combination-final.py

This script implements functionality from multiple R sources:

  1. From functions.R:
  • Direct translation of p2r():
p2r<-function(p,n){
    t<-qt(p/2,df=n-2,lower.tail=FALSE)
    r<-sqrt((t^2)/(n-2+t^2))
    return(r)
}
  2. From mra.R:
  • Correlation calculation logic from MRA functions
  • Bootstrap implementation approach
  3. From gsea.R:
  • Statistical analysis approaches
  • Matrix manipulation techniques
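
For reference, the R `p2r()` function shown above can be written in Python roughly as follows. This is a sketch that assumes SciPy is available; the actual translation in corto-matrix-combination-final.py may differ. `scipy.stats.t.isf` returns the upper-tail quantile, which corresponds to `qt(..., lower.tail=FALSE)`.

```python
import math

from scipy.stats import t as student_t


def p2r(p: float, n: int) -> float:
    """Convert a two-sided p-value into the matching correlation coefficient
    for n samples (n - 2 degrees of freedom)."""
    df = n - 2
    t_crit = student_t.isf(p / 2, df)  # upper-tail quantile of Student's t
    return math.sqrt(t_crit**2 / (df + t_crit**2))
```

A stricter p-value or a smaller sample size both raise the correlation an edge must reach, which is why a threshold such as `--p_threshold 1e-30` keeps only very strong correlations.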

Key Implementation Differences

  1. Memory Management and Parallelism:
  • Added chunked processing for large matrices
  • Implemented parallel processing with ProcessPoolExecutor
  2. Extended Functionality:
  • Added combined matrix mode
  • Improved logging system
  • Command line interface
  3. Data Structure Updates:
  • Uses pandas DataFrames instead of R matrices
  • Optimized memory handling for large datasets
  4. Additional Features:
  • More extensive error checking
  • Progress reporting
  • Configurable preprocessing options
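
To illustrate what chunked processing of a large matrix can look like, the sketch below computes Pearson correlations between the rows of two matrices one block at a time, so the full correlation matrix never has to be held in memory at once. The function name `chunked_correlation` and the `chunk_size` parameter are assumptions made for this example, not names taken from the scripts, and the code assumes zero-variance rows have already been removed.

```python
import numpy as np


def _standardize_rows(matrix):
    """Center each row and scale it to unit length."""
    m = np.asarray(matrix, dtype=float)
    centered = m - m.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered**2).sum(axis=1, keepdims=True))
    return centered / norms


def chunked_correlation(a, b, chunk_size=1000):
    """Yield (start_row, block), where block holds the Pearson correlations
    between rows a[start_row:start_row + chunk_size] and every row of b."""
    za = _standardize_rows(a)
    zb = _standardize_rows(b)
    for start in range(0, za.shape[0], chunk_size):
        # Dot products of unit-length, centered rows are Pearson correlations
        yield start, za[start:start + chunk_size] @ zb.T
```

Each yielded block can be thresholded and reduced to an edge list before the next block is computed, which keeps peak memory proportional to `chunk_size` rather than to the number of rows in `a`.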

Installation

# Clone the repository
git clone https://github.com/yourusername/corto-metabolomics.git

# Install required packages
pip install -r requirements.txt

Usage

Data Preparation

python corto-data-prep-final.py \
    --metabolomics_file data/metabolomics.csv \
    --expression_file data/expression.txt \
    --cnv_file data/cnv.csv \
    --normalization standard \
    --outlier_detection zscore \
    --imputation knn
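
The flags in the command above could be declared with `argparse` along these lines. This is an illustrative sketch only: which flags are required, and the default values shown, are assumptions rather than the script's documented behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the data-preparation flags used in the example command."""
    parser = argparse.ArgumentParser(
        description="Prepare metabolomics and expression data (illustrative sketch)"
    )
    parser.add_argument("--metabolomics_file", required=True)
    parser.add_argument("--expression_file", required=True)
    # Assumed optional: CNV correction is skipped when no file is given
    parser.add_argument("--cnv_file", default=None)
    # Defaults below are assumptions taken from the example command
    parser.add_argument("--normalization", default="standard")
    parser.add_argument("--outlier_detection", default="zscore")
    parser.add_argument("--imputation", default="knn")
    return parser
```

Calling `build_parser().parse_args()` returns a namespace whose attributes carry the same names as the flags, for example `args.cnv_file`.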

Network Analysis

python corto-matrix-combination-final.py \
    --mode corto \
    --expression_file prepared_expression.csv \
    --metabolomics_file prepared_metabolomics.csv \
    --p_threshold 1e-30 \
    --nbootstraps 100 \
    --nthreads 4 \
    --verbose

Key Features

Data Preprocessing

  • Zero-variance feature removal
  • CNV correction
  • Outlier detection
  • Missing value imputation
  • Sample alignment
  • Quality control metrics
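
As an example of one of these steps, z-score outlier detection can be sketched as below. The function name `mask_zscore_outliers` and the default threshold of 3 are assumptions for illustration; the pipeline's own criteria may differ. Flagged values are replaced with NaN so that a later imputation step can fill them.

```python
import pandas as pd


def mask_zscore_outliers(data: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Replace values whose per-feature z-score exceeds `threshold` with NaN.

    Rows are features and columns are samples.
    """
    mean = data.mean(axis=1)
    std = data.std(axis=1)  # sample standard deviation (ddof=1)
    z = data.sub(mean, axis=0).div(std, axis=0)
    return data.mask(z.abs() > threshold)
```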

Network Analysis

  • Two analysis modes:
    • 'corto': Original approach keeping matrices separate
    • 'combined': Matrix combination approach for higher-order relationships
  • Parallel processing for bootstraps
  • Memory-efficient chunked processing
  • Comprehensive result reporting
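
The parallel bootstrap idea can be sketched as follows, assuming each bootstrap resamples the samples with replacement and recomputes a correlation. The names `bootstrap_correlations` and `executor_cls` are hypothetical; `nbootstraps` and `nthreads` simply mirror the command-line flags. Seeding each task separately keeps results reproducible regardless of how work is scheduled.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np


def _one_bootstrap(task):
    """Resample the samples with replacement and return Pearson r."""
    x, y, seed = task
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(x), size=len(x))
    return float(np.corrcoef(x[idx], y[idx])[0, 1])


def bootstrap_correlations(x, y, nbootstraps=100, nthreads=4,
                           executor_cls=ProcessPoolExecutor):
    """Run the bootstraps on a pool of workers and collect the correlations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    tasks = [(x, y, seed) for seed in range(nbootstraps)]
    with executor_cls(max_workers=nthreads) as pool:
        return list(pool.map(_one_bootstrap, tasks))


# Quick check with threads, which avoids process start-up cost
x_demo = np.arange(20.0)
demo = bootstrap_correlations(x_demo, 2 * x_demo + 1, nbootstraps=5,
                              nthreads=2, executor_cls=ThreadPoolExecutor)
```

When using the default `ProcessPoolExecutor` from a script, place the call inside an `if __name__ == "__main__":` guard so worker processes can import the module safely.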

Output Files

The pipeline generates several output files:

  1. Preprocessed Data:
  • prepared_metabolomics.csv
  • prepared_expression.csv
  • prepared_metrics.txt
  2. Network Analysis:
  • corto_network_{mode}.csv: Network edges and statistics
  • corto_regulon_{mode}.txt: Regulon object with relationship details