Corto Metabolomics Analysis Pipeline
A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.
Background
The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:
- Translating core functionality to Python
- Adding support for metabolomics data
- Implementing memory-efficient processing for large datasets
- Adding parallel processing capabilities
- Providing a robust command-line interface
Code Translation Overview
Detailed Code Translation Mapping
corto-data-prep-final.py
This script primarily implements functionality from corto.R:
- Data Loading and Validation
- Initial data loading logic from the corto() function
- Input validation checks in validate_ccle_format()
- Initial data preprocessing steps in preprocess_ccle_data()
- Zero Variance Feature Handling
- Translates zero variance removal logic:
# From corto.R
if(sum(is.na(inmat))>0){
stop("Input matrix contains NA fields")
}
allvars<-apply(inmat,1,var)
keep<-names(allvars)[allvars>0]
inmat<-inmat[keep,]
- CNV Correction
- Implements CNV correction logic from corto.R:
if(!is.null(cnvmat)){
commonrows<-intersect(rownames(cnvmat),rownames(inmat))
commoncols<-intersect(colnames(cnvmat),colnames(inmat))
cnvmat<-cnvmat[commonrows,commoncols]
inmat<-inmat[commonrows,commoncols]
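For comparison, the two R excerpts above could be expressed in pandas roughly as follows. This is a minimal sketch, not the code in corto-data-prep-final.py: the function names drop_zero_variance and align_with_cnv are hypothetical, and both matrices are assumed to be DataFrames with features in rows and samples in columns. Only the alignment step shown in the excerpt is covered, not the correction that follows it.

```python
import pandas as pd


def drop_zero_variance(inmat: pd.DataFrame) -> pd.DataFrame:
    """Keep only features (rows) whose variance across samples is positive."""
    # Mirrors the R check: refuse to continue if any value is missing.
    if inmat.isna().any().any():
        raise ValueError("Input matrix contains NA fields")
    # pandas uses the sample variance (ddof=1) by default, like R's var().
    allvars = inmat.var(axis=1)
    return inmat.loc[allvars > 0]


def align_with_cnv(inmat: pd.DataFrame, cnvmat: pd.DataFrame):
    """Restrict both matrices to the features and samples they share."""
    commonrows = cnvmat.index.intersection(inmat.index)
    commoncols = cnvmat.columns.intersection(inmat.columns)
    return inmat.loc[commonrows, commoncols], cnvmat.loc[commonrows, commoncols]


if __name__ == "__main__":
    # Hypothetical toy data: g2 is constant across samples and gets dropped.
    expr = pd.DataFrame(
        {"s1": [1.0, 5.0, 2.0], "s2": [2.0, 5.0, 4.0], "s3": [3.0, 5.0, 6.0]},
        index=["g1", "g2", "g3"],
    )
    print(drop_zero_variance(expr))
```

Returning new DataFrames rather than modifying the inputs keeps each step easy to test in isolation.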
corto-matrix-combination-final.py
This script implements functionality from multiple R sources:
- From functions.R:
- Direct translation of p2r():
p2r<-function(p,n){
t<-qt(p/2,df=n-2,lower.tail=FALSE)
r<-sqrt((t^2)/(n-2+t^2))
return(r)
}
- From mra.R:
- Correlation calculation logic from MRA functions
- Bootstrap implementation approach
- From gsea.R:
- Statistical analysis approaches
- Matrix manipulation techniques
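The p2r() function quoted above from functions.R converts a two-sided p-value into the absolute Pearson correlation that reaches that significance for n samples. A possible Python equivalent is sketched below. It assumes SciPy is available; scipy.stats.t.isf is the inverse survival function, which plays the role of R's qt(..., lower.tail=FALSE). The actual translation in corto-matrix-combination-final.py may differ.

```python
import math

from scipy.stats import t as student_t


def p2r(p: float, n: int) -> float:
    """Convert a two-sided p-value to the matching |Pearson r| for n samples."""
    df = n - 2
    # Upper-tail critical value of Student's t at p/2.
    t_crit = student_t.isf(p / 2, df)
    # Invert t = r * sqrt(df / (1 - r^2)) to recover r.
    return math.sqrt(t_crit**2 / (df + t_crit**2))


if __name__ == "__main__":
    # Correlation threshold implied by p = 0.05 with 30 samples.
    print(round(p2r(0.05, 30), 3))
```

Converting the p-value threshold to a correlation threshold once means each candidate edge only needs a cheap comparison against r instead of its own significance test.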
Key Implementation Differences
- Memory Management:
- Added chunked processing for large matrices
- Implemented parallel processing with ProcessPoolExecutor
- Extended Functionality:
- Added combined matrix mode
- Improved logging system
- Command-line interface
- Data Structure Updates:
- Uses pandas DataFrames instead of R matrices
- Optimized memory handling for large datasets
- Additional Features:
- More extensive error checking
- Progress reporting
- Configurable preprocessing options
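The chunked processing listed under Memory Management can be illustrated with a short NumPy sketch. It computes Pearson correlations between the rows of two matrices one block of rows at a time, so the full correlation matrix never has to be held in memory. The function names, the chunk_size default, and the edge format are assumptions made for illustration, and the sketch assumes zero-variance rows have already been removed.

```python
import numpy as np


def _standardize_rows(mat: np.ndarray) -> np.ndarray:
    """Center each row and scale it to unit Euclidean norm."""
    centered = mat - mat.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms


def correlate_in_chunks(x: np.ndarray, y: np.ndarray, chunk_size: int = 1000):
    """Yield (row_offset, block) pairs of Pearson correlations between rows of x and y."""
    y_std = _standardize_rows(y)
    for start in range(0, x.shape[0], chunk_size):
        x_std = _standardize_rows(x[start:start + chunk_size])
        # The dot product of two standardized rows is their Pearson correlation.
        yield start, x_std @ y_std.T


def edges_above(x: np.ndarray, y: np.ndarray, r_threshold: float, chunk_size: int = 1000):
    """Collect (i, j, r) for every row pair with |r| >= r_threshold."""
    edges = []
    for start, block in correlate_in_chunks(x, y, chunk_size):
        rows, cols = np.nonzero(np.abs(block) >= r_threshold)
        edges.extend(
            (start + int(i), int(j), float(block[i, j])) for i, j in zip(rows, cols)
        )
    return edges
```

Only the edges that pass the threshold are kept, so peak memory is bounded by one block of size chunk_size by the number of rows in y.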
Installation
# Clone the repository
git clone https://github.com/yourusername/corto-metabolomics.git
cd corto-metabolomics
# Install required packages
pip install -r requirements.txt
Usage
Data Preparation
python corto-data-prep-final.py \
--metabolomics_file data/metabolomics.csv \
--expression_file data/expression.txt \
--cnv_file data/cnv.csv \
--normalization standard \
--outlier_detection zscore \
--imputation knn
Network Analysis
python corto-matrix-combination-final.py \
--mode corto \
--expression_file prepared_expression.csv \
--metabolomics_file prepared_metabolomics.csv \
--p_threshold 1e-30 \
--nbootstraps 100 \
--nthreads 4 \
--verbose
Key Features
Data Preprocessing
- Zero-variance feature removal
- CNV correction
- Outlier detection
- Missing value imputation
- Sample alignment
- Quality control metrics
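As a rough illustration of the outlier detection and imputation steps, the sketch below masks values by z-score and then fills the gaps. It is not the pipeline's implementation: the function names and the z_max threshold of 3.0 are hypothetical, features are assumed to be in rows, and a per-feature median fill stands in for the knn imputation option to keep the example dependency-free.

```python
import numpy as np
import pandas as pd


def mask_zscore_outliers(df: pd.DataFrame, z_max: float = 3.0) -> pd.DataFrame:
    """Replace values more than z_max standard deviations from their row mean with NaN."""
    mean = df.mean(axis=1)
    std = df.std(axis=1)
    z = df.sub(mean, axis=0).div(std, axis=0)
    # Rows with zero spread give NaN z-scores, which never exceed z_max.
    return df.mask(z.abs() > z_max)


def fill_with_row_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with the median of the same feature (row)."""
    return df.apply(lambda row: row.fillna(row.median()), axis=1)


if __name__ == "__main__":
    # Hypothetical metabolite with one extreme reading among 21 samples.
    data = pd.DataFrame([[1.0] * 20 + [100.0]], index=["m1"])
    print(fill_with_row_median(mask_zscore_outliers(data)).iloc[0, -1])
```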
Network Analysis
- Two analysis modes:
- 'corto': Original approach keeping matrices separate
- 'combined': Matrix combination approach for higher-order relationships
- Parallel processing for bootstraps
- Memory-efficient chunked processing
- Comprehensive result reporting
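Parallel bootstrapping with ProcessPoolExecutor might look like the sketch below. Each bootstrap resamples the sample columns with replacement, recomputes the cross-correlations, and records which pairs pass the threshold; the per-pair counts are then summed. The function names and the idea of counting edge support are illustrative assumptions, while the nbootstraps and nthreads parameters simply echo the command-line options shown under Usage.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np


def bootstrap_edges(seed: int, x: np.ndarray, y: np.ndarray, r_threshold: float) -> np.ndarray:
    """Resample samples (columns) with replacement and flag pairs with |r| >= r_threshold."""
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, x.shape[1], size=x.shape[1])
    xb, yb = x[:, cols], y[:, cols]
    # np.corrcoef stacks the rows of both inputs; keep only the x-versus-y block.
    corr = np.corrcoef(xb, yb)[: x.shape[0], x.shape[0]:]
    return np.abs(corr) >= r_threshold


def edge_support(x, y, r_threshold, nbootstraps=100, nthreads=4):
    """Count, for each row pair, how many bootstraps kept the edge."""
    worker = partial(bootstrap_edges, x=x, y=y, r_threshold=r_threshold)
    with ProcessPoolExecutor(max_workers=nthreads) as pool:
        # One seed per bootstrap keeps the run reproducible.
        hits = pool.map(worker, range(nbootstraps))
        return sum(h.astype(int) for h in hits)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(4, 30))
    y = rng.normal(size=(3, 30))
    y[0] = x[1]  # plant one perfectly correlated pair
    print(edge_support(x, y, 0.99, nbootstraps=10, nthreads=2))
```

Passing an explicit seed to each worker avoids sharing random state across processes. Note that this sketch pickles the input matrices for every task, which is simple but would be costly for very large inputs.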
Output Files
The pipeline generates several output files:
- Preprocessed Data:
- prepared_metabolomics.csv
- prepared_expression.csv
- prepared_metrics.txt
- Network Analysis:
- corto_network_{mode}.csv: Network edges and statistics
- corto_regulon_{mode}.txt: Regulon object with relationship details