# Corto Metabolomics Analysis Pipeline

A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.

## Background

The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:

1. Translating core functionality to Python
2. Adding support for metabolomics data
3. Implementing memory-efficient processing for large datasets
4. Adding parallel processing capabilities
5. Providing a robust command-line interface

## Code Translation Overview

### Detailed Code Translation Mapping

#### corto-data-prep-final.py

This script primarily implements functionality from corto.R:

1. Data Loading and Validation
   - Initial data loading logic from `corto()` function
   - Input validation checks in `validate_ccle_format()`
   - Initial data preprocessing steps in `preprocess_ccle_data()`

2. Zero Variance Feature Handling
   - Translates zero variance removal logic:

     ```R
     # From corto.R
     if(sum(is.na(inmat))>0){
       stop("Input matrix contains NA fields")
     }
     allvars<-apply(inmat,1,var)
     keep<-names(allvars)[allvars>0]
     inmat<-inmat[keep,]
     ```

3. CNV Correction
   - Implements CNV correction logic from corto.R:

     ```R
     if(!is.null(cnvmat)){
       commonrows<-intersect(rownames(cnvmat),rownames(inmat))
       commoncols<-intersect(colnames(cnvmat),colnames(inmat))
       cnvmat<-cnvmat[commonrows,commoncols]
       inmat<-inmat[commonrows,commoncols]
     ```

#### corto-matrix-combination-final.py

This script implements functionality from multiple R sources:

1. From functions.R:
   - Direct translation of `p2r()`:

     ```R
     p2r<-function(p,n){
       t<-qt(p/2,df=n-2,lower.tail=FALSE)
       r<-sqrt((t^2)/(n-2+t^2))
       return(r)
     }
     ```

2. From mra.R:
   - Correlation calculation logic from MRA functions
   - Bootstrap implementation approach

3. From gsea.R:
   - Statistical analysis approaches
   - Matrix manipulation techniques

### Key Implementation Differences

1. Memory Management:
   - Added chunked processing for large matrices
   - Implemented parallel processing with ProcessPoolExecutor

2. Extended Functionality:
   - Added combined matrix mode
   - Improved logging system
   - Command line interface

3. Data Structure Updates:
   - Uses pandas DataFrames instead of R matrices
   - Optimized memory handling for large datasets

4. Additional Features:
   - More extensive error checking
   - Progress reporting
   - Configurable preprocessing options

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/corto-metabolomics.git

# Install required packages
pip install -r requirements.txt
```

## Usage

### Data Preparation

```bash
python corto-data-prep-final.py \
    --metabolomics_file data/metabolomics.csv \
    --expression_file data/expression.txt \
    --cnv_file data/cnv.csv \
    --normalization standard \
    --outlier_detection zscore \
    --imputation knn
```

### Network Analysis

```bash
python corto-matrix-combination-final.py \
    --mode corto \
    --expression_file prepared_expression.csv \
    --metabolomics_file prepared_metabolomics.csv \
    --p_threshold 1e-30 \
    --nbootstraps 100 \
    --nthreads 4 \
    --verbose
```

## Key Features

### Data Preprocessing

- Zero-variance feature removal
- CNV correction
- Outlier detection
- Missing value imputation
- Sample alignment
- Quality control metrics

### Network Analysis

- Two analysis modes:
  - 'corto': Original approach keeping matrices separate
  - 'combined': Matrix combination approach for higher-order relationships
- Parallel processing for bootstraps
- Memory-efficient chunked processing
- Comprehensive result reporting

## Output Files

The pipeline generates several output files:

1. Preprocessed Data:
   - `prepared_metabolomics.csv`
   - `prepared_expression.csv`
   - `prepared_metrics.txt`

2. Network Analysis:
   - `corto_network_{mode}.csv`: Network edges and statistics
   - `corto_regulon_{mode}.txt`: Regulon object with relationship details
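
For readers comparing the two codebases, the `p2r()` function quoted in the translation mapping can be sketched in Python as follows. This is a minimal illustration, not necessarily the code in `corto-matrix-combination-final.py`. It assumes SciPy is available and uses `scipy.stats.t.isf` as the counterpart of R's `qt(..., lower.tail=FALSE)`.

```python
import math

from scipy import stats


def p2r(p: float, n: int) -> float:
    """Convert a two-tailed p-value into the matching Pearson r threshold.

    Mirrors the R p2r() from functions.R: p is the p-value and n is the
    number of samples, so the t-distribution has n - 2 degrees of freedom.
    """
    # Upper-tail quantile, equivalent to qt(p/2, df=n-2, lower.tail=FALSE) in R
    t = stats.t.isf(p / 2, df=n - 2)
    # Invert t = r * sqrt((n - 2) / (1 - r^2)) to recover r
    return math.sqrt(t**2 / (n - 2 + t**2))
```

A call such as `p2r(1e-30, n_samples)` would give the minimum absolute correlation that an edge needs in order to pass the `--p_threshold 1e-30` setting shown in the usage example, assuming the threshold is applied in this way.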
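
The zero-variance filter and the CNV alignment step quoted from corto.R could look roughly like this with pandas DataFrames. This is a sketch under stated assumptions: features are rows and samples are columns, as in the R matrices, and the function names `drop_zero_variance` and `align_to_cnv` are hypothetical rather than taken from `corto-data-prep-final.py`.

```python
import pandas as pd


def drop_zero_variance(inmat: pd.DataFrame) -> pd.DataFrame:
    """Reject matrices with missing values, then drop constant features."""
    if inmat.isna().any().any():
        raise ValueError("Input matrix contains NA fields")
    # Row-wise sample variance (ddof=1), matching apply(inmat, 1, var) in R
    allvars = inmat.var(axis=1)
    return inmat.loc[allvars > 0]


def align_to_cnv(inmat: pd.DataFrame, cnvmat: pd.DataFrame):
    """Restrict both matrices to the features and samples they share."""
    commonrows = cnvmat.index.intersection(inmat.index)
    commoncols = cnvmat.columns.intersection(inmat.columns)
    return inmat.loc[commonrows, commoncols], cnvmat.loc[commonrows, commoncols]
```

Only the alignment part of the CNV step is sketched here, because the quoted R excerpt stops before the correction itself.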
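
The chunked processing listed under the key implementation differences can be illustrated with a small NumPy sketch. It computes Pearson correlations between two feature matrices one block of rows at a time, so the full correlation matrix never has to be held in memory. The function name `chunked_correlation` and the `chunk_size` parameter are illustrative assumptions, and the pipeline's actual chunking strategy may differ.

```python
import numpy as np


def _standardize(m: np.ndarray) -> np.ndarray:
    """Center each row and scale it to unit length."""
    m = m - m.mean(axis=1, keepdims=True)
    return m / np.sqrt((m**2).sum(axis=1, keepdims=True))


def chunked_correlation(a, b, chunk_size=1000):
    """Yield (start, block) pairs of Pearson correlations.

    a and b are (features x samples) arrays with the same sample columns.
    Each block holds the correlations between rows
    a[start:start + chunk_size] and every row of b. Rows must have
    non-zero variance, so zero-variance features should be removed first.
    """
    a = np.asarray(a, dtype=float)
    bz = _standardize(np.asarray(b, dtype=float))
    for start in range(0, a.shape[0], chunk_size):
        az = _standardize(a[start:start + chunk_size])
        # The dot product of unit-length centered rows is the Pearson r
        yield start, az @ bz.T
```

Each yielded block can be thresholded and written out before the next one is computed, which keeps peak memory proportional to `chunk_size` rather than to the number of features in `a`.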