# Corto Metabolomics Analysis Pipeline

A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.

## Background

The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:

1. Translating core functionality to Python
2. Adding support for metabolomics data
3. Implementing memory-efficient processing for large datasets
4. Adding parallel processing capabilities
5. Providing a robust command-line interface

## Code Translation Overview

### Original R Components:

The project translates code from several R source files:

- `corto.R`: Core algorithm implementation
- `functions.R`: Utility functions and statistical analysis
- `mra.R`: Master Regulator Analysis functionality
- `gsea.R`: Gene Set Enrichment Analysis components

### Python Implementation:

The functionality has been reorganized into two main Python scripts:

1. `corto-data-prep-final.py`:
   - Data loading and validation
   - Preprocessing pipeline
   - CNV correction
   - Quality control metrics
2. `corto-matrix-combination-final.py`:
   - Network analysis implementation
   - Correlation calculations
   - Bootstrap analysis
   - Results generation

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/corto-metabolomics.git

# Install required packages
pip install -r requirements.txt
```

## Usage

### Data Preparation

```bash
python corto-data-prep-final.py \
    --metabolomics_file data/metabolomics.csv \
    --expression_file data/expression.txt \
    --cnv_file data/cnv.csv \
    --normalization standard \
    --outlier_detection zscore \
    --imputation knn
```

### Network Analysis

```bash
python corto-matrix-combination-final.py \
    --mode corto \
    --expression_file prepared_expression.csv \
    --metabolomics_file prepared_metabolomics.csv \
    --p_threshold 1e-30 \
    --nbootstraps 100 \
    --nthreads 4 \
    --verbose
```

## Key Features

### Data Preprocessing

- Zero-variance feature removal
- CNV correction
- Outlier detection
- Missing value imputation
- Sample alignment
- Quality control metrics

### Network Analysis

- Two analysis modes:
  - 'corto': Original approach keeping matrices separate
  - 'combined': Matrix combination approach for higher-order relationships
- Parallel processing for bootstraps
- Memory-efficient chunked processing
- Comprehensive result reporting

## Output Files

The pipeline generates several output files:

1. Preprocessed Data:
   - `prepared_metabolomics.csv`
   - `prepared_expression.csv`
   - `prepared_metrics.txt`
2. Network Analysis:
   - `corto_network_{mode}.csv`: Network edges and statistics
   - `corto_regulon_{mode}.txt`: Regulon object with relationship details
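
## Example Sketch

To make the sample alignment, zero-variance removal, and correlation steps listed under Key Features more concrete, here is a minimal sketch of how they might fit together. It is not the code from the two scripts. The function names (`align_samples`, `drop_zero_variance`, `cross_correlation`), the toy data, and the layout (features as rows, samples as columns) are all assumptions made for illustration. The sketch also filters edges with a simple cutoff on |r| as a stand-in, whereas the pipeline's `--p_threshold` option operates on p-values, which are not computed here.

```python
import pandas as pd


def align_samples(a: pd.DataFrame, b: pd.DataFrame):
    """Keep only the samples (columns) present in both matrices, in the same order."""
    shared = a.columns.intersection(b.columns)
    return a[shared], b[shared]


def drop_zero_variance(df: pd.DataFrame) -> pd.DataFrame:
    """Remove features (rows) that are constant across samples."""
    # Comparing max and min is an exact test for a constant row.
    return df.loc[df.max(axis=1) > df.min(axis=1)]


def cross_correlation(expr: pd.DataFrame, metab: pd.DataFrame) -> pd.DataFrame:
    """Pearson r between every gene (row of expr) and every metabolite (row of metab)."""
    # Standardize each feature across samples; the scaled matrix product is Pearson r.
    ez = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1, ddof=0), axis=0)
    mz = metab.sub(metab.mean(axis=1), axis=0).div(metab.std(axis=1, ddof=0), axis=0)
    r = ez.to_numpy() @ mz.to_numpy().T / expr.shape[1]
    return pd.DataFrame(r, index=expr.index, columns=metab.index)


# Toy data (hypothetical): features as rows, samples as columns.
expr = pd.DataFrame(
    {"s1": [1.0, 5.0, 2.0], "s2": [2.0, 5.0, 1.0],
     "s3": [3.0, 5.0, 4.0], "s4": [4.0, 5.0, 3.0]},
    index=["geneA", "geneB", "geneC"],
)
metab = pd.DataFrame(
    {"s2": [4.0, 1.0], "s3": [6.0, 0.5], "s4": [8.0, 2.0], "s5": [10.0, 3.0]},
    index=["met1", "met2"],
)

# Align first, so that variance is judged on the samples actually used.
expr_aligned, metab_aligned = align_samples(expr, metab)
expr_clean = drop_zero_variance(expr_aligned)
metab_clean = drop_zero_variance(metab_aligned)

r = cross_correlation(expr_clean, metab_clean)

# Long-format edge list, keeping only strong correlations (illustrative cutoff).
edges = (
    r.rename_axis("gene")
    .reset_index()
    .melt(id_vars="gene", var_name="metabolite", value_name="pearson_r")
)
edges = edges[edges["pearson_r"].abs() >= 0.9]
print(edges)
```

Aligning samples before the variance check matters because a feature that varies across the full cohort can become constant once the matrices are restricted to their shared samples, and a constant row would produce a division by zero during standardization.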