# Corto Metabolomics Analysis Pipeline
A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.
## Background
The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:
1. Translating core functionality to Python
2. Adding support for metabolomics data
3. Implementing memory-efficient processing for large datasets
4. Adding parallel processing capabilities
5. Providing a robust command-line interface
## Code Translation Overview
### Original R Components:
The project translates code from several R source files:
- `corto.R`: Core algorithm implementation
- `functions.R`: Utility functions and statistical analysis
- `mra.R`: Master Regulator Analysis functionality
- `gsea.R`: Gene Set Enrichment Analysis components
### Python Implementation:
The functionality has been reorganized into two main Python scripts:
1. `corto-data-prep-final.py`:
   - Data loading and validation
   - Preprocessing pipeline
   - CNV correction
   - Quality control metrics
2. `corto-matrix-combination-final.py`:
   - Network analysis implementation
   - Correlation calculations
   - Bootstrap analysis
   - Results generation
## Installation
```bash
# Clone the repository and enter it
git clone https://github.com/yourusername/corto-metabolomics.git
cd corto-metabolomics
# Install required packages
pip install -r requirements.txt
```
## Usage
### Data Preparation
```bash
python corto-data-prep-final.py \
    --metabolomics_file data/metabolomics.csv \
    --expression_file data/expression.txt \
    --cnv_file data/cnv.csv \
    --normalization standard \
    --outlier_detection zscore \
    --imputation knn
```
### Network Analysis
```bash
python corto-matrix-combination-final.py \
    --mode corto \
    --expression_file prepared_expression.csv \
    --metabolomics_file prepared_metabolomics.csv \
    --p_threshold 1e-30 \
    --nbootstraps 100 \
    --nthreads 4 \
    --verbose
```
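How `--p_threshold` is applied inside the script is not described in this README. One common reading is that the p-value is turned into a minimum absolute correlation that depends on the number of samples. The sketch below illustrates that idea with the Fisher z approximation, using only the standard library; the function names `r_threshold_from_p` and `p_from_r` are hypothetical and are not taken from the scripts.

```python
import math
from statistics import NormalDist


def r_threshold_from_p(p_threshold, n_samples):
    """Smallest |r| whose two-sided p-value is at most p_threshold.

    Assumption: Fisher z approximation, where atanh(r) * sqrt(n - 3)
    is treated as a standard normal variable under the null hypothesis.
    """
    if n_samples <= 3:
        raise ValueError("need more than 3 samples")
    # Use the lower tail (p / 2) so that very small p-values stay representable.
    z = -NormalDist().inv_cdf(p_threshold / 2.0)
    return math.tanh(z / math.sqrt(n_samples - 3))


def p_from_r(r, n_samples):
    """Two-sided p-value for a Pearson correlation r (requires |r| < 1)."""
    z = math.atanh(abs(r)) * math.sqrt(n_samples - 3)
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    for n in (50, 100, 1000):
        print(n, r_threshold_from_p(1e-30, n))
```

Under this reading, a threshold as strict as `1e-30` demands a very high correlation for small cohorts and relaxes as the sample count grows, so it may be worth checking the implied cutoff for your own dataset size before running the analysis.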
## Key Features
### Data Preprocessing
- Zero-variance feature removal
- CNV correction
- Outlier detection
- Missing value imputation
- Sample alignment
- Quality control metrics
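The preprocessing steps listed above can be sketched roughly as follows. This is an illustration in NumPy under stated assumptions, not the code in `corto-data-prep-final.py`: matrices are assumed to be features by samples, all function names are hypothetical, the z-score cutoff of 3 is an arbitrary default, and CNV correction is assumed to mean taking per-gene residuals of a simple linear fit of expression on copy number.

```python
import numpy as np


def drop_zero_variance(matrix, feature_names):
    """Remove features (rows) that take a single constant value across samples."""
    keep = np.nanmax(matrix, axis=1) > np.nanmin(matrix, axis=1)
    return matrix[keep], [name for name, k in zip(feature_names, keep) if k]


def align_samples(mat_a, samples_a, mat_b, samples_b):
    """Restrict two matrices to their shared samples, in the order of samples_a."""
    in_b = set(samples_b)
    shared = [s for s in samples_a if s in in_b]
    idx_a = [samples_a.index(s) for s in shared]
    idx_b = [samples_b.index(s) for s in shared]
    return mat_a[:, idx_a], mat_b[:, idx_b], shared


def mask_zscore_outliers(matrix, threshold=3.0):
    """Replace values beyond `threshold` row-wise standard deviations with NaN."""
    mean = np.nanmean(matrix, axis=1, keepdims=True)
    std = np.nanstd(matrix, axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant rows have no outliers
    out = np.array(matrix, dtype=float)
    out[np.abs((matrix - mean) / std) > threshold] = np.nan
    return out


def cnv_residuals(expression, cnv):
    """Per-gene residuals of expression regressed on copy number.

    Both inputs are genes x samples with matching rows and columns.
    The result is centered, because the fitted intercept is removed too.
    """
    x = np.asarray(cnv, dtype=float)
    y = np.asarray(expression, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    denom = (xc ** 2).sum(axis=1, keepdims=True)
    slope = np.divide((xc * yc).sum(axis=1, keepdims=True), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return yc - slope * xc
```

Values masked as NaN by the outlier step would then be filled by the imputation step (for example the `knn` option shown in the usage section), which is not reproduced here.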
### Network Analysis
- Two analysis modes:
  - 'corto': Original approach keeping matrices separate
  - 'combined': Matrix combination approach for higher-order relationships
- Parallel processing for bootstraps
- Memory-efficient chunked processing
- Comprehensive result reporting
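To make the chunked and bootstrapped correlation idea concrete, here is a minimal sketch in NumPy. It assumes features-by-samples matrices with matching sample columns and a fixed absolute-correlation cutoff; the function names, the `chunk_size` default, and the choice to count edge support per bootstrap are illustrative assumptions rather than the behavior of `corto-matrix-combination-final.py`.

```python
import numpy as np


def standardize_rows(matrix):
    """Center each row and scale it to unit length, so dot products equal Pearson r."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows get r = 0 with everything
    return centered / norms


def chunked_edges(metabolites, genes, r_threshold, chunk_size=1000):
    """Yield (metabolite_idx, gene_idx, r) with |r| >= r_threshold.

    Only one metabolites x chunk_size block of the correlation matrix
    is held in memory at a time.
    """
    met_z = standardize_rows(metabolites)
    for start in range(0, genes.shape[0], chunk_size):
        gene_z = standardize_rows(genes[start:start + chunk_size])
        corr = met_z @ gene_z.T
        rows, cols = np.nonzero(np.abs(corr) >= r_threshold)
        for i, j in zip(rows, cols):
            yield int(i), int(start + j), float(corr[i, j])


def bootstrap_edge_support(metabolites, genes, r_threshold, n_bootstraps=100, seed=0):
    """Count how often each edge passes the cutoff when samples are resampled."""
    rng = np.random.default_rng(seed)
    n_samples = metabolites.shape[1]
    support = {}
    for _ in range(n_bootstraps):
        # Resample sample columns with replacement, identically in both matrices.
        picks = rng.integers(0, n_samples, size=n_samples)
        for i, j, _r in chunked_edges(metabolites[:, picks], genes[:, picks], r_threshold):
            support[(i, j)] = support.get((i, j), 0) + 1
    return support
```

Each bootstrap iteration is independent of the others, which is what makes the loop a natural candidate for distribution across worker processes or threads; that part is left out of the sketch to keep it short.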
## Output Files
The pipeline generates several output files:
1. Preprocessed Data:
   - `prepared_metabolomics.csv`
   - `prepared_expression.csv`
   - `prepared_metrics.txt`
2. Network Analysis:
   - `corto_network_{mode}.csv`: Network edges and statistics
   - `corto_regulon_{mode}.txt`: Regulon object with relationship details