# Corto Metabolomics Analysis Pipeline

A Python implementation of the corto algorithm for analyzing metabolomics and gene expression data, translated from the original R codebase. This project provides tools for preprocessing multi-omics data and performing network analysis to identify relationships between metabolites and gene expression.

## Background

The original corto algorithm was implemented in R for analyzing gene expression data and identifying master regulators. This project extends and modernizes the implementation by:

1. Translating core functionality to Python
2. Adding support for metabolomics data
3. Implementing memory-efficient processing for large datasets
4. Adding parallel processing capabilities
5. Providing a robust command-line interface

## Code Translation Overview

### Detailed Code Translation Mapping

#### corto-data-prep-final.py

This script primarily implements functionality from corto.R:

1. Data Loading and Validation
   - Initial data loading logic from the `corto()` function
   - Input validation checks in `validate_ccle_format()`
   - Initial data preprocessing steps in `preprocess_ccle_data()`

2. Zero Variance Feature Handling
   - Translates zero variance removal logic:
   ```R
   # From corto.R
   if(sum(is.na(inmat))>0){
       stop("Input matrix contains NA fields")
   }
   allvars<-apply(inmat,1,var)
   keep<-names(allvars)[allvars>0]
   inmat<-inmat[keep,]
   ```

3. CNV Correction
   - Implements CNV correction logic from corto.R (excerpt showing the alignment step):
   ```R
   if(!is.null(cnvmat)){
       commonrows<-intersect(rownames(cnvmat),rownames(inmat))
       commoncols<-intersect(colnames(cnvmat),colnames(inmat))
       cnvmat<-cnvmat[commonrows,commoncols]
       inmat<-inmat[commonrows,commoncols]
       # ... (excerpt ends here; the correction itself continues in corto.R)
   }
   ```

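As a rough illustration, the two R excerpts above (zero-variance filtering and CNV alignment) could be expressed with pandas along the following lines. This is a minimal sketch, not the code in `corto-data-prep-final.py`: the function names `remove_zero_variance` and `align_cnv` and the toy matrices are hypothetical, and rows are assumed to be features and columns samples, as in the R code.

```python
import pandas as pd


def remove_zero_variance(inmat: pd.DataFrame) -> pd.DataFrame:
    """Drop features (rows) whose variance across samples is zero."""
    if inmat.isna().any().any():
        raise ValueError("Input matrix contains NA fields")
    # pandas uses ddof=1 by default, matching R's var()
    allvars = inmat.var(axis=1)
    return inmat.loc[allvars > 0]


def align_cnv(inmat: pd.DataFrame, cnvmat: pd.DataFrame):
    """Restrict both matrices to their shared rows and columns."""
    # Keep the order of cnvmat, like intersect(rownames(cnvmat), rownames(inmat))
    commonrows = [r for r in cnvmat.index if r in inmat.index]
    commoncols = [c for c in cnvmat.columns if c in inmat.columns]
    return inmat.loc[commonrows, commoncols], cnvmat.loc[commonrows, commoncols]


# Toy data (hypothetical): geneB is constant across samples
expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0], [2.0, 4.0, 9.0]],
    index=["geneA", "geneB", "geneC"],
    columns=["s1", "s2", "s3"],
)
cnv = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
    index=["geneC", "geneA", "geneD"],
    columns=["s2", "s3", "s4"],
)

filtered = remove_zero_variance(expr)
expr_aligned, cnv_aligned = align_cnv(filtered, cnv)
```

Here `filtered` no longer contains `geneB`, and both aligned matrices are reduced to the genes and samples present in each input.
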
#### corto-matrix-combination-final.py

This script implements functionality from multiple R sources:

1. From functions.R:
   - Direct translation of `p2r()`:
   ```R
   p2r<-function(p,n){
       t<-qt(p/2,df=n-2,lower.tail=FALSE)
       r<-sqrt((t^2)/(n-2+t^2))
       return(r)
   }
   ```

2. From mra.R:
   - Correlation calculation logic from MRA functions
   - Bootstrap implementation approach

3. From gsea.R:
   - Statistical analysis approaches
   - Matrix manipulation techniques

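For reference, the R `p2r()` function shown above could be written in Python roughly as follows. This is a sketch that assumes SciPy is available; it is not necessarily how `corto-matrix-combination-final.py` implements it. `scipy.stats.t.isf` is the inverse survival function, which plays the role of `qt(..., lower.tail=FALSE)`.

```python
import math

from scipy import stats


def p2r(p: float, n: int) -> float:
    """Convert a two-sided p-value into the matching absolute Pearson r for n samples."""
    # Upper-tail t quantile, equivalent to qt(p/2, df=n-2, lower.tail=FALSE)
    t = stats.t.isf(p / 2, df=n - 2)
    return math.sqrt(t**2 / (n - 2 + t**2))


# Correlation threshold for p = 0.05 with 30 samples (about 0.36)
r_threshold = p2r(0.05, 30)
```

A smaller p-value gives a larger correlation threshold, so a strict `--p_threshold` keeps only the strongest edges.
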
### Key Implementation Differences

1. Memory Management:
   - Added chunked processing for large matrices
   - Implemented parallel processing with ProcessPoolExecutor

2. Extended Functionality:
   - Added combined matrix mode
   - Improved logging system
   - Command line interface

3. Data Structure Updates:
   - Uses pandas DataFrames instead of R matrices
   - Optimized memory handling for large datasets

4. Additional Features:
   - More extensive error checking
   - Progress reporting
   - Configurable preprocessing options

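To make the chunking and `ProcessPoolExecutor` items above more concrete, here is one way bootstraps might be split into chunks and distributed across worker processes. It is a sketch under stated assumptions: the names `chunked`, `pearson`, `bootstrap_correlation`, `bootstrap_chunk`, and `parallel_bootstrap` are hypothetical, and a plain-Python Pearson correlation stands in for whatever statistic the scripts actually compute.

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlation(x, y, seed):
    """Resample the samples with replacement and recompute the correlation."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(x)) for _ in range(len(x))]
    return pearson([x[i] for i in idx], [y[i] for i in idx])


def bootstrap_chunk(args):
    """Worker entry point: run every bootstrap whose seed is in this chunk."""
    x, y, seeds = args
    return [bootstrap_correlation(x, y, seed) for seed in seeds]


def parallel_bootstrap(x, y, nbootstraps=100, nthreads=4, chunk_size=25):
    """Run the bootstraps in worker processes, one chunk of seeds per task."""
    tasks = [(x, y, seeds) for seeds in chunked(list(range(nbootstraps)), chunk_size)]
    with ProcessPoolExecutor(max_workers=nthreads) as pool:
        results = list(pool.map(bootstrap_chunk, tasks))
    return [r for chunk in results for r in chunk]
```

When run as a script, `parallel_bootstrap(x, y, nbootstraps=100, nthreads=4)` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Sending one chunk of seeds per task keeps the number of inter-process messages small.
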
## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/corto-metabolomics.git
cd corto-metabolomics

# Install required packages
pip install -r requirements.txt
```

## Usage

### Data Preparation

```bash
python corto-data-prep-final.py \
    --metabolomics_file data/metabolomics.csv \
    --expression_file data/expression.txt \
    --cnv_file data/cnv.csv \
    --normalization standard \
    --outlier_detection zscore \
    --imputation knn
```

### Network Analysis

```bash
python corto-matrix-combination-final.py \
    --mode corto \
    --expression_file prepared_expression.csv \
    --metabolomics_file prepared_metabolomics.csv \
    --p_threshold 1e-30 \
    --nbootstraps 100 \
    --nthreads 4 \
    --verbose
```

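The flags in the command above could be declared with `argparse` roughly as shown below. Only the flag names and the two mode values come from this README; the types, defaults, help strings, and the `build_parser` name are assumptions made for illustration and may differ from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the network analysis flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="corto network analysis (sketch)")
    parser.add_argument("--mode", choices=["corto", "combined"], default="corto",
                        help="analysis mode")
    parser.add_argument("--expression_file", required=True,
                        help="prepared expression matrix (CSV)")
    parser.add_argument("--metabolomics_file", required=True,
                        help="prepared metabolomics matrix (CSV)")
    parser.add_argument("--p_threshold", type=float, default=1e-30,
                        help="p-value threshold for keeping an edge")
    parser.add_argument("--nbootstraps", type=int, default=100,
                        help="number of bootstrap iterations")
    parser.add_argument("--nthreads", type=int, default=4,
                        help="number of parallel workers")
    parser.add_argument("--verbose", action="store_true",
                        help="enable detailed logging")
    return parser


# Parse the same arguments used in the example command
args = build_parser().parse_args([
    "--mode", "corto",
    "--expression_file", "prepared_expression.csv",
    "--metabolomics_file", "prepared_metabolomics.csv",
    "--p_threshold", "1e-30",
    "--nbootstraps", "100",
    "--nthreads", "4",
    "--verbose",
])
```
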
## Key Features

### Data Preprocessing
- Zero-variance feature removal
- CNV correction
- Outlier detection
- Missing value imputation
- Sample alignment
- Quality control metrics

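As an example of the outlier detection step listed above, a z-score check (the method selected by `--outlier_detection zscore`) might work along these lines. This is a sketch using only the standard library: the function names, the use of the sample standard deviation, and the cutoff of 3.0 are assumptions rather than the pipeline's documented behavior.

```python
import statistics


def zscores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # A constant feature has no outliers by this definition
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def flag_outliers(values, threshold=3.0):
    """Return True for each value whose absolute z-score exceeds the threshold."""
    return [abs(z) > threshold for z in zscores(values)]


# Hypothetical metabolite measured in 20 samples, with one extreme value
measurements = [10.0] * 19 + [100.0]
flags = flag_outliers(measurements)
```

In this toy input only the last measurement is flagged. With very few samples a single extreme value can never reach a z-score of 3, so the cutoff needs to be chosen with the sample count in mind.
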
### Network Analysis
- Two analysis modes:
  - 'corto': Original approach keeping matrices separate
  - 'combined': Matrix combination approach for higher-order relationships
- Parallel processing for bootstraps
- Memory-efficient chunked processing
- Comprehensive result reporting

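One plausible reading of the 'combined' mode is that the expression and metabolomics matrices are stacked into a single feature-by-sample matrix over their shared samples. The sketch below illustrates that idea with pandas; it is an assumption about the approach rather than a description of the script, and `combine_matrices` and the toy data are hypothetical.

```python
import pandas as pd


def combine_matrices(expression: pd.DataFrame, metabolomics: pd.DataFrame) -> pd.DataFrame:
    """Stack two feature-by-sample matrices over the samples they share."""
    shared = [s for s in expression.columns if s in metabolomics.columns]
    if not shared:
        raise ValueError("No shared samples between the two matrices")
    # The keys label each row with its source, so feature names cannot collide
    return pd.concat(
        [expression[shared], metabolomics[shared]],
        keys=["expression", "metabolomics"],
    )


# Toy data (hypothetical): rows are features, columns are samples
expression = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["TP53", "MYC"],
    columns=["s1", "s2", "s3"],
)
metabolomics = pd.DataFrame(
    [[7.0, 8.0, 9.0]],
    index=["lactate"],
    columns=["s2", "s3", "s4"],
)

combined = combine_matrices(expression, metabolomics)
```

The result has one row per gene or metabolite and only the samples `s2` and `s3`, which appear in both inputs.
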
## Output Files

The pipeline generates several output files:

1. Preprocessed Data:
   - `prepared_metabolomics.csv`
   - `prepared_expression.csv`
   - `prepared_metrics.txt`

2. Network Analysis:
   - `corto_network_{mode}.csv`: Network edges and statistics
   - `corto_regulon_{mode}.txt`: Regulon object with relationship details