Updated README with specific details on which R script code was implemented.

2024-12-16 15:06:25 +00:00
parent 21d77e3faa
commit b408cfd4dd

@@ -14,29 +14,80 @@ The original corto algorithm was implemented in R for analyzing gene expression
 ## Code Translation Overview
-### Original R Components:
-The project translates code from several R source files:
-- `corto.R`: Core algorithm implementation
-- `functions.R`: Utility functions and statistical analysis
-- `mra.R`: Master Regulator Analysis functionality
-- `gsea.R`: Gene Set Enrichment Analysis components
-### Python Implementation:
-The functionality has been reorganized into two main Python scripts:
-1. `corto-data-prep-final.py`:
-- Data loading and validation
-- Preprocessing pipeline
-- CNV correction
-- Quality control metrics
-2. `corto-matrix-combination-final.py`:
-- Network analysis implementation
-- Correlation calculations
-- Bootstrap analysis
-- Results generation
+### Detailed Code Translation Mapping
+#### corto-data-prep-final.py
+This script primarily implements functionality from corto.R:
+1. Data Loading and Validation
+- Initial data loading logic from `corto()` function
+- Input validation checks in `validate_ccle_format()`
+- Initial data preprocessing steps in `preprocess_ccle_data()`
+2. Zero Variance Feature Handling
+- Translates zero variance removal logic:
+```R
+# From corto.R
+if(sum(is.na(inmat))>0){
+stop("Input matrix contains NA fields")
+}
+allvars<-apply(inmat,1,var)
+keep<-names(allvars)[allvars>0]
+inmat<-inmat[keep,]
+```
+3. CNV Correction
+- Implements CNV correction logic from corto.R:
+```R
+if(!is.null(cnvmat)){
+commonrows<-intersect(rownames(cnvmat),rownames(inmat))
+commoncols<-intersect(colnames(cnvmat),colnames(inmat))
+cnvmat<-cnvmat[commonrows,commoncols]
+inmat<-inmat[commonrows,commoncols]
+```
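For orientation, the two R excerpts above (the NA check with zero-variance removal, and the row/column alignment against the CNV matrix) could be expressed in pandas roughly as follows. This is a minimal sketch, not the code from `corto-data-prep-final.py`: the function names `drop_zero_variance` and `align_to_cnv` are hypothetical, the toy data is invented, and it assumes genes are rows and samples are columns, as in the R matrices.

```python
import pandas as pd


def drop_zero_variance(inmat: pd.DataFrame) -> pd.DataFrame:
    """Reject NA values, then keep only rows (genes) with variance > 0."""
    if inmat.isna().to_numpy().any():
        raise ValueError("Input matrix contains NA fields")
    # R's var() is the sample variance (n - 1 denominator); pandas uses the
    # same denominator by default (ddof=1).
    allvars = inmat.var(axis=1)
    return inmat.loc[allvars > 0]


def align_to_cnv(inmat: pd.DataFrame, cnvmat: pd.DataFrame):
    """Restrict both matrices to the genes and samples they share."""
    common_rows = inmat.index.intersection(cnvmat.index)
    common_cols = inmat.columns.intersection(cnvmat.columns)
    return inmat.loc[common_rows, common_cols], cnvmat.loc[common_rows, common_cols]


# Toy data, invented for illustration only.
expr = pd.DataFrame(
    {"s1": [1.0, 5.0, 2.0], "s2": [2.0, 5.0, 4.0], "s3": [3.0, 5.0, 9.0]},
    index=["geneA", "geneB", "geneC"],
)
cnv = pd.DataFrame(
    {"s2": [0.1, 0.2], "s3": [0.0, -0.3], "s4": [0.5, 0.4]},
    index=["geneC", "geneA"],
)

filtered = drop_zero_variance(expr)  # geneB is constant, so it is dropped
expr_common, cnv_common = align_to_cnv(filtered, cnv)
```

Selecting both frames with the same `common_rows` and `common_cols` keeps their rows and columns in the same order, which is what the R code achieves by indexing both matrices with `commonrows` and `commoncols`.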
+#### corto-matrix-combination-final.py
+This script implements functionality from multiple R sources:
+1. From functions.R:
+- Direct translation of `p2r()`:
+```R
+p2r<-function(p,n){
+t<-qt(p/2,df=n-2,lower.tail=FALSE)
+r<-sqrt((t^2)/(n-2+t^2))
+return(r)
+}
+```
+2. From mra.R:
+- Correlation calculation logic from MRA functions
+- Bootstrap implementation approach
+3. From gsea.R:
+- Statistical analysis approaches
+- Matrix manipulation techniques
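The `p2r()` function quoted above converts a two-sided p-value into the smallest absolute Pearson correlation that reaches that significance with `n` samples. A minimal Python sketch of the same formula is shown below, assuming SciPy is available; it is not taken from `corto-matrix-combination-final.py`, and `pvalue_from_r` is a hypothetical helper added only to check the round trip.

```python
import math

from scipy import stats


def p2r(p: float, n: int) -> float:
    """Smallest |Pearson r| significant at two-sided p-value p with n samples."""
    df = n - 2
    # Upper-tail quantile, equivalent to R's qt(p / 2, df, lower.tail = FALSE).
    t = stats.t.isf(p / 2, df)
    return math.sqrt(t**2 / (df + t**2))


def pvalue_from_r(r: float, n: int) -> float:
    """Hypothetical inverse of p2r, used here only as a sanity check."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r**2))
    return 2 * stats.t.sf(t, df)


threshold = p2r(0.05, 30)
recovered_p = pvalue_from_r(threshold, 30)  # should be close to 0.05
```

The formula follows from the t statistic of a correlation, t = r * sqrt((n - 2) / (1 - r^2)), solved for r.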
+### Key Implementation Differences
+1. Memory Management:
+- Added chunked processing for large matrices
+- Implemented parallel processing with ProcessPoolExecutor
+2. Extended Functionality:
+- Added combined matrix mode
+- Improved logging system
+- Command line interface
+3. Data Structure Updates:
+- Uses pandas DataFrames instead of R matrices
+- Optimized memory handling for large datasets
+4. Additional Features:
+- More extensive error checking
+- Progress reporting
+- Configurable preprocessing options
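The chunked processing mentioned under Memory Management can be illustrated with a small NumPy sketch. It shows only the chunking idea, under the assumptions that features are rows and that zero-variance rows were already removed; the names `iter_row_chunks` and `chunked_correlation` are hypothetical, and the parallel dispatch via `ProcessPoolExecutor` is left out.

```python
import numpy as np


def iter_row_chunks(n_rows: int, chunk_size: int):
    """Yield (start, stop) index pairs that cover range(n_rows)."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def chunked_correlation(x: np.ndarray, chunk_size: int = 2):
    """Yield (start, stop, block); block is Pearson r of rows start:stop vs all rows."""
    # Standardize each row once (zero mean, unit norm). The dot product of two
    # standardized rows is then their Pearson correlation.
    centered = x - x.mean(axis=1, keepdims=True)
    z = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    for start, stop in iter_row_chunks(x.shape[0], chunk_size):
        yield start, stop, z[start:stop] @ z.T


rng = np.random.default_rng(0)
data = rng.normal(size=(5, 20))  # 5 features x 20 samples, random toy data
blocks = [block for _, _, block in chunked_correlation(data, chunk_size=2)]
full = np.vstack(blocks)  # reassembled here only to compare with np.corrcoef
```

Because each block depends only on its own row range, the blocks could in principle be written to disk one at a time or handed to separate worker processes, instead of holding the full correlation matrix in memory.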
 ## Installation
@@ -105,3 +156,4 @@ The pipeline generates several output files:
 2. Network Analysis:
 - `corto_network_{mode}.csv`: Network edges and statistics
 - `corto_regulon_{mode}.txt`: Regulon object with relationship details