Updated README with specific details about which R script code was implemented.
88
README.md
@@ -14,29 +14,80 @@ The original corto algorithm was implemented in R for analyzing gene expression

## Code Translation Overview

### Original R Components:
### Detailed Code Translation Mapping

The project translates code from several R source files:
- `corto.R`: Core algorithm implementation
- `functions.R`: Utility functions and statistical analysis
- `mra.R`: Master Regulator Analysis functionality
- `gsea.R`: Gene Set Enrichment Analysis components
#### corto-data-prep-final.py

### Python Implementation:
This script primarily implements functionality from corto.R:

The functionality has been reorganized into two main Python scripts:
1. Data Loading and Validation
   - Initial data loading logic from `corto()` function
   - Input validation checks in `validate_ccle_format()`
   - Initial data preprocessing steps in `preprocess_ccle_data()`

1. `corto-data-prep-final.py`:
   - Data loading and validation
   - Preprocessing pipeline
   - CNV correction
   - Quality control metrics
2. Zero Variance Feature Handling
   - Translates zero variance removal logic:
   ```R
   # From corto.R
   if(sum(is.na(inmat))>0){
       stop("Input matrix contains NA fields")
   }
   allvars<-apply(inmat,1,var)
   keep<-names(allvars)[allvars>0]
   inmat<-inmat[keep,]
   ```

2. `corto-matrix-combination-final.py`:
   - Network analysis implementation
   - Correlation calculations
   - Bootstrap analysis
   - Results generation
3. CNV Correction
   - Implements CNV correction logic from corto.R:
   ```R
   if(!is.null(cnvmat)){
       commonrows<-intersect(rownames(cnvmat),rownames(inmat))
       commoncols<-intersect(colnames(cnvmat),colnames(inmat))
       cnvmat<-cnvmat[commonrows,commoncols]
       inmat<-inmat[commonrows,commoncols]
       # ... (rest of the CNV correction is not shown in this excerpt)
   }
   ```
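
For comparison with the two R excerpts from `corto.R` above, here is a minimal pandas sketch of the same zero-variance filter and CNV alignment steps. It is not taken from `corto-data-prep-final.py`: the function names `drop_zero_variance` and `align_to_cnv` are hypothetical, and the sketch assumes genes are rows and samples are columns, as in the R code.

```python
import pandas as pd


def drop_zero_variance(inmat: pd.DataFrame) -> pd.DataFrame:
    """Reject NA values, then keep only rows whose variance is above zero."""
    if inmat.isna().to_numpy().any():
        raise ValueError("Input matrix contains NA fields")
    # R's var() is the sample variance (n - 1 denominator); pandas uses the
    # same definition by default (ddof=1).
    allvars = inmat.var(axis=1)
    return inmat.loc[allvars > 0]


def align_to_cnv(inmat: pd.DataFrame, cnvmat: pd.DataFrame):
    """Restrict both matrices to their shared rows (genes) and columns (samples)."""
    commonrows = cnvmat.index.intersection(inmat.index)
    commoncols = cnvmat.columns.intersection(inmat.columns)
    return inmat.loc[commonrows, commoncols], cnvmat.loc[commonrows, commoncols]


# Toy data: "FLAT" has zero variance; sample "s1" has no CNV measurement.
expr = pd.DataFrame(
    {"s1": [1.0, 5.0, 2.0], "s2": [2.0, 5.0, 4.0], "s3": [3.0, 5.0, 9.0]},
    index=["TP53", "FLAT", "MYC"],
)
cnv = pd.DataFrame(
    {"s2": [0.1, -0.3], "s3": [0.0, 0.5], "s4": [0.2, 0.1]},
    index=["MYC", "TP53"],
)

filtered = drop_zero_variance(expr)
expr_aligned, cnv_aligned = align_to_cnv(filtered, cnv)
```

After these two calls, `expr_aligned` and `cnv_aligned` share the same row and column labels, which is the precondition for any gene-by-gene correction that follows.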

#### corto-matrix-combination-final.py

This script implements functionality from multiple R sources:

1. From functions.R:
   - Direct translation of `p2r()`:
   ```R
   p2r<-function(p,n){
       t<-qt(p/2,df=n-2,lower.tail=FALSE)
       r<-sqrt((t^2)/(n-2+t^2))
       return(r)
   }
   ```

2. From mra.R:
   - Correlation calculation logic from MRA functions
   - Bootstrap implementation approach

3. From gsea.R:
   - Statistical analysis approaches
   - Matrix manipulation techniques
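
To illustrate what a direct translation of the `p2r()` excerpt above might look like, here is a hedged Python sketch. It assumes SciPy is available and uses `scipy.stats.t.isf` as the counterpart of R's `qt(..., lower.tail=FALSE)`; the actual code in `corto-matrix-combination-final.py` may differ.

```python
import math

from scipy.stats import t as student_t


def p2r(p: float, n: int) -> float:
    """Convert a two-sided p-value into the matching correlation coefficient.

    p is the p-value and n the number of samples, as in the R function.
    """
    df = n - 2
    # Upper-tail quantile of the t distribution, i.e. R's
    # qt(p / 2, df = n - 2, lower.tail = FALSE).
    t_crit = student_t.isf(p / 2, df)
    # Invert t = r * sqrt(df / (1 - r^2)) to recover r from t.
    return math.sqrt(t_crit**2 / (df + t_crit**2))


# Smallest |r| that is significant at p = 0.05 with 10 samples.
threshold = p2r(0.05, 10)
```

With a fixed p-value, the threshold shrinks as `n` grows, so larger sample sets accept weaker correlations as significant.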

### Key Implementation Differences

1. Memory Management:
   - Added chunked processing for large matrices
   - Implemented parallel processing with ProcessPoolExecutor

2. Extended Functionality:
   - Added combined matrix mode
   - Improved logging system
   - Command line interface

3. Data Structure Updates:
   - Uses pandas DataFrames instead of R matrices
   - Optimized memory handling for large datasets

4. Additional Features:
   - More extensive error checking
   - Progress reporting
   - Configurable preprocessing options
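
The chunked and parallel processing mentioned under Memory Management can be sketched roughly as follows. This is an illustration only, not the project's code: the helper names (`chunk_bounds`, `row_variances`, `chunked_row_variances`), the `parallel` switch, and the choice of row variance as the per-chunk work are all assumptions, and plain lists stand in for DataFrames to keep the example dependency-free.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import variance
from typing import List, Optional, Tuple


def chunk_bounds(n_rows: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start, stop) index pairs that cover range(n_rows) in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


def row_variances(rows: List[List[float]]) -> List[float]:
    """Per-chunk work: sample variance of each row in the chunk."""
    return [variance(row) for row in rows]


def chunked_row_variances(
    matrix: List[List[float]],
    chunk_size: int = 1000,
    parallel: bool = True,
    max_workers: Optional[int] = None,
) -> List[float]:
    """Split the matrix into row chunks and process them one chunk at a time."""
    chunks = [matrix[a:b] for a, b in chunk_bounds(len(matrix), chunk_size)]
    if parallel:
        # Each chunk is sent to a worker process; map() keeps chunk order.
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(row_variances, chunks))
    else:
        # Serial path, useful for debugging or very small inputs.
        results = [row_variances(chunk) for chunk in chunks]
    return [value for chunk_result in results for value in chunk_result]
```

When calling `chunked_row_variances(..., parallel=True)` from a script, put the call under `if __name__ == "__main__":`, since worker processes may re-import the main module on platforms that do not fork.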

## Installation

@@ -105,3 +156,4 @@ The pipeline generates several output files:
2. Network Analysis:
   - `corto_network_{mode}.csv`: Network edges and statistics
   - `corto_regulon_{mode}.txt`: Regulon object with relationship details
