Updated README with specific details on which R source code each Python script implements.

2024-12-16 15:06:25 +00:00
parent 21d77e3faa
commit b408cfd4dd
1 changed file with 70 additions and 18 deletions
--- a/README.md
+++ b/README.md
@@ -14,29 +14,80 @@ The original corto algorithm was implemented in R for analyzing gene expression

 ## Code Translation Overview

-### Original R Components:
+### Detailed Code Translation Mapping

-The project translates code from several R source files:
- `corto.R`: Core algorithm implementation
- `functions.R`: Utility functions and statistical analysis
- `mra.R`: Master Regulator Analysis functionality
- `gsea.R`: Gene Set Enrichment Analysis components
+#### corto-data-prep-final.py

-### Python Implementation:
+This script primarily implements functionality from `corto.R`:

-The functionality has been reorganized into two main Python scripts:
+1. Data Loading and Validation
+- Initial data loading logic from the `corto()` function
+- Input validation checks in `validate_ccle_format()`
+- Initial data preprocessing steps in `preprocess_ccle_data()`

-1. `corto-data-prep-final.py`:
- Data loading and validation
- Preprocessing pipeline
- CNV correction
- Quality control metrics
+2. Zero-Variance Feature Handling
+- Translates the NA check and zero-variance removal logic:
+```R
+# From corto.R
+if(sum(is.na(inmat))>0){
+    stop("Input matrix contains NA fields")
+}
+allvars<-apply(inmat,1,var)
+keep<-names(allvars)[allvars>0]
+inmat<-inmat[keep,]
+```

-2. `corto-matrix-combination-final.py`:
- Network analysis implementation
- Correlation calculations
- Bootstrap analysis
- Results generation
+3. CNV Correction
+- Implements CNV correction logic from `corto.R` (excerpt; the `if` block continues in the original):
+```R
+if(!is.null(cnvmat)){
+    commonrows<-intersect(rownames(cnvmat),rownames(inmat))
+    commoncols<-intersect(colnames(cnvmat),colnames(inmat))
+    cnvmat<-cnvmat[commonrows,commoncols]
+    inmat<-inmat[commonrows,commoncols]
+```
+
+#### corto-matrix-combination-final.py
+
+This script implements functionality from multiple R sources:
+
+1. From functions.R:
+- Direct translation of `p2r()`:
+```R
+p2r<-function(p,n){
+    t<-qt(p/2,df=n-2,lower.tail=FALSE)
+    r<-sqrt((t^2)/(n-2+t^2))
+    return(r)
+}
+```
+
+2. From mra.R:
+- Correlation calculation logic from MRA functions
+- Bootstrap implementation approach
+
+3. From gsea.R:
+- Statistical analysis approaches
+- Matrix manipulation techniques
+
+### Key Implementation Differences
+
+1. Memory Management and Parallelism:
+- Added chunked processing for large matrices
+- Implemented parallel processing with ProcessPoolExecutor
+
+2. Extended Functionality:
+- Added combined matrix mode
+- Improved logging system
+- Added a command-line interface
+
+3. Data Structure Updates:
+- Uses pandas DataFrames instead of R matrices
+- Optimized memory handling for large datasets
+
+4. Additional Features:
+- More extensive error checking
+- Progress reporting
+- Configurable preprocessing options

 ## Installation

@@ -105,3 +156,4 @@ The pipeline generates several output files:
 2. Network Analysis:
 - `corto_network_{mode}.csv`: Network edges and statistics
 - `corto_regulon_{mode}.txt`: Regulon object with relationship details
+
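
The NA check and zero-variance filter quoted from `corto.R` in this change could be sketched in Python roughly as follows. This is a minimal, dependency-free illustration and not the code in `corto-data-prep-final.py`: the function name `drop_zero_variance_rows`, the dict-of-lists layout (the README says the real scripts use pandas DataFrames), and the sample gene names are all assumptions made for the example.

```python
import math
from statistics import variance


def drop_zero_variance_rows(matrix):
    """Keep only features whose values vary across samples.

    `matrix` is a hypothetical stand-in for the R `inmat`: a dict mapping
    a feature name to a list of floats, one value per sample (>= 2 samples).
    """
    # Mirror the R guard: refuse input that contains missing values.
    for values in matrix.values():
        if any(v is None or math.isnan(v) for v in values):
            raise ValueError("Input matrix contains NA fields")

    # statistics.variance is the sample variance (n - 1 denominator),
    # the same estimator as R's var().
    return {
        name: values
        for name, values in matrix.items()
        if variance(values) > 0
    }


# Made-up expression values for illustration only.
expr = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [5.0, 5.0, 5.0],  # constant across samples, so it is dropped
    "GENE_C": [0.5, 0.1, 0.9],
}
filtered = drop_zero_variance_rows(expr)
print(sorted(filtered))
```

With a pandas DataFrame, the same filter would presumably be a boolean mask on the row-wise variance, but that form is not shown in this commit.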