Updated README with specific details on which R script code each Python script implements.

2024-12-16 15:06:25 +00:00
parent 21d77e3faa
commit b408cfd4dd
1 changed file with 70 additions and 18 deletions
--- a/README.md
+++ b/README.md
@@ -14,29 +14,80 @@ The original corto algorithm was implemented in R for analyzing gene expression
 ## Code Translation Overview
-### Original R Components:
+### Detailed Code Translation Mapping
-The project translates code from several R source files:
+#### corto-data-prep-final.py
 - `corto.R`: Core algorithm implementation
 - `functions.R`: Utility functions and statistical analysis
 - `mra.R`: Master Regulator Analysis functionality
 - `gsea.R`: Gene Set Enrichment Analysis components
-### Python Implementation:
+This script primarily implements functionality from corto.R:
-The functionality has been reorganized into two main Python scripts:
+1. Data Loading and Validation
 - Initial data loading logic from `corto()` function
 - Input validation checks in `validate_ccle_format()`
 - Initial data preprocessing steps in `preprocess_ccle_data()`
-1. `corto-data-prep-final.py`:
+2. Zero Variance Feature Handling
- Data loading and validation
+- Translates zero variance removal logic:
- Preprocessing pipeline
+```R
- CNV correction
+# From corto.R
- Quality control metrics
+if(sum(is.na(inmat))>0){
    stop("Input matrix contains NA fields")
 }
 allvars<-apply(inmat,1,var)
 keep<-names(allvars)[allvars>0]
 inmat<-inmat[keep,]
 ```
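For comparison, the R snippet above could be expressed in pandas roughly as follows. This is a minimal sketch rather than the code in `corto-data-prep-final.py`: the function name `drop_zero_variance_rows` is hypothetical, and it assumes the expression matrix is a DataFrame with genes as rows and samples as columns.

```python
import pandas as pd


def drop_zero_variance_rows(inmat: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the NA check and variance filter from corto.R."""
    # Same guard as the R code: refuse matrices that contain missing values.
    if inmat.isna().to_numpy().any():
        raise ValueError("Input matrix contains NA fields")
    # Per-row sample variance; pandas uses ddof=1 by default, like R's var().
    row_vars = inmat.var(axis=1)
    # Keep only rows (genes) whose variance is strictly positive.
    return inmat.loc[row_vars > 0]


if __name__ == "__main__":
    demo = pd.DataFrame(
        {"s1": [1.0, 2.0], "s2": [1.0, 3.0], "s3": [1.0, 5.0]},
        index=["geneA", "geneB"],
    )
    print(drop_zero_variance_rows(demo))
```

Filtering with a boolean mask keeps the original row order, which matches the effect of `inmat[keep,]` in R.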
-2. `corto-matrix-combination-final.py`:
+3. CNV Correction
- Network analysis implementation
+- Implements CNV correction logic from corto.R:
- Correlation calculations
+```R
- Bootstrap analysis
+if(!is.null(cnvmat)){
- Results generation
+    commonrows<-intersect(rownames(cnvmat),rownames(inmat))
    commoncols<-intersect(colnames(cnvmat),colnames(inmat))
    cnvmat<-cnvmat[commonrows,commoncols]
    inmat<-inmat[commonrows,commoncols]
 ```
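The alignment step in the R excerpt above (restricting both matrices to their shared genes and samples) might look like this in pandas. This is an illustrative sketch only: `align_to_common` is a hypothetical name, both inputs are assumed to be DataFrames with unique row and column labels, and the correction that follows the alignment in corto.R is not shown here.

```python
from typing import Optional, Tuple

import pandas as pd


def align_to_common(
    inmat: pd.DataFrame, cnvmat: Optional[pd.DataFrame]
) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
    """Hypothetical helper: subset both matrices to shared rows and columns."""
    # Mirrors if(!is.null(cnvmat)): nothing to align without a CNV matrix.
    if cnvmat is None:
        return inmat, None
    in_rows, in_cols = set(inmat.index), set(inmat.columns)
    # Like R's intersect(), keep the order of the first argument (cnvmat).
    common_rows = [g for g in cnvmat.index if g in in_rows]
    common_cols = [s for s in cnvmat.columns if s in in_cols]
    return inmat.loc[common_rows, common_cols], cnvmat.loc[common_rows, common_cols]


if __name__ == "__main__":
    expr = pd.DataFrame(
        [[1.0, 2.0], [3.0, 4.0]], index=["g1", "g2"], columns=["a", "b"]
    )
    cnv = pd.DataFrame(
        [[0.1, 0.2], [0.3, 0.4]], index=["g2", "g3"], columns=["b", "c"]
    )
    aligned_expr, aligned_cnv = align_to_common(expr, cnv)
    print(aligned_expr)
    print(aligned_cnv)
```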
 #### corto-matrix-combination-final.py
 This script implements functionality from multiple R sources:
 1. From functions.R:
 - Direct translation of `p2r()`:
 ```R
 p2r<-function(p,n){
    t<-qt(p/2,df=n-2,lower.tail=FALSE)
    r<-sqrt((t^2)/(n-2+t^2))
    return(r)
 }
 ```
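A Python counterpart of `p2r()` can be sketched with SciPy, assuming SciPy is available in the environment (the README excerpt does not say which library the Python port uses). R's `qt(p/2, df, lower.tail=FALSE)` is the upper-tail quantile, which corresponds to `scipy.stats.t.isf`.

```python
import math

from scipy import stats


def p2r(p: float, n: int) -> float:
    """Convert a two-tailed p-value into the matching correlation threshold.

    p: two-tailed p-value; n: number of samples (must be greater than 2).
    """
    df = n - 2
    # Upper-tail t quantile, equivalent to qt(p/2, df=n-2, lower.tail=FALSE).
    t = stats.t.isf(p / 2, df)
    # Invert t = r * sqrt(df / (1 - r^2)) to recover r.
    return math.sqrt(t**2 / (df + t**2))


if __name__ == "__main__":
    print(p2r(0.05, 100))
```

The returned value is the smallest absolute correlation that reaches significance level `p` with `n` samples, so it can serve as an edge-filtering threshold.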
 2. From mra.R:
 - Correlation calculation logic from MRA functions
 - Bootstrap implementation approach
 3. From gsea.R:
 - Statistical analysis approaches
 - Matrix manipulation techniques
 ### Key Implementation Differences
 1. Memory Management:
 - Added chunked processing for large matrices
 - Implemented parallel processing with ProcessPoolExecutor
 2. Extended Functionality:
 - Added combined matrix mode
 - Improved logging system
 - Command line interface
 3. Data Structure Updates:
 - Uses pandas DataFrames instead of R matrices
 - Optimized memory handling for large datasets
 4. Additional Features:
 - More extensive error checking
 - Progress reporting
 - Configurable preprocessing options
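As a rough illustration of the chunked, parallel processing mentioned under Memory Management, the pattern could look like the sketch below. The names `iter_chunks`, `row_sums`, and `process_in_chunks` are hypothetical, and the per-chunk work is a placeholder; the actual scripts may split and process matrices differently.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterator, List, Optional, Sequence

Row = Sequence[float]


def iter_chunks(rows: Sequence[Row], chunk_size: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def row_sums(chunk: Sequence[Row]) -> List[float]:
    """Placeholder per-chunk computation (stands in for the real analysis)."""
    return [sum(row) for row in chunk]


def process_in_chunks(
    rows: Sequence[Row],
    worker: Callable[[Sequence[Row]], List[float]],
    chunk_size: int = 1000,
    executor: Optional[Executor] = None,
) -> List[float]:
    """Apply worker to each chunk in a pool and concatenate results in order."""
    owns_pool = executor is None
    pool = executor if executor is not None else ProcessPoolExecutor()
    try:
        results: List[float] = []
        # Executor.map returns results in the order the chunks were submitted.
        for part in pool.map(worker, iter_chunks(rows, chunk_size)):
            results.extend(part)
        return results
    finally:
        if owns_pool:
            pool.shutdown()


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    print(process_in_chunks(data, row_sums, chunk_size=2))
```

Only one chunk per worker needs to be held in memory at a time, which is the point of the approach. With `ProcessPoolExecutor`, the worker must be a module-level function so it can be pickled, and on platforms that spawn new processes the call should sit under an `if __name__ == "__main__":` guard as shown.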
 ## Installation
@@ -105,3 +156,4 @@ The pipeline generates several output files:
 2. Network Analysis:
 - `corto_network_{mode}.csv`: Network edges and statistics
 - `corto_regulon_{mode}.txt`: Regulon object with relationship details