diff --git a/README.md b/README.md
index e404812..014817a 100644
--- a/README.md
+++ b/README.md
@@ -14,29 +14,80 @@ The original corto algorithm was implemented in R for analyzing gene expression
 
 ## Code Translation Overview
 
-### Original R Components:
+### Detailed Code Translation Mapping
 
-The project translates code from several R source files:
-- `corto.R`: Core algorithm implementation
-- `functions.R`: Utility functions and statistical analysis
-- `mra.R`: Master Regulator Analysis functionality
-- `gsea.R`: Gene Set Enrichment Analysis components
+#### corto-data-prep-final.py
 
-### Python Implementation:
+This script primarily implements functionality from corto.R:
 
-The functionality has been reorganized into two main Python scripts:
+1. Data Loading and Validation
+- Initial data loading logic from `corto()` function
+- Input validation checks in `validate_ccle_format()`
+- Initial data preprocessing steps in `preprocess_ccle_data()`
 
-1. `corto-data-prep-final.py`:
-- Data loading and validation
-- Preprocessing pipeline
-- CNV correction
-- Quality control metrics
+2. Zero Variance Feature Handling
+- Translates zero variance removal logic:
+```R
+# From corto.R
+if(sum(is.na(inmat))>0){
+    stop("Input matrix contains NA fields")
+}
+allvars<-apply(inmat,1,var)
+keep<-names(allvars)[allvars>0]
+inmat<-inmat[keep,]
+```
 
-2. `corto-matrix-combination-final.py`:
-- Network analysis implementation
-- Correlation calculations
-- Bootstrap analysis
-- Results generation
+3. CNV Correction
+- Implements CNV correction logic from corto.R:
+```R
+if(!is.null(cnvmat)){
+    commonrows<-intersect(rownames(cnvmat),rownames(inmat))
+    commoncols<-intersect(colnames(cnvmat),colnames(inmat))
+    cnvmat<-cnvmat[commonrows,commoncols]
+    inmat<-inmat[commonrows,commoncols]
+```
+
+#### corto-matrix-combination-final.py
+
+This script implements functionality from multiple R sources:
+
+1. From functions.R:
+- Direct translation of `p2r()`:
+```R
+p2r<-function(p,n){
+    t<-qt(p/2,df=n-2,lower.tail=FALSE)
+    r<-sqrt((t^2)/(n-2+t^2))
+    return(r)
+}
+```
+
+2. From mra.R:
+- Correlation calculation logic from MRA functions
+- Bootstrap implementation approach
+
+3. From gsea.R:
+- Statistical analysis approaches
+- Matrix manipulation techniques
+
+### Key Implementation Differences
+
+1. Memory Management:
+- Added chunked processing for large matrices
+- Implemented parallel processing with ProcessPoolExecutor
+
+2. Extended Functionality:
+- Added combined matrix mode
+- Improved logging system
+- Command line interface
+
+3. Data Structure Updates:
+- Uses pandas DataFrames instead of R matrices
+- Optimized memory handling for large datasets
+
+4. Additional Features:
+- More extensive error checking
+- Progress reporting
+- Configurable preprocessing options
 
 ## Installation
 
@@ -105,3 +156,4 @@ The pipeline generates several output files:
 2. Network Analysis:
 - `corto_network_{mode}.csv`: Network edges and statistics
 - `corto_regulon_{mode}.txt`: Regulon object with relationship details
+
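
For reference, the two `corto.R` excerpts quoted in this diff (the NA check with zero-variance removal, and the row/column intersection with the CNV matrix) could be expressed in Python roughly as follows. This is a dependency-free sketch, not the code in `corto-data-prep-final.py`: the README says the scripts use pandas DataFrames, whereas here a matrix is assumed to be a plain `dict` mapping gene name to a `dict` of sample name to value, and the function names `drop_zero_variance` and `align_to_cnv` are hypothetical.

```python
import math
from statistics import variance


def drop_zero_variance(matrix):
    """Reject NA values, then keep only genes whose sample variance is > 0.

    `matrix` is an assumed layout: {gene: {sample: value}}.
    Mirrors the R excerpt: stop() on NA, then apply(inmat, 1, var) > 0.
    """
    for row in matrix.values():
        for value in row.values():
            if value is None or math.isnan(value):
                raise ValueError("Input matrix contains NA fields")
    # statistics.variance is the sample variance (n - 1), like R's var().
    return {
        gene: row
        for gene, row in matrix.items()
        if variance(list(row.values())) > 0
    }


def align_to_cnv(expr, cnv):
    """Restrict both matrices to the genes and samples they share.

    Order follows the CNV matrix, as R's intersect() keeps the order
    of its first argument.
    """
    common_genes = [gene for gene in cnv if gene in expr]
    if not common_genes:
        return {}, {}
    # Assumption: every row of a matrix has the same set of samples.
    expr_samples = next(iter(expr.values()))
    cnv_samples = next(iter(cnv.values()))
    common_samples = [s for s in cnv_samples if s in expr_samples]

    def subset(matrix):
        return {
            gene: {s: matrix[gene][s] for s in common_samples}
            for gene in common_genes
        }

    return subset(expr), subset(cnv)
```

Raising on NA before computing variances keeps the same failure order as the R excerpt, so a malformed input is reported rather than silently filtered.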