Machine Learning Integrated Ensemble of Feature Selection Methods
followed by Cox Regression Survival Analysis for Predicting
Breast Cancer Subtype specific miRNA Biomarkers


Jnanendra Prasad Sarkar1,3,+, Indrajit Saha2,+,*, Anasua Sarkar3 and Ujjwal Maulik3


1Larsen & Toubro Infotech Ltd., Pune, India
2Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
3Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
*Correspondence should be addressed to indrajit@nitttrkol.ac.in
+These authors contributed equally to this work



ABSTRACT

Breast cancer is the second leading cancer type in the female population. In this regard, microRNAs (miRNAs) have been found to play an important role by regulating gene expression at the post-transcriptional phase. However, identifying the most influential miRNAs in breast cancer subtypes is a challenging task, although recent advances in Next Generation Sequencing (NGS) techniques allow the analysis of high-throughput miRNA expression data. We have therefore conducted this research using NGS data of breast cancer in order to identify the most significant miRNA biomarkers that are highly associated with multiple breast cancer subtypes. For this purpose, a two-phase technique, called Machine Learning Integrated Ensemble of Feature Selection Methods followed by semi-parametric Cox regression survival analysis, is proposed. In the first phase, we select the best machine learning technique among seven techniques based on classification accuracy using the entire set of features (in this case, miRNAs). Subsequently, eight different feature selection methods are used separately to rank the features, and each set of top features is validated using the selected machine learning technique on a multi-class classification task of breast cancer subtypes. In the second phase, based on the classification accuracy, the top features from each feature selection method are combined into an ensemble that further categorizes the miRNAs as 8*, 7*, and so on down to 1*. The 8* miRNAs provide the highest average classification accuracy, 86%, after 10-fold cross-validation. Thereafter, 27 miRNAs are identified from the 8* to 4* miRNAs based on their importance in survival for breast cancer subtypes using Cox regression analysis.
Moreover, survival analysis, expression analysis, regulatory network analysis, protein-protein interaction analysis, and KEGG pathway and gene ontology enrichment analysis are performed in order to validate biological significance. Additionally, we have prepared a miRNA-protein-drug interaction network to identify possible drugs for the selected miRNAs. Thus, our findings may be considered during clinical trials for the treatment of breast cancer patients.
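
The star categorization described in the abstract can be illustrated with a small sketch. This is not the authors' MATLAB implementation: it is a Python illustration that assumes an n* miRNA is one that appears in the top-feature sets of n feature selection methods, which is one plausible reading of the 8* to 1* labels. The method names, the miRNA labels, and the helper functions `star_ratings` and `group_by_stars` are hypothetical, and only three methods are used instead of eight to keep the example short.

```python
from collections import Counter, defaultdict


def star_ratings(top_sets):
    """Map each feature to the number of methods whose top set contains it.

    Assumption: a feature's star rating equals the number of feature
    selection methods that selected it.
    """
    counts = Counter()
    for features in top_sets.values():
        counts.update(set(features))  # each method votes at most once per feature
    return dict(counts)


def group_by_stars(ratings):
    """Group features by star rating, highest rating first."""
    groups = defaultdict(list)
    for feature, stars in ratings.items():
        groups[stars].append(feature)
    return {stars: sorted(names)
            for stars, names in sorted(groups.items(), reverse=True)}


# Hypothetical toy input: made-up method names and miRNA labels.
top_sets = {
    "method_a": ["miR-A", "miR-B", "miR-C"],
    "method_b": ["miR-A", "miR-C", "miR-D"],
    "method_c": ["miR-A", "miR-B", "miR-E"],
}

ratings = star_ratings(top_sets)
groups = group_by_stars(ratings)
print(groups)
```

In this toy input, miR-A is selected by all three methods and so receives the highest rating. With eight methods, the same counting would yield the categories 8* down to 1*.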

Supplementary


datasets


code


The algorithm is implemented in MATLAB. The code is available in zipped form here. Use of the algorithm is free as long as it is for academic and non-commercial purposes. If you use these algorithms, please cite the following reference:

J. P. Sarkar, I. Saha, A. Sarkar and U. Maulik, "Machine Learning Integrated Ensemble of Feature Selection Methods for Predicting Breast Cancer Subtype specific miRNA Biomarkers", submitted to Computers in Biology and Medicine (2020).

For any queries regarding the algorithms, please email indrajit@nitttrkol.ac.in