Integrated Analysis of Breast Cancer Subtypes

Deep Learning for Integrated Analysis of Breast Cancer Subtype Specific Multi-omics Data

Somnath Rakshit¹,⁺, Subha Sankar Chakraborty²,⁺, Indrajit Saha²,⁺,* and Dariusz Plewczynski³,⁴

¹Department of Computer Science and Engineering, Jalpaiguri Government Engineering College, Jalpaiguri, India
²Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
³Centre of New Technologies, University of Warsaw, Poland
⁴Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
*Correspondence should be addressed to indrajit@nitttrkol.ac.in
⁺These authors contributed equally to this work

ABSTRACT

Breast cancer is a deadly disease that occurs all over the world and is the most common cancer in females. Its detection remains a major challenge from both computational and biological points of view. Next Generation Sequencing (NGS) techniques have accelerated the mapping of human genomes. Advanced NGS techniques reveal that multiple genetic molecules are responsible for breast cancer and its subtypes. However, the large volume of data produced by NGS techniques is difficult to study because of its high dimensionality and complexity. Thus, the integrated study of multi-omics data is one of the major challenges in medical science. This motivated us to study NGS-based high-throughput expression data of miRNAs and mRNAs, together with the Beta values of DNA methylation of the corresponding mRNAs. First, these datasets, which together comprise 33564 features for 305 patients in five classes, viz. Luminal A, Luminal B, HER2-enriched, Basal-like and Control, are analysed in an integrated fashion using a deep learning technique to classify the breast cancer subtypes. Second, the results of the deep learning technique are analysed further to identify the deeply connected features, i.e. miRNA, mRNA or DNA methylation features, that are pivotal in the classification of breast cancer subtypes and play a crucial role in their formation. For this purpose, a deep learning technique called the stacked autoencoder is used to encode the features into a low-dimensional space, which is then fed to five well-known classifiers. Moreover, the same encoded data is used to select potential features: it is multiplied with the original data, and Bonferroni correction is applied to the p-values produced by a one-sample t-test.
The results have been validated quantitatively and through biological significance analysis, in which the oncogene TP53 and the tumor suppressor gene BRCA1 were identified. These genes are known to play a crucial role in breast cancer.

Code

The method is implemented in Python. The code is available in zipped form here. The algorithms are free to use for academic and non-commercial purposes. If you use these algorithms, please cite the following reference:

S. Rakshit, S. S. Chakraborty, I. Saha and D. Plewczynski, "Deep Learning for Integrated Analysis of Breast Cancer Subtype Specific Multi-omics Data", in Proc. of IEEE Region 10 Conference TENCON, Jeju, South Korea, pp. 1911-1916, October 2018.

For any queries regarding the algorithms, please email indrajit@nitttrkol.ac.in
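
For readers who want a feel for the pipeline summarised in the abstract before downloading the archive, the following is a minimal sketch and not the authors' released code. It makes several assumptions that are not in the original: the data are synthetic (150 samples, 60 features) rather than the 305-patient, 33564-feature dataset; the autoencoder is approximated with scikit-learn's MLPRegressor trained end-to-end to reconstruct its input, with layer sizes chosen arbitrarily; logistic regression stands in for the five classifiers, which the abstract does not name; and "multiplication with the original data" is read as the matrix product of the transposed encoding with the original matrix, followed by a per-feature one-sample t-test against zero.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the integrated multi-omics matrix (assumption:
# sizes are kept small so the sketch runs in a few seconds).
n_samples, n_features, n_classes = 150, 60, 5
y = rng.integers(0, n_classes, size=n_samples)
X = rng.normal(size=(n_samples, n_features))
X[:, :5] += y[:, None]  # make the first five features class-dependent
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Autoencoder stand-in: an MLP trained to reconstruct its own input.
# The middle hidden layer (8 units) is the low-dimensional encoding.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(32, 8, 32), activation="relu",
    max_iter=2000, random_state=0,
)
autoencoder.fit(X_train, X_train)


def encode(model, data, n_encoder_layers=2):
    """Run only the encoder half of the fitted MLP and return the bottleneck."""
    hidden = data
    layers = zip(model.coefs_[:n_encoder_layers],
                 model.intercepts_[:n_encoder_layers])
    for weights, bias in layers:
        hidden = np.maximum(hidden @ weights + bias, 0.0)  # ReLU activation
    return hidden


# Step 1: classify the encoded samples (logistic regression is an assumption).
clf = LogisticRegression(max_iter=1000)
clf.fit(encode(autoencoder, X_train), y_train)
accuracy = clf.score(encode(autoencoder, X_test), y_test)

# Step 2: feature selection (assumed interpretation of the abstract).
Z = encode(autoencoder, X)   # shape: (n_samples, 8)
scores = Z.T @ X             # shape: (8, n_features)
# One-sample t-test per original feature, across the encoded dimensions.
t_stat, p_values = stats.ttest_1samp(scores, popmean=0.0, axis=0)
# Bonferroni correction: scale each p-value by the number of tests, cap at 1.
p_bonferroni = np.minimum(p_values * n_features, 1.0)
selected = np.flatnonzero(p_bonferroni < 0.05)

print("held-out accuracy on encoded data:", accuracy)
print("indices of selected features:", selected)
```

The autoencoder is fitted on the training split only, so the held-out accuracy is not influenced by the test samples. On real data the layer sizes, the classifiers and the significance threshold would all need to follow the paper and the released code.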

Supplementary

datasets

code