COVID-DeepPredictor: Recurrent Neural Network to Predict
SARS-CoV-2 and Other Pathogenic Viruses


Indrajit Saha1,+,*, Nimisha Ghosh2,+, Debasree Maity3, Arijit Seal4, Dariusz Plewczynski5,6


1Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
2Department of Computer Science and Information Technology, Institute of Technical Education and Research,
Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
3MCKV Institute of Engineering, Liluah, Howrah, India
4Cognizant Technology Solutions Pvt. Ltd., Kolkata, India
5Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Poland
6Faculty of Mathematics and Information Science, Warsaw Technical University, Poland
*Correspondence should be addressed to the team leader: indrajit@nitttrkol.ac.in
+These team members contributed equally to this work



ABSTRACT

COVID-19, the disease caused by the novel coronavirus (SARS-CoV-2), has become a global pandemic. The high transmission rate of this pathogenic virus demands early prediction and proper identification for subsequent treatment. However, the polymorphic nature of the virus allows it to adapt and survive in different environments, which makes it difficult to predict. Moreover, there are other pathogens such as SARS-CoV-1, MERS-CoV, Ebola, Dengue and Influenza, so a predictor that can distinguish among them using their genomic information is needed. To address this problem, this work proposes COVID-DeepPredictor, a deep learning framework for identifying an unknown sequence of these pathogens. COVID-DeepPredictor uses Long Short-Term Memory (LSTM), a type of Recurrent Neural Network, for the underlying prediction with an alignment-free technique. The k-mer technique is applied to create Bag-of-Descriptors (BoDs), from which Bag-of-Unique-Descriptors (BoUDs) are generated as the vocabulary; an embedded representation is then prepared for the given virus sequences. The predictor is validated not only on the dataset using K-fold cross-validation but also on unseen test datasets of SARS-CoV-2 sequences and sequences from other viruses. To verify its efficacy, COVID-DeepPredictor has been compared with other state-of-the-art prediction techniques based on Linear Discriminant Analysis, Random Forests and Gradient Boosting Method. COVID-DeepPredictor achieves 100% prediction accuracy on the validation dataset, while on the test datasets the accuracy ranges from 99.51% to 99.94%. It also shows superior results over the other prediction techniques.
In addition, three further analyses have been performed: the accuracy and runtime of COVID-DeepPredictor are considered simultaneously to determine the value of k in k-mer; a comparative study is carried out among k values in k-mer, BoDs and BoUDs; and COVID-DeepPredictor is compared with Nucleotide BLAST.
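
The k-mer, BoD and BoUD steps described in the abstract can be illustrated with a minimal sketch. It is written in Python purely for illustration (the released software is in MATLAB) and is not the authors' implementation. The function names, the use of overlapping k-mers, the index 0 reserved for padding, and the fixed encoding length are all assumptions made for this example.

```python
def kmers(sequence, k):
    """Return the overlapping k-mers of a sequence, in order (assumed BoD)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(bags):
    """Map each unique descriptor to an integer index (assumed BoUD).

    Indices start at 1; 0 is reserved here for padding (an assumption).
    """
    unique = sorted({descriptor for bag in bags for descriptor in bag})
    return {descriptor: i for i, descriptor in enumerate(unique, start=1)}


def encode(bag, vocab, length):
    """Turn a bag of descriptors into a fixed-length list of integer indices.

    Descriptors missing from the vocabulary are mapped to 0, and short
    sequences are right-padded with 0 (both choices are assumptions).
    """
    ids = [vocab.get(descriptor, 0) for descriptor in bag][:length]
    return ids + [0] * (length - len(ids))


# Toy sequences, not real virus genomes.
sequences = ["ATGCATGC", "ATGCC"]
k = 3

bods = [kmers(s, k) for s in sequences]          # Bag-of-Descriptors per sequence
bouds = build_vocabulary(bods)                   # vocabulary of unique descriptors
encoded = [encode(b, bouds, length=6) for b in bods]

print(bouds)
print(encoded)
```

Integer sequences of this kind are the usual input to an embedding layer placed in front of an LSTM; how the actual predictor builds its embedded representation is not specified on this page.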

Supplementary


Datasets

Training Data       Test Data

Software


The software is developed in MATLAB and is available in zipped form here. The code/technique/algorithm is free to use for academic and non-commercial purposes. If you use it, please cite this work.

For any queries regarding the code/technique/algorithm, please email indrajit@nitttrkol.ac.in

Disclaimer:
The virus genomes used to develop COVID-DeepPredictor were collected from public databases such as NCBI and GISAID. NITTTR, Kolkata therefore does not bear any responsibility for its prediction accuracy.