Genome-wide analysis of Indian SARS-CoV-2 Genomes for the
Identification of Genetic Mutation and SNP


Indrajit Saha1,+,*, Nimisha Ghosh2,+, Debasree Maity3, Nikhil Sharma4, Jnanendra Prasad Sarkar5,6, Kaushik Mitra7


1Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
2Department of Computer Science and Information Technology, Institute of Technical Education and Research,
Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, India
3MCKV Institute of Engineering, Liluah, Howrah, India
4Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, India
5Larsen & Toubro Infotech, Pune, India
6Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
7Department of Community Medicine, Burdwan Medical College, Barddhaman, India
*Correspondence should be addressed to the team leader: indrajit@nitttrkol.ac.in
+These team members contributed equally to this work



ABSTRACT

The wave of COVID-19 is a major threat to the human population. At present, the world is going through different phases of lockdown in order to stop this wave of the pandemic, and India is no exception: we also started our lockdown on 23rd March, 2020. In the current situation, apart from social distancing, only a vaccine can be a proper solution for protecting the human population. Thus, it is important for all nations to perform genome-wide analysis in order to identify the genetic variation in Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) so that a proper vaccine can be designed. This fact motivated us to analyze 566 publicly available complete or near-complete Indian SARS-CoV-2 genomes to find mutation points in the form of substitutions, deletions and insertions. To this end, we performed multiple sequence alignment in the presence of the reference sequence from NCBI. After the alignment, a consensus sequence is built to analyze each genome in order to identify the mutation points. As a result, we found 933 substitutions, 2449 deletions and 2 insertions, a total of 3384 unique mutation points, in 566 genomes across 29.9K bp. These mutations are further classified into three groups: 100 clusters of mutations (mostly deletions), 1609 point mutations (substitutions, deletions and insertions) and 64 SNPs. These outcomes are visualized using BioCircos and bar plots, as well as by plotting the entropy value of each genomic location. Moreover, phylogenetic analysis has also been performed to study the evolution of the SARS-CoV-2 virus in India. The resulting tree also shows wide variation, which is indeed evident in the genomic analysis. Finally, these SNPs can be useful targets for virus classification and for designing and defining the effective dose of a vaccine for the heterogeneous population.
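
The mutation-calling and entropy steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code. It assumes the genomes have already been aligned by an external multiple sequence alignment tool, that gaps are written as "-", and that the baseline sequence passed in is the aligned reference (the consensus could be passed instead). The function names call_mutations and column_entropy are hypothetical, and the use of base-2 Shannon entropy with gaps counted as symbols is an assumption, since the abstract does not specify these details.

```python
import math
from collections import Counter


def call_mutations(aligned_ref, aligned_seq):
    """Compare one aligned genome with the aligned baseline, column by column.

    Returns a list of (position, kind, baseline_symbol, genome_symbol) tuples.
    position is the 1-based coordinate in the ungapped baseline; an insertion
    is reported at the position of the baseline base that precedes it.
    Assumption: ambiguous symbols such as N are not treated specially here.
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")
    mutations = []
    ref_pos = 0
    for ref_base, base in zip(aligned_ref, aligned_seq):
        if ref_base != "-":
            ref_pos += 1  # advance only on real baseline bases
        if ref_base == base:
            continue  # identical column (including gap against gap)
        if ref_base == "-":
            kind = "insertion"
        elif base == "-":
            kind = "deletion"
        else:
            kind = "substitution"
        mutations.append((ref_pos, kind, ref_base, base))
    return mutations


def column_entropy(column):
    """Shannon entropy (base 2) of the symbols in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy alignment: one substitution, one insertion and one deletion.
example = call_mutations("ATG-CA", "ACGT-A")
```

A column in which every genome carries the same base has entropy 0, while a column split evenly between two bases has entropy 1, so plotting column_entropy for every alignment column highlights the variable genomic locations.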

Supplementary


dataset


code


The code/technique/algorithm is developed in MATLAB. The code is available in zipped form here. Use of the code/technique/algorithm is free as long as it is for academic and non-commercial purposes. If you use this code/technique/algorithm, please cite this work.

For any queries regarding the code/technique/algorithm, please email indrajit@nitttrkol.ac.in

Disclaimer:
The virus genomes are collected from public databases such as GISAID. Thus, NITTTR, Kolkata does not bear any responsibility for them.