Characterisation of SARS-CoV-2 Clades based on Signature SNPs
unveils Continuous Evolution


Nimisha Ghosh1,+, Indrajit Saha2,+,*, Nikhil Sharma3,+, Suman Nandi3,+

1Department of Computer Science and Information Technology, Institute of Technical Education and Research,
Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, India
2Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
3Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
*Correspondence should be addressed to the team leader: indrajit@nitttrkol.ac.in
+These team members contributed equally to this work



ABSTRACT

Since the emergence of SARS-CoV-2 in Wuhan, China almost a year ago, it has spread across the world in a very short span of time. Although different forms of vaccines are being rolled out for vaccination programmes around the globe, the mutation of the virus is still a cause of concern among the research communities. Hence, it is important to study the constantly evolving virus and its strains in order to provide a much more stable form of cure. This motivated us to perform phylogenetic analyses of 15359 global (excluding India) and 3033 Indian SARS-CoV-2 genomes to identify clade-specific signature Single Nucleotide Polymorphisms (SNPs). For this purpose we use Nextstrain, which aligns the SARS-CoV-2 genomes with MAFFT; the mutation points in the aligned genomes are then identified as SNPs. Using these SNPs, the virus strains are found to be distributed among 5 major clades or clusters, viz. 19A, 19B, 20A, 20B and 20C. Thereafter, the top 10 signature SNPs of each clade are identified based on their frequency. As a result, 50 such signature SNPs are identified separately for the global (excluding India) genomes and for the Indian genomes. Of each set of 50 signature SNPs, 39 (global excluding India) and 41 (India) are unique. Among the 39, 25 non-synonymous signature SNPs result in 30 amino acid changes in protein, while among the 41, 22 non-synonymous signature SNPs result in 27 amino acid changes. These 30 and 27 amino acid changes are also visualised in their respective protein structures. Finally, the sequence and structural homology-based prediction, along with the protein structural stability, of the amino acid changes for the non-synonymous signature SNPs are evaluated using PROVEAN, PolyPhen 2.0 and I-Mutant 2.0 in order to judge the characteristics of the identified clades.
As a consequence, for the global genomes excluding India, G251V in ORF3a in clade 19A, and F308Y and G196V in NSP4 and ORF3a, respectively, in clade 19B are the unique amino acid changes responsible for defining each clade, as they are all deleterious and unstable. The changes common to both the global (excluding India) and the Indian genomes are R203M in Nucleocapsid for 20B, and T85I and Q57H in NSP2 and ORF3a, respectively, for 20C. For India, the unique changes are A97V in RdRp, G339S and G339C in NSP2 in 19A, and Q57H in ORF3a in 20A.
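The frequency-based selection of signature SNPs described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which is available on request): it assumes each genome has already been aligned and assigned to a clade, and that its SNPs are available as plain labels. The function names `signature_snps` and `unique_signature_snps`, the input layout and the SNP labels in the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def signature_snps(records, top_k=10):
    """Return the top_k most frequent SNPs of each clade.

    records: iterable of (clade, snps) pairs, one per genome, where snps
    is an iterable of SNP labels (assumed input layout, for illustration).
    """
    counts = defaultdict(Counter)
    for clade, snps in records:
        # Count each SNP at most once per genome.
        counts[clade].update(set(snps))
    signatures = {}
    for clade, counter in counts.items():
        # Sort by descending frequency; break ties by label so the
        # result is deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        signatures[clade] = [snp for snp, _ in ranked[:top_k]]
    return signatures


def unique_signature_snps(signatures):
    """Return the set of distinct SNPs across all clade signatures."""
    return set().union(*signatures.values())


if __name__ == "__main__":
    # Toy data with made-up SNP labels; not results from this work.
    toy = [
        ("20A", ["A100G", "C200T", "G300A"]),
        ("20A", ["A100G", "C200T"]),
        ("19B", ["T400C", "C200T"]),
        ("19B", ["T400C"]),
    ]
    sig = signature_snps(toy, top_k=2)
    print(sig)
    print(len(unique_signature_snps(sig)))
```

Because the same SNP can rank among the most frequent in more than one clade, the number of distinct signature SNPs can be smaller than the number of clades times `top_k`, which is how 5 clades with 10 signature SNPs each can yield fewer than 50 unique SNPs.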

Evolution of SARS-CoV-2


      

Clade wise Evolution of Global 15359 SARS-CoV-2 Genomes

      

Clade wise Transmission of Global 15359 SARS-CoV-2 Genomes

      

Clade wise Evolution of Indian 3033 SARS-CoV-2 Genomes

Supplementary


datasets


code


The algorithm is implemented in MATLAB and Python. The code is available on request. The code/technique/algorithm is free to use for academic and non-commercial purposes. If you use this code/technique/algorithm, please cite this work.

For any query regarding the algorithms, please email indrajit@nitttrkol.ac.in

Disclaimer:
The dataset is taken from public databases such as GISAID. Thus, NITTTR, Kolkata does not bear any responsibility for its accuracy.