Genome-wide Analysis of Indian SARS-CoV-2 Genomes to Identify T-cell and
B-cell Epitopes from Conserved Regions based on Immunogenicity and Antigenicity


Nimisha Ghosh1,+, Nikhil Sharma2,+, Indrajit Saha3,+,*, Sudipto Saha4


1Department of Computer Science and Information Technology, Institute of Technical Education and Research,
Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, India
2Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, India
3Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
4Division of Bioinformatics, Bose Institute, Kolkata, West Bengal, India
*Correspondence should be addressed to the team leader: indrajit@nitttrkol.ac.in
+These team members contributed equally to this work



ABSTRACT

The world has come to a sudden halt due to COVID-19, the contagious disease caused by the SARS-CoV-2 virus. The novel coronavirus has a high transmission rate and mutates frequently, which makes vaccine development an arduous task. Nevertheless, researchers around the globe are working hard to find a solution such as a synthetic vaccine. In India, the number of infected cases is rising at an alarming rate, so quick and reliable vaccine design is a major focus of the scientific community. Here, we have performed a genome-wide analysis of 566 Indian SARS-CoV-2 genomes to extract potential conserved regions for identifying peptide-based synthetic vaccines, viz. epitopes with high immunogenicity and antigenicity. To this end, we have aligned the SARS-CoV-2 genomes separately with four multiple sequence alignment techniques, viz. ClustalW, MUSCLE, ClustalO and MAFFT. Conserved regions are found in the result of each alignment technique, and consensus conserved regions are then identified from them. The consensus conserved regions are further refined by requiring that their lengths are greater than or equal to 60 nt and that their corresponding proteins are devoid of stop codons. Their specificity is then verified using Nucleotide BLAST. Finally, T-cell and B-cell epitopes are identified from these consensus conserved regions based on their immunogenic and antigenic scores, and these scores are used to rank the conserved regions. As a result, we have ranked 23 consensus conserved regions that are associated with Leader protein, NSP2, NSP3, NSP4, 3CL-Proteinase, NSP10, RNA-directed RNA polymerase, Helicase, Spike glycoprotein and Nucleocapsid protein. This ranking also resulted in 34 MHC-I and 37 MHC-II restricted T-cell epitopes with 16 and 19 unique HLA alleles, respectively, and 29 B-cell epitopes for the 23 consensus conserved regions.
After ranking, we obtained a consensus conserved region from the NSP3 gene that is highly immunogenic and antigenic. For the MHC-I restricted T-cell epitope, the MHC-II restricted T-cell epitope and the B-cell epitope, it provides FLKKDAPYI, ITFLKKDAPYIVGDV and TLVSDIDITFLKKDAP, respectively, as the most immunogenic, and TAVVIPTKK, IDITFLKKDAPYIVG and LHPDSATLVSDIDITF, respectively, as the most antigenic. To judge the relevance of these epitopes, the physico-chemical properties and binding conformations of the MHC-I and MHC-II restricted T-cell epitopes are shown with respect to HLA alleles.
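
The refinement step described in the abstract (consensus conserved regions of at least 60 nt whose translation contains no stop codon) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB implementation. It assumes that a column is "conserved" when every sequence carries the same ungapped nucleotide, that every alignment tool lists the sequences in the same order, that positions are reported in the coordinates of the first (ungapped) sequence, and that stop codons are checked in reading frame 0 of each region. All function names are hypothetical.

```python
from typing import List, Set, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_LENGTH_NT = 60  # length threshold stated in the abstract


def conserved_positions(aligned: List[str]) -> Set[int]:
    """Positions of the ungapped first sequence whose column is identical and gap-free."""
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    positions: Set[int] = set()
    ref_pos = 0  # index into the ungapped first sequence
    for col in range(width):
        column = {seq[col] for seq in aligned}
        if len(column) == 1 and aligned[0][col] in "ACGT":
            positions.add(ref_pos)
        if aligned[0][col] != "-":
            ref_pos += 1
    return positions


def contiguous_runs(positions: Set[int]) -> List[Tuple[int, int]]:
    """Group positions into half-open (start, end) runs of consecutive indices."""
    runs: List[Tuple[int, int]] = []
    for p in sorted(positions):
        if runs and runs[-1][1] == p:
            runs[-1] = (runs[-1][0], p + 1)
        else:
            runs.append((p, p + 1))
    return runs


def has_stop_codon(region: str, frame: int = 0) -> bool:
    """True if any complete codon in the given frame is a stop codon."""
    return any(region[i:i + 3] in STOP_CODONS
               for i in range(frame, len(region) - 2, 3))


def consensus_conserved_regions(
    alignments: List[List[str]], min_len: int = MIN_LENGTH_NT
) -> List[Tuple[int, int, str]]:
    """Regions conserved in every alignment, at least min_len nt long, with no stop codon.

    `alignments` holds one list of aligned sequences per alignment tool.
    """
    reference = alignments[0][0].replace("-", "")
    # Assumption: "consensus" means conserved in the output of every tool.
    shared = set.intersection(*(conserved_positions(a) for a in alignments))
    regions = []
    for start, end in contiguous_runs(shared):
        region = reference[start:end]
        if len(region) >= min_len and not has_stop_codon(region):
            regions.append((start, end, region))
    return regions
```

In practice the aligned sequences would be read from the output files of ClustalW, MUSCLE, ClustalO and MAFFT; the sketch takes them as plain strings so that it stays self-contained. The BLAST specificity check and the epitope scoring are outside its scope.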

Supplementary


Dataset


Code


The algorithm is implemented in MATLAB. The code is available in zipped form here. The code/technique/algorithm is free to use for academic and non-commercial purposes. If you use this code/technique/algorithm, please cite this work.

For any queries regarding the algorithms, please email indrajit@nitttrkol.ac.in

Disclaimer:
The dataset used to conduct this research was obtained from public databases such as GISAID. Thus, NITTTR, Kolkata does not bear any responsibility for it.