About Project



This project focuses on improving the detection of pathogenic strains directly from metagenomic data, a crucial step in advancing infectious disease diagnostics and surveillance. Unlike traditional methods that rely on culturing or prior isolation, metagenomics enables the analysis of entire microbial communities, including rare or low-abundance pathogens. However, accurately identifying disease-causing organisms within these complex datasets remains challenging because of the genetic similarity among strains and the sheer diversity of microbial populations. This project addresses these limitations by improving strain-level resolution, sensitivity, and specificity, with the aim of reliable pathogen identification in real-world samples.

Highlights



TransVi: A Transformer-based Web Application for Pathogenic Virus Identification


TransVi is a web application built around a transformer-based classification model that predicts pathogenic viruses directly from metagenomic sequences. The model is first pretrained with DNABERT on genomic sequences to capture biological context and long-range dependencies. It is then fine-tuned on labelled viral data to enable high-accuracy identification of viruses such as SARS-CoV-1, MERS, SARS-CoV-2, Ebola, Dengue, and Influenza. TransVi illustrates how large language models can help resolve short-read ambiguities, and it offers a scalable approach to real-time pathogen surveillance and outbreak response.
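As a rough illustration of the input side of such a model, the original DNABERT represents a DNA sequence as overlapping k-mers before passing it to the transformer. The sketch below shows only that tokenization step in plain Python. The function name kmer_tokens, the example read, and the choice k=6 are assumptions made for illustration; the project page does not state which k or which DNABERT variant TransVi uses.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA read into overlapping k-mers with stride 1.

    Hypothetical helper for illustration; k=6 is an assumed value,
    not one confirmed for TransVi.
    """
    seq = sequence.upper()
    if len(seq) < k:
        # Reads shorter than k yield no tokens.
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Toy read, not real viral data.
read = "ATGCGTACGT"
tokens = kmer_tokens(read, k=6)
print(tokens)
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```

A read of length n produces n - k + 1 tokens, so each nucleotide appears in up to k neighbouring tokens, which is how local sequence context reaches the model.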

MetaTrans: A Transformer-Integrated Web Application to Improve the Detection of Pathogenic Strains from Metagenomic Data


MetaTrans integrates supervised and unsupervised techniques to improve strain-level pathogen detection from metagenomic data. It employs large language models and operates in three phases: initial classification (Model: CLM), clustering of unlabeled data (Model: CLT), and retraining with enriched annotations (Model: CLM*). This iterative pipeline enhances sensitivity and contextual accuracy, enabling precise identification of viruses such as SARS-CoV-1, MERS, and SARS-CoV-2. MetaTrans can support intelligent annotation for biomedical research and public health applications.
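The three phases can be sketched with deliberately simple stand-ins. In the toy example below, a nearest-centroid classifier on dinucleotide frequencies plays the role of CLM, a small k-means routine plays the role of CLT, and retraining on the original plus pseudo-labelled reads gives CLM*. None of this is the MetaTrans implementation, which uses large language models; every function name, label (virus_A, virus_B), and sequence here is hypothetical and chosen only to show how the phases connect.

```python
from collections import Counter
from itertools import product

# All 16 dinucleotides; a toy feature space standing in for LLM embeddings.
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def featurize(seq):
    """Dinucleotide frequency vector of a read."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(sum(counts[k] for k in KMERS), 1)
    return [counts[k] / total for k in KMERS]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_centroids(vectors, labels):
    """Stand-in classifier: one centroid per label."""
    groups = {}
    for v, y in zip(vectors, labels):
        groups.setdefault(y, []).append(v)
    return {y: mean(vs) for y, vs in groups.items()}

def predict(centroids, v):
    return min(centroids, key=lambda y: dist(centroids[y], v))

def kmeans(vectors, k, iters=10):
    """Minimal k-means; assumes len(vectors) >= k."""
    centers = [list(v) for v in vectors[:k]]  # deterministic init
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(centers[c], v))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = mean(members)
    return assign

def run_pipeline(labeled, unlabeled, k):
    X = [featurize(s) for s, _ in labeled]
    y = [lab for _, lab in labeled]
    clm = train_centroids(X, y)                    # phase 1: CLM
    U = [featurize(s) for s in unlabeled]
    clusters = kmeans(U, k)                        # phase 2: CLT
    cluster_label = {}
    for c in set(clusters):
        # Label each cluster by majority vote of CLM predictions.
        votes = Counter(predict(clm, v)
                        for v, a in zip(U, clusters) if a == c)
        cluster_label[c] = votes.most_common(1)[0][0]
    pseudo = [cluster_label[a] for a in clusters]
    clm_star = train_centroids(X + U, y + pseudo)  # phase 3: CLM*
    return clm, clm_star

# Toy reads: AT-rich versus GC-rich, not real viral sequences.
labeled = [("ATATATATAT", "virus_A"), ("GCGCGCGCGC", "virus_B")]
unlabeled = ["ATATTATATA", "GCGGCGCGCC", "TATATATAAT", "CGCGCGGCGC"]
clm, clm_star = run_pipeline(labeled, unlabeled, k=2)
print(predict(clm_star, featurize("ATTATATATA")))  # virus_A
```

The point of the design is that the retrained model sees more examples per class than the initial one, with the extra labels coming from cluster-level agreement and not from individual, possibly noisy, predictions.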

Publications



CODE



The algorithm is implemented in Python. The code and datasets are available at the following links.

The code, technique, and algorithm may be used free of charge for academic and non-commercial purposes. If you use them, please cite this work.

For any queries regarding the algorithms, please email indrajit@nitttrkol.ac.in

Disclaimer:
The dataset is taken from a public database. NITTTR, Kolkata therefore bears no responsibility for its accuracy.

MEMBERS



Kathleen Marchal

Lead Principal Investigator

Dr. Indrajit Saha

Lead Principal Investigator

Debi Prasad Mishra

Co-Principal Investigator

Jan Fostier

Co-Principal Investigator

Sigrid De Keersmaecker

Co-Principal Investigator

Priyasi Mallick

Junior Research Fellow