Research in the group is focused on developing computational methods and tools for variant calling in human genomes using high-throughput DNA sequencing technologies. Although human genome sequencing has advanced rapidly, sequencing with short Illumina reads provides an incomplete picture of human genetic variation, and certain types of variants remain difficult to identify. We focus on challenging variant types such as haplotypes and variants in repetitive regions, and we work with both short-read (Illumina) and long-read (Pacific Biosciences and Oxford Nanopore) sequencing technologies. Our goal is to develop computational tools that can be used to discover new disease-associated variants and to analyze human genetic variation. Two major topics of current research are:

1. Methods for accurate variant calling in repetitive regions of the genome: We are developing methods for detecting variants in repetitive regions (segmental duplications) of the human genome using whole-genome Illumina sequencing as well as single-molecule long-read sequencing technologies. This includes both copy number variants and small variants such as SNPs. These regions cover more than 5% of the genome and overlap more than 150 disease-associated genes such as SMN1 (spinal muscular atrophy), STRC (hearing loss) and PMS2 (Lynch syndrome). Our group has recently developed Parascopy, a robust and accurate method for estimating the copy number of low-copy repeats in the human genome. This method was presented at the ISMB 2021 conference [slides PDF]. Our long-term goal is to enable the routine analysis of such repeats using whole-genome sequence data and to identify novel disease-associated variants in both rare and complex diseases.

2. Haplotype-based variant calling using long-read sequencing technologies: We have developed a number of computational tools (e.g. HapCUT2) for reconstructing haplotypes in individual genomes that work with diverse sequencing technologies, including linked-reads and long-read sequencing. Long-read sequencing technologies have the potential to overcome some of the key limitations of short-read sequencing, particularly in long repetitive regions of the human genome, but require the development of new algorithms. We have previously developed computational methods for variant calling (Longshot, Nature Communications 2019) and read mapping in segmental duplications (Duplomap, Nucleic Acids Research 2020) using long-read sequencing technologies. Our goal is to enable accurate variant calling using long-read sequencing and haplotype-based probabilistic models, particularly in repetitive regions of the genome.

Selected publications:

Prodanov T, Bansal V. Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing. Nature Communications, 2022.
Edge P, Bansal V. Longshot: accurate variant calling in diploid genomes using single-molecule long read sequencing. Nature Communications, October 2019.
Bakhtiari M, Shleizer-Burko S, Gymrek M, Bansal V, Bafna V. Targeted Genotyping of Variable Number Tandem Repeats with adVNTR. Genome Research, October 2018.