Motivation: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants.
Results: We propose here an original method, called M ind T he G ap , for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. M ind T he G ap uses an efficient k -mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads. M ind T he G ap showed high recall and precision on simulated datasets of various genome complexities. When applied to real Caenorhabditis elegans and human NA12878 datasets, M ind T he G ap detected and correctly assembled insertions >1 kb, using at most 14 GB of memory.
Availability and implementation: http://mindthegap.genouest.org
Contact: guillaume.rizk@inria.fr or claire.lemaitre@inria.fr