Data structures based on k-mers for querying large collections of sequencing data sets

C Marchet, C Boucher, SJ Puglisi, P Medvedev… - Genome …, 2021 - genome.cshlp.org
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …

The design and construction of reference pangenome graphs with minigraph

H Li, X Feng, C Chu - Genome biology, 2020 - Springer
The recent advances in sequencing technologies enable the assembly of individual
genomes to the quality of the reference genome. How to integrate multiple genomes from …

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

R Rizzi, S Beretta, M Patterson, Y Pirola, M Previtali… - Quantitative …, 2019 - Springer
Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and
overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn …

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

G Holley, P Melsted - Genome biology, 2020 - Springer
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based
assemblers reduce the complexity by compacting paths into single vertices, but this is …

Succinct de Bruijn graphs

A Bowe, T Onodera, K Sadakane, T Shibuya - International workshop on …, 2012 - Springer
We propose a new succinct de Bruijn graph representation. If the de Bruijn graph of k-mers
in a DNA sequence of length N has m edges, it can be represented in 4m + o(m) bits. This is …

Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT

A Cracco, AI Tomescu - Genome Research, 2023 - genome.cshlp.org
Compacted de Bruijn graphs are one of the most fundamental data structures in
computational genomics. Colored compacted de Bruijn graphs are a variant built on a …

MetaGraph: Indexing and analysing nucleotide archives at petabase-scale

M Karasikov, H Mustafa, D Danciu, C Barber… - BioRxiv, 2020 - biorxiv.org
The amount of biological sequencing data available in public repositories is growing
exponentially, forming an invaluable biomedical research resource. Yet, making all this …

Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets

I Martayan, B Cazaux, A Limasset, C Marchet - Bioinformatics, 2024 - academic.oup.com
In this article, we introduce the Conway–Bromage–Lyndon (CBL) structure, a compressed,
dynamic and exact method for representing k-mer sets. Originating from Conway and …

kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections

T Lemane, P Medvedev, R Chikhi… - Bioinformatics …, 2022 - academic.oup.com
When indexing large collections of short-read sequencing data, a common operation that
has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is …

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections

J Khan, R Patro - Bioinformatics, 2021 - academic.oup.com
Motivation: The construction of the compacted de Bruijn graph from collections of reference
genomes is a task of increasing interest in genomic analyses. These graphs are increasingly …