Navigating bottlenecks and trade-offs in genomic data analysis

B Berger, YW Yu - Nature Reviews Genetics, 2023 - nature.com
Genome sequencing and analysis allow researchers to decode the functional information
hidden in DNA sequences as well as to study cell-to-cell variation within a cell population …

Sequence Alignment/Map format: a comprehensive review of approaches and applications

Y Liu, X Shen, Y Gong, Y Liu, B Song… - Briefings in …, 2023 - academic.oup.com
Abstract: The Sequence Alignment/Map (SAM) format file is the text file used to record
alignment information. Alignment is the core of sequencing analysis, and downstream tasks …

Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space

D Kempa, T Kociumaka - 2023 IEEE 64th Annual Symposium …, 2023 - ieeexplore.ieee.org
The last two decades have witnessed a dramatic increase in the amount of highly repetitive
datasets consisting of sequential data (strings, texts). Processing these massive amounts of …

Representation of k-Mer Sets Using Spectrum-Preserving String Sets

A Rahman, P Medvedev - Journal of Computational Biology, 2021 - liebertpub.com
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to
represent a set of k-mers is important for improving the scalability of bioinformatics analyses …

Efficient DNA sequence compression with neural networks

M Silva, D Pratas, AJ Pinho - GigaScience, 2020 - academic.oup.com
Background: The increasing production of genomic data has led to an intensified need for
models that can cope efficiently with the lossless compression of DNA sequences. Important …

Enhancing metagenomic classification with compression-based features

JM Silva, JR Almeida - Artificial Intelligence in Medicine, 2024 - Elsevier
Metagenomics is a rapidly expanding field that uses next-generation sequencing technology
to analyze the genetic makeup of environmental samples. However, accurately identifying …

Disk compression of k-mer sets

A Rahman, R Chikhi, P Medvedev - Algorithms for Molecular Biology, 2021 - Springer
K-mer based methods have become prevalent in many areas of bioinformatics. In
applications such as database search, they often work with large multi-terabyte-sized …

A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level

D Pratas, M Toppinen, L Pyöriä, K Hedman… - …, 2020 - academic.oup.com
Background: Advances in sequencing technologies have enabled the characterization of
multiple microbial and host genomes, opening new frontiers of knowledge while kindling …

Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

K Kryukov, MT Ueda, S Nakagawa, T Imanishi - GigaScience, 2020 - academic.oup.com
Background: Nearly all molecular sequence databases currently use gzip for data
compression. Ongoing rapid accumulation of stored data calls for a more efficient …

How compression and approximation affect efficiency in string distance measures

A Ganesh, T Kociumaka, A Lincoln, B Saha - … of the 2022 Annual ACM-SIAM …, 2022 - SIAM
Real-world data often comes in compressed form. Analyzing compressed data directly
(without first decompressing it) can save space and time by orders of magnitude. In this …