Mappability and read length

W Li, J Freudenberg - Frontiers in genetics, 2014 - frontiersin.org
Power-law distributions are the main functional form for the distribution of repeat size and
repeat copy number in the human genome. When the genome is broken into fragments for …

The completed genome sequence of Pestalotiopsis versicolor, a pathogenic ascomycete fungus with implications for bayberry production

J Guo, H Ren, M Ijaz, X Qi, T Ahmed, Y You, G Li, Z Yu… - Genomics, 2023 - Elsevier
The pathogenic fungus Pestalotiopsis versicolor is a major etiological agent of fungal twig
blight disease affecting bayberry trees. However, the lack of complete genome sequence …

Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree

MO Kulekci, JS Vitter, B Xu - IEEE/ACM Transactions on …, 2011 - ieeexplore.ieee.org
Finding repetitive structures in genomes and proteins is important to understand their
biological functions. Many data compressors for modern genomic sequences rely heavily on …

Parallel motif extraction from very long sequences

M Sahli, E Mansour, P Kalnis - Proceedings of the 22nd ACM …, 2013 - dl.acm.org
Motifs are frequent patterns used to identify biological functionality in genomic sequences,
periodicity in time series, or user trends in web logs. In contrast to a lot of existing work that …

ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

M Sahli, E Mansour, P Kalnis - The VLDB Journal, 2014 - Springer
Modern applications, including bioinformatics, time series, and web log analysis, require the
extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) …

Characterizing regions in the human genome unmappable by next-generation-sequencing at the read length of 1000 bases

W Li, J Freudenberg - Computational biology and chemistry, 2014 - Elsevier
Repetitive and redundant regions of a genome are particularly problematic for mapping
sequencing reads. In the present paper, we compile a list of the unmappable regions in the …

R-enum: Enumeration of characteristic substrings in BWT-runs bounded space

T Nishimoto, Y Tabei - arXiv preprint arXiv:2004.01493, 2020 - arxiv.org
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and
minimal absent words) in a given string has been an important research topic because there …

Space-efficient computation of maximal and supermaximal repeats in genome sequences

T Beller, K Berger, E Ohlebusch - … de Indias, Colombia, October 21-25 …, 2012 - Springer
The identification of repetitive sequences (repeats) is an essential component of genome
sequence analysis, and the notions of maximal and supermaximal repeats capture all exact …

Hardness Results on Characteristics for Elastic-Degenerated Strings

D Köppl, J Olbrich - arXiv preprint arXiv:2411.10653, 2024 - arxiv.org
Generalizations of plain strings have been proposed as a compact way to represent a
collection of nearly identical sequences or to express uncertainty at specific text positions by …

A genomic distance for assembly comparison based on compressed maximal exact matches

SP Garcia, JMOS Rodrigues, S Santos… - IEEE/ACM …, 2013 - ieeexplore.ieee.org
Genome assemblies are typically compared with respect to their contiguity, coverage, and
accuracy. We propose a genome-wide, alignment-free genomic distance based on …