Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art

SC Manekar, SR Sathe - Current genomics, 2019 - ingentaconnect.com

Background

In bioinformatics, estimating k-mer abundance histograms, or simply counting the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. These applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, and measuring sequencing error rates. Several methods for cardinality estimation in sequencing data have been developed in recent years.
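
To make the quantities above concrete, the following is a minimal sketch of exact k-mer counting on a toy input. It is not how any of the evaluated tools work (they use streaming or sampling estimators precisely to avoid storing every k-mer); it only illustrates what an abundance histogram, the number of distinct k-mers, and the number of singletons are. The function name `kmer_histogram` and the example reads are hypothetical, and the sketch assumes plain strand-specific counting, whereas real tools may collapse a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_histogram(sequences, k):
    """Return the exact k-mer abundance histogram for an iterable of DNA strings.

    The result maps an abundance c to the number of distinct k-mers that
    occur exactly c times. Illustrative only: memory grows with the number
    of distinct k-mers, which is what streaming estimators avoid.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return Counter(counts.values())


# Hypothetical toy reads, not data from the article.
reads = ["ACGTACGT", "TTGCA"]
hist = kmer_histogram(reads, k=3)

distinct_kmers = sum(hist.values())   # number of unique (distinct) k-mers
singletons = hist.get(1, 0)           # number of k-mers seen exactly once
total_kmers = sum(c * n for c, n in hist.items())

print(dict(hist), distinct_kmers, singletons, total_kmers)
```

On these two reads, ACG and CGT each occur twice and five other 3-mers occur once, so the histogram has two entries at abundance 2 and five at abundance 1.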

Objective

In this article, we present a comparative assessment of several k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to evaluate their relative merits and demerits.

Methods

Principally, we analyze the miscounts and error rates of these tools experimentally over a range of k values. We also present experimental results on the runtime, scalability to larger datasets, memory usage, CPU utilization, and parallelism of the k-mer frequency estimation methods.

Results

The results indicate that ntCard is more accurate than the other methods in estimating F0 (the number of distinct k-mers), f1 (the number of singletons), and full k-mer abundance histograms. ntCard is also the fastest, but it requires more memory than KmerGenie.

Conclusion

The results of this evaluation may serve as a roadmap for potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, helping them identify an appropriate method. Such an analysis of the results also helps researchers discover remaining open research questions, effective combinations of existing techniques, and possible avenues for future research.
