Authors
Maxwell W Libbrecht, Jeffrey A Bilmes, William Stafford Noble
Publication date
2018/4
Journal
Proteins: Structure, Function, and Bioinformatics
Volume
86
Issue
4
Pages
454-466
Description
Selecting a non‐redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non‐redundant training sets for sequence and structural models or selection of “operational taxonomic units” from metagenomics data. Previous methods for this task, such as CD‐HIT, PISCES, and UCLUST, apply a heuristic threshold‐based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this …
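To make the idea of submodular representative selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes a facility-location objective, f(S) = sum over all items i of the maximum similarity between i and any selected item, which is one common submodular function for this kind of problem; the abstract above does not state which objective the paper uses. The function name `greedy_representatives` and the toy similarity matrix are hypothetical. For monotone submodular objectives such as this one, the simple greedy procedure is known to achieve at least a (1 - 1/e) fraction of the optimal value.

```python
def greedy_representatives(similarity, k):
    """Greedily pick up to k item indices that maximize the facility-location
    objective f(S) = sum_i max_{j in S} similarity[i][j].

    `similarity` is an n x n list of lists of non-negative pairwise scores
    (an assumed input format; real pipelines would derive it from alignments).
    """
    n = len(similarity)
    selected = []
    # best_cover[i] is the highest similarity between item i and the chosen set.
    best_cover = [0.0] * n

    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much each item's coverage improves.
            gain = sum(
                max(similarity[i][j] - best_cover[i], 0.0) for i in range(n)
            )
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j]) for i in range(n)]

    return selected


if __name__ == "__main__":
    # Hypothetical toy data: items 0-2 form one group, items 3-4 another.
    sim = [
        [1.0, 0.9, 0.8, 0.1, 0.0],
        [0.9, 1.0, 0.9, 0.1, 0.1],
        [0.8, 0.9, 1.0, 0.0, 0.1],
        [0.1, 0.1, 0.0, 1.0, 0.9],
        [0.0, 0.1, 0.1, 0.9, 1.0],
    ]
    print(greedy_representatives(sim, 2))
```

On the toy matrix, the first pick is the item most similar to everything else in the larger group, and the second pick comes from the other group, because a second item from the already-covered group adds almost nothing to the objective. Unlike a fixed identity threshold, the objective directly rewards covering items that are not yet represented.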
Total citations
[Chart: citations per year, 2018–2024]