Choosing non‐redundant representative subsets of protein sequence data sets using submodular optimization

MW Libbrecht, JA Bilmes… - … : Structure, Function, and …, 2018 - Wiley Online Library
Selecting a non‐redundant representative subset of sequences is a common step in many
bioinformatics workflows, such as the creation of non‐redundant training sets for sequence …

MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database

A Pandini, L Bonati, F Fraternali, J Kleinjung - Bioinformatics, 2007 - academic.oup.com
Motivation: The size of current protein databases is a challenge for many Bioinformatics
applications, both in terms of processing speed and information redundancy. It may be …

Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural …

ACW May - Protein engineering, 2001 - academic.oup.com
Hierarchical classification is probably the most popular approach to group related proteins.
However, there are a number of problems associated with its use for this purpose. One is …

Seq2MSA: A Language Model for Protein Sequence Diversification

P Sturmfels, R Rao, R Verkuil, Z Lin, O Kabeli… - Machine Learning in …, 2022 - mlsb.io
Diversification libraries of protein sequences that contain a similar set of structures over a
variety of sequences can help protein design pipelines by introducing flexibility into the …

Identifying functionally informative evolutionary sequence profiles

N Gil, A Fiser - Bioinformatics, 2018 - academic.oup.com
Motivation: Multiple sequence alignments (MSAs) can provide essential input to
many bioinformatics applications, including protein structure prediction and functional …

Selecting the right similarity‐scoring matrix

WR Pearson - Current protocols in bioinformatics, 2013 - Wiley Online Library
Protein sequence similarity searching programs like BLASTP, SSEARCH, and FASTA use
scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 …

Selecting the right protein‐scoring matrix

D Wheeler - Current Protocols in Bioinformatics, 2003 - Wiley Online Library
Every program for searching protein sequences against a database includes a choice of a
protein weight matrix, also called a scoring matrix. Weight matrices add sensitivity to the …

IIFS: An improved incremental feature selection method for protein sequence processing

C Meng, Y Yuan, H Zhao, Y Pei, Z Li - Computers in Biology and Medicine, 2023 - Elsevier
Motivation: Discrete features can be obtained from protein sequences using a feature
extraction method. These features are the basis of downstream processing of protein data …

Partitioning and correlating subgroup characteristics from Aligned Pattern Clusters

ESA Lee, FJ Whelan, DME Bowdish… - Bioinformatics, 2016 - academic.oup.com
Motivation: Evolutionarily conserved amino acids within proteins characterize functional or
structural regions. Conversely, less conserved amino acids within these regions are …

Rapid search for tertiary fragments reveals protein sequence–structure relationships

J Zhou, G Grigoryan - Protein Science, 2015 - Wiley Online Library
Finding backbone substructures from the Protein Data Bank that match an arbitrary query
structural motif, composed of multiple disjoint segments, is a problem of growing relevance …