Authors
Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, Ali Madani
Publication date
2022/9/29
Journal
Cell Systems
Publisher
Elsevier
Description
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a …
Total citations
Scholar articles
E Nijkamp, JA Ruffolo, EN Weinstein, N Naik, A Madani - Cell Systems, 2023