Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-Time Retrieval

Paper title

Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

Authors

Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado, Aidan Gomez, Debora S. Marks, Yarin Gal

Abstract

The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
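
The abstract does not say how the autoregressive predictions and the retrieved homologous sequences are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a mutant is scored by its log-likelihood ratio against the wild type, and that at each position the model's log-probability is interpolated with the log of the amino-acid frequencies observed in an alignment of homologs. The function names, the interpolation weight `alpha`, the pseudocount, and the uniform stand-in for a trained model are all hypothetical. The sketch handles equal-length substitutions only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_autoregressive_logprobs(prefix):
    """Hypothetical stand-in for a trained autoregressive model.

    A real model would condition on `prefix`; this placeholder returns a
    uniform distribution over the 20 amino acids.
    """
    return {aa: math.log(1.0 / len(AMINO_ACIDS)) for aa in AMINO_ACIDS}


def position_logprobs_from_alignment(aligned_seqs, pos, pseudocount=1.0):
    """Log-frequencies of amino acids at one column of the retrieved alignment.

    The pseudocount (an assumption) keeps unseen amino acids from getting
    probability zero.
    """
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for seq in aligned_seqs:
        if seq[pos] in counts:  # skip gaps and non-standard symbols
            counts[seq[pos]] += 1.0
    total = sum(counts.values())
    return {aa: math.log(c / total) for aa, c in counts.items()}


def sequence_log_likelihood(seq, aligned_seqs, alpha=0.6):
    """Sum over positions of the interpolated log-probability of each residue."""
    total = 0.0
    for pos, aa in enumerate(seq):
        model_lp = uniform_autoregressive_logprobs(seq[:pos])[aa]
        retrieved_lp = position_logprobs_from_alignment(aligned_seqs, pos)[aa]
        # Assumed combination rule: weighted average in log space.
        total += (1.0 - alpha) * model_lp + alpha * retrieved_lp
    return total


def fitness_score(mutant, wild_type, aligned_seqs, alpha=0.6):
    """Log-likelihood ratio of mutant versus wild type (higher means fitter)."""
    return (sequence_log_likelihood(mutant, aligned_seqs, alpha)
            - sequence_log_likelihood(wild_type, aligned_seqs, alpha))


if __name__ == "__main__":
    homologs = ["MKV", "MKV", "MRV"]  # toy alignment, same length as the query
    print(fitness_score("MRV", "MKV", homologs))  # substitution seen in homologs
    print(fitness_score("MWV", "MKV", homologs))  # substitution never observed
```

With the uniform placeholder the model term cancels in the ratio, so the toy scores depend only on the alignment frequencies; substituting a trained model would make both terms contribute.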
