Paper Title
Biological Sequence Design with GFlowNets
Paper Authors
Paper Abstract
The design of de novo biological sequences with desired properties, such as protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages, with increasing levels of precision and cost of evaluation, where candidates are filtered. This makes the diversity of proposed candidates a key consideration in the ideation phase. In this work, we propose an active learning algorithm that leverages epistemic uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions, with the objective of obtaining a diverse batch of useful (as defined by some utility function, for example, the predicted anti-microbial activity of a peptide) and informative candidates after each round. We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets. We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches of high-scoring candidates than existing approaches.
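The loop described in the abstract can be illustrated with a toy sketch. This is not the paper's method: the generator below is a stand-in that samples from a random candidate pool with probability proportional to a power of the reward (the distribution a trained GFlowNet aims to approximate over the whole sequence space), and the bootstrap ensemble, the upper-confidence-bound acquisition, the `oracle` function, and every name and constant are assumptions made purely for illustration.

```python
import random
import statistics

ALPHABET = "ACGT"
SEQ_LEN = 8
HIDDEN_MOTIF = "ACGTACGT"  # hypothetical ground truth, unknown to the learner


def oracle(seq):
    """Toy stand-in for a wet-lab assay: fraction of positions matching the motif."""
    return sum(a == b for a, b in zip(seq, HIDDEN_MOTIF)) / SEQ_LEN


def random_seq(rng):
    return "".join(rng.choice(ALPHABET) for _ in range(SEQ_LEN))


def fit_member(data, rng):
    """Fit one ensemble member on a bootstrap resample of the labeled data.

    The model is a table of mean scores per (position, letter) pair.
    """
    sample = [rng.choice(data) for _ in data]
    sums, counts = {}, {}
    for seq, y in sample:
        for i, ch in enumerate(seq):
            sums[(i, ch)] = sums.get((i, ch), 0.0) + y
            counts[(i, ch)] = counts.get((i, ch), 0) + 1
    prior = statistics.fmean(y for _, y in sample)
    table = {key: sums[key] / counts[key] for key in sums}
    return table, prior


def predict(member, seq):
    table, prior = member
    return statistics.fmean(table.get((i, ch), prior) for i, ch in enumerate(seq))


def acquisition(ensemble, seq, kappa=1.0):
    """Upper confidence bound: ensemble mean plus kappa times ensemble spread."""
    preds = [predict(m, seq) for m in ensemble]
    return statistics.fmean(preds) + kappa * statistics.pstdev(preds)


def sample_batch(ensemble, seen, rng, batch_size, pool_size=500, beta=8.0):
    """Stand-in generator: draw distinct, unlabeled candidates from a random
    pool with probability proportional to acquisition ** beta."""
    pool = sorted({random_seq(rng) for _ in range(pool_size)} - seen)
    weights = [max(acquisition(ensemble, s), 1e-6) ** beta for s in pool]
    batch = []
    while len(batch) < batch_size and pool:
        idx = rng.choices(range(len(pool)), weights=weights)[0]
        batch.append(pool.pop(idx))
        weights.pop(idx)
    return batch


def active_learning(rounds=5, batch_size=16, n_members=5, seed=0):
    rng = random.Random(seed)
    # Start from a small labeled dataset, as if it already existed.
    data = [(s, oracle(s)) for s in (random_seq(rng) for _ in range(32))]
    history = []
    for _ in range(rounds):
        ensemble = [fit_member(data, rng) for _ in range(n_members)]
        seen = {s for s, _ in data}
        batch = sample_batch(ensemble, seen, rng, batch_size)
        data.extend((s, oracle(s)) for s in batch)  # "evaluate" the batch
        history.append(statistics.fmean(oracle(s) for s in batch))
    return data, history


if __name__ == "__main__":
    data, history = active_learning()
    print("mean oracle score per round:", [round(h, 3) for h in history])
```

Sampling in proportion to reward, rather than taking the top-scoring candidates, is what keeps each batch spread across several promising regions; the exponent `beta` (an assumed knob) trades off that diversity against score.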