Paper Title

Structural representations of DNA regulatory substrates can enhance sequence-based algorithms by associating functional sequence variants

Paper Author

Zrimec, Jan

Paper Abstract

The nucleotide sequence representation of DNA can be inadequate for resolving protein-DNA binding sites and regulatory substrates, such as those involved in gene expression and horizontal gene transfer. Considering that sequence-like representations are algorithmically very useful, here we fused over 60 currently available DNA physicochemical and conformational variables into compact structural representations that can encode single DNA binding sites to whole regulatory regions. We find that the main structural components reflect key properties of protein-DNA interactions and can be condensed to the amount of information found in a single nucleotide position. The most accurate structural representations compress functional DNA sequence variants by 30% to 50%, as each instance encodes from tens to thousands of sequences. We show that a structural distance function discriminates among groups of DNA substrates more accurately than nucleotide sequence-based metrics. As this opens up a variety of implementation possibilities, we develop and test a distance-based alignment algorithm, demonstrating the potential of using the structural representations to enhance sequence-based algorithms. Due to the bias of most current bioinformatic methods to nucleotide sequence representations, it is possible that considerable performance increases might still be achievable with such solutions.
