Assigning Species Information to Corresponding Genes by a Sequence Labeling Framework

Paper title

Assigning Species Information to Corresponding Genes by a Sequence Labeling Framework

Authors

Luo, Ling; Wei, Chih-Hsuan; Lai, Po-Ting; Chen, Qingyu; Doğan, Rezarta Islamaj; Lu, Zhiyong

Abstract

Automatically assigning species information to the corresponding genes in a research article is a critical step in gene normalization, the task in which a text-mining algorithm normalizes each gene mention and links it to a database record or identifier. Existing methods typically rely on heuristic rules based on gene and species co-occurrence in the article, but their accuracy is suboptimal. We therefore developed a high-performance method, using a novel deep learning-based framework, to classify whether there is a relation between a gene and a species. Instead of the traditional binary classification framework, in which all possible pairs of genes and species in the same article are evaluated, we treat the problem as a sequence-labeling task so that only a fraction of the pairs needs to be considered. Our benchmarking results show that our approach performs significantly better than the rule-based baseline method on the species assignment task (accuracy improves from 65.8% to 81.3%). The source code and data for species assignment are freely available at https://github.com/ncbi/SpeciesAssignment.
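
The difference between the two framings described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Mention` class, the function names, the tag names (`TARGET-SPECIES`, `GENE-YES`, `GENE-NO`), and the toy sentence are all assumptions made for illustration, and the tagging scheme used in the paper may differ. The sketch only shows why a pairwise classifier needs one instance per gene-species pair, while a sequence-labeling formulation can cover every gene in the text with one tagged sequence per species mention.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    """Hypothetical annotation record (not taken from the paper's code)."""
    start: int       # index of the first token of the mention
    end: int         # index one past the last token
    kind: str        # "gene" or "species"
    species_id: str  # species denoted by the mention, or gold species of the gene


def pairwise_instances(mentions):
    """Binary-classification framing: one instance per (gene, species) pair."""
    genes = [m for m in mentions if m.kind == "gene"]
    species = [m for m in mentions if m.kind == "species"]
    return [(g, s, g.species_id == s.species_id) for g, s in product(genes, species)]


def labeling_instances(tokens, mentions):
    """Sequence-labeling framing: one tagged copy of the text per species mention."""
    genes = [m for m in mentions if m.kind == "gene"]
    instances = []
    for s in (m for m in mentions if m.kind == "species"):
        tags = ["O"] * len(tokens)
        # Mark the species mention this sequence is about.
        for i in range(s.start, s.end):
            tags[i] = "TARGET-SPECIES"
        # Label every gene token by whether it belongs to the target species.
        for g in genes:
            label = "GENE-YES" if g.species_id == s.species_id else "GENE-NO"
            for i in range(g.start, g.end):
                tags[i] = label
        instances.append(tags)
    return instances


# Toy sentence, invented for illustration; 9606 and 10090 are the NCBI
# Taxonomy identifiers for human and mouse.
TOKENS = "Human BRCA1 and TP53 interact , unlike mouse Trp53 .".split()
MENTIONS = [
    Mention(0, 1, "species", "9606"),
    Mention(1, 2, "gene", "9606"),
    Mention(3, 4, "gene", "9606"),
    Mention(7, 8, "species", "10090"),
    Mention(8, 9, "gene", "10090"),
]

if __name__ == "__main__":
    print(len(pairwise_instances(MENTIONS)), "pairwise instances")
    for tags in labeling_instances(TOKENS, MENTIONS):
        print(list(zip(TOKENS, tags)))
```

In this toy case the pairwise framing produces six instances (three genes times two species), whereas the labeling framing produces two sequences, one per species mention. The gap widens in a full article, where the number of gene-species pairs grows with the product of the two mention counts.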
