Paper Title


Large Scale Subject Category Classification of Scholarly Papers with Deep Attentive Neural Networks

Paper Authors

Bharath Kandimalla, Shaurya Rohatgi, Jian Wu, C. Lee Giles

Paper Abstract


Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category information can be used for building faceted search for digital library search engines. This can significantly assist users in narrowing down their search space of relevant documents. Unfortunately, many academic papers do not have such information as part of their metadata. Existing methods for solving this task usually focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using 9 million abstracts from Web of Science (WoS). We also use the WoS schema, which covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 of 0.76, with per-category F1 ranging from 0.50 to 0.95. The results show the importance of retraining word embedding models to maximize vocabulary overlap, as well as the effectiveness of the attention mechanism. The combination of word vectors with TF-IDF outperforms character- and sentence-level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
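The architecture described above — bi-directional recurrent layers whose hidden states are pooled by an attention layer before a softmax over the 104 WoS categories — can be sketched numerically. The NumPy snippet below illustrates only the additive-attention pooling and classification steps; the dimensions, weight initializations, and the specific attention scoring function are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions (not from the paper): a 50-token abstract,
# 2 x 64 concatenated hidden units from a bi-directional RNN, 104 categories.
seq_len, hidden, n_classes = 50, 128, 104

# Stand-in for the BiRNN hidden states over one abstract (one row per token).
H = rng.standard_normal((seq_len, hidden))

# Additive attention: score each time step, normalize into weights,
# then build a fixed-size context vector as a weighted sum of states.
W = rng.standard_normal((hidden, hidden)) * 0.1
v = rng.standard_normal(hidden) * 0.1
scores = np.tanh(H @ W) @ v          # (seq_len,) one score per token
alpha = softmax(scores)              # attention weights, sum to 1
context = alpha @ H                  # (hidden,) attention-pooled abstract vector

# Final softmax classifier over the subject categories.
W_out = rng.standard_normal((hidden, n_classes)) * 0.1
probs = softmax(context @ W_out)     # (n_classes,) category distribution
```

The attention weights `alpha` make the pooling interpretable: tokens with higher weight contribute more to the context vector, which is one reason the paper's results favor the attention mechanism over plain last-state pooling.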
