Paper Title
Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature
Paper Authors
Paper Abstract
Data citations provide a foundation for studying the impact of research data. Collecting and managing data citations is a new frontier in archival science and scholarly communication. However, discovering and curating research data citations is labor intensive. Data citations that reference unique identifiers (i.e., DOIs) are readily findable; informal mentions of research data are more challenging to detect. We propose a natural language processing (NLP) paradigm to support the human task of identifying informal mentions of research datasets. The work of discovering informal data mentions is currently performed by librarians and their staff at the Inter-university Consortium for Political and Social Research (ICPSR), a large social science data archive that maintains an extensive bibliography of data-related literature. The NLP model is bootstrapped from data citations actively collected by librarians at ICPSR. The model combines pattern matching with multiple iterations of human annotation to learn additional rules for detecting informal data mentions. The resulting examples are then used to train an NLP pipeline. The librarian-in-the-loop paradigm is centered on the data work performed by ICPSR librarians and supports broader efforts to build a more comprehensive bibliography of data-related literature that reflects the scholarly communities of research data users.