Paper Title
Automatic Analysis of Available Source Code of Top Artificial Intelligence Conference Papers
Authors
Abstract
Source code is essential for researchers who want to reproduce the methods and replicate the results of artificial intelligence (AI) papers. Some organizations and researchers manually collect AI papers with available source code as a contribution to the AI community, but manual collection is labor-intensive and time-consuming. To address this issue, we propose a method that automatically identifies papers with available source code and extracts their source code repository URLs. With this method, we find that 20.5% of the regular papers published at 10 top AI conferences from 2010 to 2019 are identified as having available source code, and that 8.1% of these source code repositories are no longer accessible. We also create the XMU NLP Lab README Dataset, the largest dataset of labeled README files for research on source code documentation. Using this dataset, we find that quite a few README files provide no installation instructions or usage tutorials. Further, we carry out a large-scale, comprehensive statistical analysis to give a general picture of the source code of AI conference papers. The proposed solution can also be applied beyond AI conference papers to analyze other scientific papers from both journals and conferences, shedding light on more domains.
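The abstract does not describe how repository URLs are extracted, so the following is only an illustrative sketch of one simple approach: a regular-expression scan of a paper's plain text. The host list `REPO_HOSTS`, the function name `extract_repo_urls`, and the normalization rules are all assumptions made for this example and are not taken from the paper.

```python
import re

# Hypothetical list of code-hosting domains; the abstract does not say
# which hosts the authors actually consider.
REPO_HOSTS = ("github.com", "gitlab.com", "bitbucket.org")

# Matches an optional scheme and "www.", a known host, then "/owner/repo".
_REPO_URL_RE = re.compile(
    r"(?:https?://)?(?:www\.)?("
    + "|".join(re.escape(host) for host in REPO_HOSTS)
    + r")/([\w.-]+)/([\w.-]+)",
    re.IGNORECASE,
)


def extract_repo_urls(text):
    """Return normalized repository URLs in order of first appearance."""
    urls = []
    for match in _REPO_URL_RE.finditer(text):
        host = match.group(1).lower()
        owner = match.group(2)
        # Drop a trailing sentence period captured by the pattern.
        repo = match.group(3).rstrip(".")
        # Drop a ".git" clone suffix so both URL forms normalize the same way.
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        if not repo:
            continue
        url = f"https://{host}/{owner}/{repo}"
        if url not in urls:
            urls.append(url)
    return urls
```

Under these assumptions, a paper would count as having available source code when `extract_repo_urls` returns a non-empty list for its text. Checking whether a repository is still accessible would need a separate network request per URL, which this sketch leaves out.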