Paper Title
Historical Document Processing: A Survey of Techniques, Tools, and Trends
Paper Authors
Abstract
Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning, to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems. Within the past twenty years, as libraries, museums, and other cultural heritage institutions have scanned an increasing volume of their historical document archives, the need to transcribe the full text from these collections has become acute. Since Historical Document Processing encompasses multiple subfields of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases, standard algorithms, tools, and datasets of Historical Document Processing, discusses the results of a literature review, and finally suggests directions for further research.