Information Retrieval from the Digitized Books

Paper title

Information Retrieval from the Digitized Books

Authors

Gupta, Riya; Jawahar, C. V.

Abstract

Extracting relevant information from a large number of documents is a challenging and tedious task. The quality of results generated by traditionally available full-text search engines and text-based image retrieval systems is not optimal. Information retrieval (IR) tasks become more challenging with nontraditional language scripts, as in the case of Indic scripts. The authors have developed an OCR (Optical Character Recognition) search engine to build an Information Retrieval and Extraction (IRE) system that replicates current state-of-the-art methods using IRE and Natural Language Processing (NLP) techniques. Here we present a study of the methods used for performing search and retrieval tasks. The details of this system, along with the statistics of the dataset (source: National Digital Library of India, or NDLI), are also presented. Additionally, ideas to further explore and add value to research in IRE are discussed.
