Paper title
Batch Clustering for Multilingual News Streaming
Paper authors
Paper abstract
Digital news articles are now widely available, published by many different editors and often written in different languages. This large volume of diverse, unorganized information makes it very difficult, if not impossible, for a human reader to follow. This creates a need for algorithms that can arrange large amounts of multilingual news into stories. To this end, we extend previous work on Topic Detection and Tracking and propose a new system inspired by newsLens. We process articles in batches, looking for monolingual local topics, which are then linked across time and languages. We introduce a novel "replaying" strategy to link monolingual local topics into stories. We also propose new fine-tuned multilingual embeddings based on SBERT to create crosslingual stories. Our system gives state-of-the-art monolingual results on a dataset of Spanish and German news and state-of-the-art crosslingual results on English, Spanish and German news.
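The batch-then-link idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' system: articles are represented as bag-of-words counts rather than SBERT embeddings, grouping is a greedy cosine-similarity threshold, the names `cosine`, `local_topics`, `link_batches` and the threshold value are hypothetical, and the "replaying" strategy is not reproduced because the abstract does not describe it.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_topics(batch, threshold=0.3):
    """Greedily group the (article_id, text) pairs of one batch into local topics.

    Assumption: a topic is summarized by the summed term counts of its articles.
    """
    topics = []
    for article_id, text in batch:
        vector = Counter(text.lower().split())
        best = max(topics, key=lambda t: cosine(vector, t["vector"]), default=None)
        if best is not None and cosine(vector, best["vector"]) >= threshold:
            best["vector"] += vector
            best["articles"].append(article_id)
        else:
            topics.append({"vector": vector, "articles": [article_id]})
    return topics


def link_batches(batches, threshold=0.3):
    """Link the local topics of successive batches into longer-running stories."""
    stories = []
    for batch in batches:
        for topic in local_topics(batch, threshold):
            best = max(stories, key=lambda s: cosine(topic["vector"], s["vector"]), default=None)
            if best is not None and cosine(topic["vector"], best["vector"]) >= threshold:
                # The local topic continues an existing story.
                best["vector"] += topic["vector"]
                best["articles"].extend(topic["articles"])
            else:
                # No story is similar enough, so the local topic starts a new one.
                stories.append({"vector": Counter(topic["vector"]),
                                "articles": list(topic["articles"])})
    return stories
```

Calling `link_batches` with a list of batches, each a list of `(article_id, text)` pairs in arrival order, returns one dictionary per story with the ids of its articles. A crosslingual variant would need vectors that are comparable across languages, which is the role the abstract assigns to the fine-tuned multilingual embeddings.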