Paper Title

Lost in Space Marking

Authors

Jacobs, Cassandra L., Pinter, Yuval

Abstract

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
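
To make the two marking schemes concrete, the following is a minimal sketch and not the authors' implementation. It assumes each word has already been split into subword pieces, and it uses the character "▁" (U+2581) as the special mark, a choice borrowed from SentencePiece for illustration. The helper names mark_word_initial and mark_word_final are hypothetical.

```python
from typing import List

# Assumption: "▁" (U+2581) stands in for the special mark; any reserved symbol would do.
MARK = "\u2581"


def mark_word_initial(pieces: List[str]) -> List[str]:
    """Attach the mark to the first subword piece of a word."""
    if not pieces:
        raise ValueError("a word must have at least one piece")
    return [MARK + pieces[0]] + pieces[1:]


def mark_word_final(pieces: List[str]) -> List[str]:
    """Attach the mark to the last subword piece of a word."""
    if not pieces:
        raise ValueError("a word must have at least one piece")
    return pieces[:-1] + [pieces[-1] + MARK]


# Hypothetical segmentation of the two words "tokenization works".
words = [["token", "ization"], ["works"]]

initial_marked = [p for w in words for p in mark_word_initial(w)]
final_marked = [p for w in words for p in mark_word_final(w)]

print(initial_marked)
print(final_marked)
```

Under either scheme the word boundaries can be recovered from the token sequence. What changes is which pieces share a vocabulary entry: with word-initial marking, a suffix such as "ization" has the same form wherever it occurs inside a word, whereas with word-final marking a prefix such as "token" does.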
