General and Domain Adaptive Chinese Spelling Check with Error Consistent Pretraining

Paper Title

General and Domain Adaptive Chinese Spelling Check with Error Consistent Pretraining

Authors

Lv, Qi; Cao, Ziqiang; Geng, Lei; Ai, Chunhui; Yan, Xu; Fu, Guohong

Abstract

The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check (CSC). Existing research expands the supervised corpus by automatically generating training examples from unlabeled data. However, there is a large gap between real input scenarios and automatically generated corpora. We therefore develop a competitive general speller, ECSpell, which adopts an Error Consistent masking strategy to create data for pretraining. This strategy specifies the error types of the automatically generated sentences so that they are consistent with real scenarios. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, spellers often work within a particular domain in real life. Experiments on our domain-specific datasets show that general models perform poorly there because of the many uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token classification based speller. Our experiments demonstrate that ECSpell$^{UD}$, namely ECSpell combined with UD, surpasses all other baselines by a large margin, even approaching the performance on the general benchmark.
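The idea of letting a user dictionary guide inference can be illustrated with a toy sketch. This is not the paper's implementation: the function name `ud_guided_decode`, the `bonus` weight, the exhaustive search, and the hand-written candidate scores are all assumptions made for illustration. It only assumes that a token classification speller yields, for each position, a few candidate characters with log-probabilities, and that candidates forming a user-dictionary term should be preferred.

```python
from itertools import product


def ud_guided_decode(candidates, user_dict, bonus=0.5):
    """Pick the candidate sequence with the best dictionary-adjusted score.

    candidates: list (one entry per position) of lists of (char, log_prob).
    user_dict:  set of domain terms supplied by the user (hypothetical format).
    bonus:      assumed reward per character of each dictionary term found.
    """
    best_text, best_score = None, float("-inf")
    # Exhaustive search over all combinations; fine for a toy example,
    # but exponential in sentence length, so a real system would need
    # a smarter search.
    for combo in product(*candidates):
        text = "".join(ch for ch, _ in combo)
        score = sum(lp for _, lp in combo)
        # Reward every user-dictionary term that appears in the output.
        score += bonus * sum(len(term) for term in user_dict if term in text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text


# Hand-written scores standing in for a speller's output (illustrative only).
candidates = [
    [("心", -0.1)],
    [("率", -0.2), ("律", -1.8)],
    [("失", -0.1)],
    [("常", -0.1)],
]

print(ud_guided_decode(candidates, set()))         # no dictionary
print(ud_guided_decode(candidates, {"心律失常"}))  # with a medical term
```

With an empty dictionary the sketch keeps the speller's top choice at every position; once the domain term is added, the dictionary bonus outweighs the lower model score of the alternative character. Because the dictionary is only consulted at inference time, it can be changed without retraining, which is what makes the approach zero-shot.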
