Paper Title

The Effect of Natural Distribution Shift on Question Answering Models

Paper Authors

John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt

Abstract

We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. Our first test set is from the original Wikipedia domain and measures the extent to which existing systems overfit the original test set. Despite several years of heavy test set re-use, we find no evidence of adaptive overfitting. The remaining three test sets are constructed from New York Times articles, Reddit posts, and Amazon product reviews and measure robustness to natural distribution shifts. Across a broad range of models, we observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively. In contrast, a strong human baseline matches or exceeds the performance of SQuAD models on the original domain and exhibits little to no drop in new domains. Taken together, our results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts.
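The performance drops above are reported in F1 points, SQuAD's token-overlap metric between a predicted answer span and a reference answer. The following is a minimal sketch of that metric for illustration; the function name `squad_f1` is hypothetical, and the official SQuAD evaluation script additionally normalizes text (lowercasing, stripping punctuation and articles) before comparing tokens, which is omitted here.

```python
from collections import Counter

def squad_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer,
    in the style of the SQuAD metric (whitespace tokenization only; the
    official script also normalizes text before comparison)."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    # Count tokens shared between prediction and reference (with multiplicity).
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Scores are conventionally reported on a 0-100 scale averaged over a test
# set, so a 14.0-point drop corresponds to, e.g., 85.0 -> 71.0 average F1.
print(squad_f1("the New York Times", "New York Times articles"))  # 0.75
```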
