Paper Title
Capturing Failures of Large Language Models via Human Cognitive Biases
Paper Authors
Paper Abstract
Large language models generate complex, open-ended outputs: instead of outputting a class label, they write summaries, generate dialogue, or produce working code. In order to assess the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.
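
As a rough illustration of the kind of experiment the abstract describes (a sketch, not the paper's actual prompts or harness), the Python snippet below probes a code-generation model for anchoring: the same task prompt is submitted with and without an unrelated "anchor" function prepended, and the two completions are compared. The generate wrapper, the task, and the anchor are hypothetical placeholders for whatever completion API and benchmark tasks are actually used.

# Minimal sketch of an anchoring probe for a code-generation model.
# generate(), TASK, and ANCHOR are illustrative placeholders.

def generate(prompt: str) -> str:
    """Stand-in for a call to a code-generation model (e.g. a completion
    API); returns a canned string here so the sketch runs as-is."""
    return "<model completion for a prompt of %d characters>" % len(prompt)

# Task: ask the model to complete a function that sums squares.
TASK = (
    "def sum_of_squares(lst):\n"
    '    """Return the sum of the squares of the numbers in lst."""\n'
)

# Anchor: an unrelated, fully written function placed earlier in the prompt.
# If the completion drifts toward it (e.g. cubes instead of squares),
# the output has been pulled toward the anchor.
ANCHOR = (
    "def sum_of_cubes(lst):\n"
    "    return sum(x ** 3 for x in lst)\n\n"
)

baseline = generate(TASK)           # task alone
anchored = generate(ANCHOR + TASK)  # same task preceded by the anchor

# In a real experiment, both completions would be run against unit tests;
# a drop in pass rate on the anchored prompt is evidence of anchoring.
print("baseline:", baseline)
print("anchored:", anchored)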