Paper Title
Preprocessing Reward Functions for Interpretability
Paper Authors
Paper Abstract
In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must instead be learned from human feedback. Since the learned reward may fail to represent user preferences, it is important to be able to validate the learned reward function prior to deployment. One promising approach is to apply interpretability tools to the reward function to spot potential deviations from the user's intention. Existing work has applied general-purpose interpretability tools to understand learned reward functions. We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions, which are then visualized. We introduce a general framework for such reward preprocessing and propose concrete preprocessing algorithms. Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
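One natural way to make the preprocessing idea concrete is via potential shaping (Ng et al., 1999): two reward functions induce the same optimal policies if they differ only by a shaping term, R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s) for some potential phi over states. Preprocessing then amounts to searching within this equivalence class for the member that is easiest to visualize, for example the sparsest one. The sketch below illustrates this for a tabular reward; the function names, the L1 sparsity objective, and the plain subgradient-descent optimizer are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def potential_shaping(reward, potential, gamma):
    """Shaped reward R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s).

    reward: (n_states, n_actions, n_states) array indexed by (s, a, s').
    potential: (n_states,) array, one value per state.
    """
    return reward + gamma * potential[None, None, :] - potential[:, None, None]


def preprocess_reward(reward, gamma, lr=0.1, steps=5000):
    """Search the shaping-equivalence class for a sparse representative.

    Minimizes the L1 norm of the shaped reward by subgradient descent on the
    potential phi. (Hypothetical objective and optimizer, chosen only to
    illustrate the preprocessing idea.)
    """
    n_states = reward.shape[0]
    phi = np.zeros(n_states)
    for _ in range(steps):
        shaped = potential_shaping(reward, phi, gamma)
        sign = np.sign(shaped)
        # d|R'|/d phi: +gamma * sign through the s' index, -sign through s.
        grad = gamma * sign.sum(axis=(0, 1)) - sign.sum(axis=(1, 2))
        phi -= lr * grad / sign.size
    return potential_shaping(reward, phi, gamma), phi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, gamma = 5, 0.9
    # A sparse "true" reward: 1 only when the agent arrives in the last state.
    sparse = np.zeros((n, n, n))
    sparse[:, :, n - 1] = 1.0
    # Obscure it with a random potential; the result looks dense but is equivalent.
    dense = potential_shaping(sparse, rng.normal(size=n), gamma)
    recovered, _ = preprocess_reward(dense, gamma)
    print("L1 before:", np.abs(dense).sum(), "after:", np.abs(recovered).sum())
```

In this toy run, the dense reward is by construction equivalent to the sparse goal reward, so the optimizer should drive the shaped reward's L1 norm back down to roughly that of the sparse version, which is far easier to inspect visually than the dense original.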