Paper Title

Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions

Authors

Audrey Huang, Nan Jiang

Abstract

Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the groundtruth. We provide exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information.
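To make the idea concrete, a generic (hypothetical) form of the regularized MIS objective described above can be sketched as follows. Here $q$ is the candidate value function, $w$ the density-ratio discriminator, $\mu$ the data distribution, $d_0$ the initial-state distribution, and $\nu$ the user-specified error-measuring distribution; the exact objective and regularizer in the paper may differ, so this is only an illustrative sketch of the standard MIS Lagrangian with a $\nu$-weighted quadratic regularizer:

```latex
\max_{w} \; L_\lambda(q, w)
  = (1-\gamma)\,\mathbb{E}_{s_0 \sim d_0}\big[q(s_0, \pi)\big]
  + \mathbb{E}_{(s,a,r,s') \sim \mu}\Big[w(s,a)\big(r + \gamma\, q(s', \pi) - q(s,a)\big)\Big]
  - \frac{\lambda}{2}\,\mathbb{E}_{(s,a) \sim \nu}\big[w(s,a)^2\big]
```

Without the last term ($\lambda = 0$) this recovers the unregularized MIS Lagrangian used for return estimation; the $\nu$-weighted regularizer is what ties the accuracy of the learned function to the user-specified distribution $\nu$.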
