Title
On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation
Authors
Abstract
Sample-efficient offline reinforcement learning (RL) with linear function approximation has recently been studied extensively. Much of prior work has yielded the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$, with $K$ being the number of episodes in the offline data. In this work, we seek to understand instance-dependent bounds for offline RL with function approximation. We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and constrained optimization on top of pessimism. We show that under a partial data coverage assumption, that of \emph{concentrability} with respect to an optimal policy, the proposed algorithm yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions, even when the offline data were adaptively collected. Moreover, when the linear features of the optimal actions in the states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, offline RL achieves absolute zero sub-optimality error when $K$ exceeds a (finite) instance-dependent threshold. To the best of our knowledge, these are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data with partial coverage. We also provide instance-agnostic and instance-dependent information-theoretic lower bounds to complement our upper bounds.
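To make the setting concrete, below is a minimal numpy sketch of pessimistic value iteration with linear function approximation, the standard base procedure that BCP-VI builds on: a ridge least-squares Q-estimate at each step, penalized by an elliptical-confidence bonus $\beta\sqrt{\phi^\top\Lambda^{-1}\phi}$. This is an illustrative sketch only; the bootstrapping and constrained-optimization components of BCP-VI are omitted, and all names (`pevi`, `beta`, the toy dataset) are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pevi(dataset, features, H, d, beta, n_actions=2, lam=1.0):
    """Pessimistic value iteration with linear function approximation (sketch).

    dataset[h]: list of transitions (s, a, r, s_next) collected at step h.
    features(s, a): returns a d-dimensional feature vector phi(s, a).
    beta: pessimism multiplier scaling the uncertainty bonus.
    Returns a dict mapping step h to a pessimistic Q-function Q(s, a);
    the greedy policy takes argmax_a Q(s, a).
    """
    V_next = lambda s: 0.0  # V_{H+1} = 0 (terminal value)
    q_funcs = {}
    for h in range(H - 1, -1, -1):
        # Ridge regression on the Bellman targets r + V_{h+1}(s').
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, r, s_next) in dataset[h]:
            phi = features(s, a)
            Lam += np.outer(phi, phi)
            target += phi * (r + V_next(s_next))
        w = np.linalg.solve(Lam, target)
        Lam_inv = np.linalg.inv(Lam)

        def Q(s, a, w=w, Lam_inv=Lam_inv, h=h):
            phi = features(s, a)
            bonus = beta * np.sqrt(phi @ Lam_inv @ phi)  # pessimism penalty
            # Clip to the valid value range [0, H - h].
            return float(np.clip(phi @ w - bonus, 0.0, H - h))

        V_next = lambda s, Q=Q: max(Q(s, a) for a in range(n_actions))
        q_funcs[h] = Q
    return q_funcs

# Toy usage: one state, two actions, one-hot action features.
# Action 0 always gives reward 1, action 1 gives reward 0.
features = lambda s, a: np.eye(2)[a]
dataset = {h: [(0, 0, 1.0, 0), (0, 1, 0.0, 0)] * 10 for h in range(2)}
pi = pevi(dataset, features, H=2, d=2, beta=0.1)
print(pi[0](0, 0) > pi[0](0, 1))  # pessimistic Q prefers the rewarding action
```

Actions that are poorly covered by the data receive a large bonus $\beta\sqrt{\phi^\top\Lambda^{-1}\phi}$ and hence a low pessimistic Q-value, which is what drives the partial-coverage guarantees discussed in the abstract.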