Paper Title
On the Convergence of Monte Carlo UCB for Random-Length Episodic MDPs
Paper Authors
Paper Abstract
In reinforcement learning, Monte Carlo algorithms update the Q function by averaging the episodic returns. In the Monte Carlo UCB (MC-UCB) algorithm, the action taken in each state is the one that maximizes the Q function plus an Upper Confidence Bound (UCB) exploration term, which biases the choice of actions toward those that have been chosen less frequently. Although there has been significant work on establishing regret bounds for MC-UCB, most of it has focused on finite-horizon versions of the problem, in which each episode terminates after a fixed number of steps. For such finite-horizon problems, the optimal policy depends on both the current state and the time within the episode. However, for many natural episodic problems, such as games like Go and chess and robotic tasks, the episode is of random length and the optimal policy is stationary. For such environments, it is an open question whether the Q function in MC-UCB converges to the optimal Q function; we conjecture that, unlike Q-learning, it does not converge for all MDPs. We nevertheless show that for a large class of MDPs, which includes stochastic MDPs such as blackjack and deterministic MDPs such as Go, the Q function in MC-UCB converges almost surely to the optimal Q function. An immediate corollary of this result is that it also converges almost surely for all finite-horizon MDPs. We also provide numerical experiments that offer further insight into MC-UCB.
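To make the algorithm described in the abstract concrete, the following is a minimal sketch of one episode of tabular Monte Carlo UCB, under assumed choices the abstract does not pin down: an every-visit return-averaging update, a standard UCB bonus with exploration constant c, and a hypothetical env interface (reset/step/actions) that is not from the paper.

```python
import math
from collections import defaultdict

def run_mc_ucb_episode(env, Q, N, c=2.0, gamma=1.0):
    """One episode of Monte Carlo UCB on a tabular episodic MDP.

    Q[(s, a)] holds the running average of observed returns and
    N[(s, a)] the visit count for each state-action pair. The env
    interface used here is a hypothetical stand-in, not the paper's.
    """
    trajectory = []
    state, done = env.reset(), False
    while not done:
        # Select the action maximizing Q plus the UCB exploration bonus;
        # an unvisited action gets an infinite bonus, so each action in a
        # state is tried at least once before the bonus kicks in.
        t = 1 + sum(N[(state, a)] for a in env.actions(state))

        def ucb(a):
            n = N[(state, a)]
            if n == 0:
                return float("inf")
            return Q[(state, a)] + c * math.sqrt(math.log(t) / n)

        action = max(env.actions(state), key=ucb)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state

    # Every-visit Monte Carlo update: fold this episode's return into the
    # running average of returns for each visited (state, action) pair.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
```

In this sketch, Q and N can be initialized as defaultdict(float) and defaultdict(int), and run_mc_ucb_episode is called repeatedly; the paper's convergence result concerns whether, for random-length episodic MDPs in the class it identifies, this kind of return-averaging Q estimate converges almost surely to the optimal Q function.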