Paper Title
Periodic Q-Learning
Paper Authors
Paper Abstract
The use of target networks is a common practice in deep reinforcement learning for stabilizing training; however, theoretical understanding of this technique is still limited. In this paper, we study the so-called periodic Q-learning algorithm (PQ-learning for short), which resembles the technique used in deep Q-learning, for solving infinite-horizon discounted Markov decision processes (DMDPs) in the tabular setting. PQ-learning maintains two separate Q-value estimates: the online estimate and the target estimate. The online estimate follows the standard Q-learning update, while the target estimate is updated periodically. In contrast to standard Q-learning, PQ-learning enjoys a simple finite-time analysis and achieves better sample complexity for finding an epsilon-optimal policy. Our result provides a preliminary justification for the effectiveness of utilizing target estimates or networks in Q-learning algorithms.
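To make the two-estimate structure concrete, the following is a minimal tabular sketch of the scheme the abstract describes: the online Q-table bootstraps from the target Q-table, and the target is synchronized to the online estimate once per period. The environment interface (a hypothetical `reset()`/`step()` returning next state, reward, and a done flag), the epsilon-greedy behavior policy, and all hyperparameter values are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pq_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1,
                period=1000, total_steps=100_000, epsilon=0.1, rng=None):
    """Sketch of tabular PQ-learning with two Q-value estimates:
    an online estimate updated every step and a target estimate
    that is refreshed only once per `period` steps."""
    rng = rng or np.random.default_rng(0)
    q_online = np.zeros((n_states, n_actions))
    q_target = np.zeros((n_states, n_actions))

    s = env.reset()  # assumed interface: returns an integer state
    for t in range(total_steps):
        # Epsilon-greedy behavior policy on the online estimate
        # (an illustrative choice; exploration is not specified above).
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(q_online[s]))

        # Assumed interface: step returns (next state, reward, done flag).
        s_next, r, done = env.step(a)

        # Online update: standard Q-learning step, but bootstrapping
        # from the *target* estimate, mirroring the role of the target
        # network in deep Q-learning.
        td_target = r + (0.0 if done else gamma * q_target[s_next].max())
        q_online[s, a] += alpha * (td_target - q_online[s, a])

        # Periodic update: copy the online estimate into the target.
        if (t + 1) % period == 0:
            q_target = q_online.copy()

        s = env.reset() if done else s_next

    # Greedy policy with respect to the final online estimate.
    return q_online, q_online.argmax(axis=1)
```

Freezing the bootstrap target between synchronizations is what makes each period look like a simple fixed-point iteration, which is the intuition behind the finite-time analysis the abstract refers to.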