Paper Title
Periodic Q-Learning
Paper Authors
Paper Abstract
The use of target networks is a common practice in deep reinforcement learning for stabilizing training; however, theoretical understanding of this technique is still limited. In this paper, we study the so-called periodic Q-learning algorithm (PQ-learning for short), which resembles the technique used in deep Q-learning, for solving infinite-horizon discounted Markov decision processes (DMDPs) in the tabular setting. PQ-learning maintains two separate Q-value estimates: the online estimate and the target estimate. The online estimate follows the standard Q-learning update, while the target estimate is updated periodically. In contrast to standard Q-learning, PQ-learning enjoys a simple finite-time analysis and achieves better sample complexity for finding an epsilon-optimal policy. Our result provides a preliminary justification for the effectiveness of utilizing target estimates or networks in Q-learning algorithms.
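To make the two-estimate structure concrete, the following is a minimal tabular sketch of the scheme the abstract describes: the online Q-table bootstraps from the target Q-table, and the target is synchronized to the online estimate once per period. The environment interface (a hypothetical `reset()`/`step()` returning next state, reward, and a done flag), the epsilon-greedy behavior policy, and all hyperparameter values are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pq_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1,
                period=1000, total_steps=100_000, epsilon=0.1, rng=None):
    """Sketch of tabular PQ-learning with two Q-value estimates:
    an online estimate updated every step and a target estimate
    that is refreshed only once per `period` steps."""
    rng = rng or np.random.default_rng(0)
    q_online = np.zeros((n_states, n_actions))
    q_target = np.zeros((n_states, n_actions))

    s = env.reset()  # assumed interface: returns an integer state
    for t in range(total_steps):
        # Epsilon-greedy behavior policy on the online estimate
        # (an illustrative choice; exploration is not specified above).
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(q_online[s]))

        # Assumed interface: step returns (next state, reward, done flag).
        s_next, r, done = env.step(a)

        # Online update: standard Q-learning step, but bootstrapping
        # from the *target* estimate, mirroring the role of the target
        # network in deep Q-learning.
        td_target = r + (0.0 if done else gamma * q_target[s_next].max())
        q_online[s, a] += alpha * (td_target - q_online[s, a])

        # Periodic update: copy the online estimate into the target.
        if (t + 1) % period == 0:
            q_target = q_online.copy()

        s = env.reset() if done else s_next

    # Greedy policy with respect to the final online estimate.
    return q_online, q_online.argmax(axis=1)
```

Freezing the bootstrap target between synchronizations is what makes each period look like a simple fixed-point iteration, which is the intuition behind the finite-time analysis the abstract refers to.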