Paper Title

QR-MIX: Distributional Value Function Factorisation for Cooperative Multi-Agent Reinforcement Learning

Authors

Jian Hu, Seth Austin Harding, Haibin Wu, Siyue Hu, Shih-wei Liao

Abstract

In Cooperative Multi-Agent Reinforcement Learning (MARL) under the setting of Centralized Training with Decentralized Execution (CTDE), agents observe and interact with their environment locally and independently. With local observation and random sampling, the randomness in rewards and observations leads to randomness in long-term returns. Existing methods such as Value Decomposition Networks (VDN) and QMIX estimate long-term returns as a scalar that carries no information about this randomness. Our proposed model, QR-MIX, introduces quantile regression, combining QMIX with the Implicit Quantile Network (IQN) to model the joint state-action value as a distribution. However, the monotonicity constraint in QMIX limits the expressiveness of the joint state-action value distribution and may lead to incorrect estimates in non-monotonic cases. Therefore, we propose a flexible loss function to approximate the monotonicity constraint found in QMIX. Our model is not only more tolerant of the randomness of returns, but also more flexible with respect to the monotonicity constraint. Experimental results demonstrate that QR-MIX outperforms the previous state-of-the-art method QMIX in the StarCraft Multi-Agent Challenge (SMAC) environment.
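The abstract points to two technical ingredients: a quantile-regression loss over the distribution of the joint return (as in IQN) and a flexible loss that relaxes QMIX's hard monotonicity constraint. The sketch below is a minimal PyTorch illustration of both ideas, not the authors' implementation; the function names, tensor shapes, and the form and coefficient of the soft penalty are assumptions made for illustration.

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """Quantile Huber loss in the style of IQN/QR-DQN.

    pred_quantiles:   (batch, n_pred)   quantiles of the predicted joint return Q_tot
    target_quantiles: (batch, n_target) bootstrapped target quantiles (no gradient)
    taus:             (batch, n_pred)   sampled quantile fractions for the predictions
    """
    # Pairwise TD errors between every target and every predicted quantile: (B, n_target, n_pred).
    td_errors = target_quantiles.unsqueeze(-1) - pred_quantiles.unsqueeze(1)
    abs_td = td_errors.abs()
    huber = torch.where(abs_td <= kappa,
                        0.5 * td_errors.pow(2),
                        kappa * (abs_td - 0.5 * kappa))
    # Asymmetric weighting |tau - 1{td < 0}| turns the Huber loss into quantile regression.
    weight = (taus.unsqueeze(1) - (td_errors.detach() < 0).float()).abs()
    return (weight * huber / kappa).sum(-1).mean(-1).mean()


def soft_monotonicity_penalty(mixer_weights, coef=1.0):
    """Hypothetical soft relaxation of QMIX's non-negativity constraint:
    instead of forcing the hypernetwork-generated mixing weights to be
    non-negative (e.g. via abs()), penalize their negative parts so that
    monotonicity is only encouraged, not enforced."""
    return coef * torch.relu(-mixer_weights).mean()


if __name__ == "__main__":
    # Toy shapes for a quick sanity check; real shapes depend on the mixer.
    batch, n = 32, 8
    pred = torch.randn(batch, n)
    target = torch.randn(batch, n)
    taus = torch.rand(batch, n)
    weights = torch.randn(batch, 4)  # placeholder mixing-network weights
    loss = quantile_huber_loss(pred, target, taus) + soft_monotonicity_penalty(weights, coef=0.1)
    print(loss.item())
```

In this sketch the soft penalty replaces QMIX's absolute-value transform of the mixing weights; the paper's actual flexible loss may be formulated differently.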
