Paper Title
The Design and Realization of Multi-agent Obstacle Avoidance based on Reinforcement Learning
Paper Authors
Abstract
Intelligent agents and multi-agent systems play important roles in scenarios such as the control systems of drone swarms, and multi-agent navigation and obstacle avoidance, as the foundational capability of such advanced applications, is of great importance. In multi-agent navigation and obstacle avoidance tasks, the agents' decision-making interactions and dynamic changes are difficult for traditional route-planning algorithms and for reinforcement learning algorithms to handle as environment complexity increases. The classical multi-agent reinforcement learning algorithm, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), addressed earlier algorithms' problems of a non-stationary training process and an inability to cope with environment randomness. However, MADDPG ignores the temporal information hidden in the agents' interactions with the environment. Moreover, because its centralized-training-with-decentralized-execution (CTDE) scheme makes each agent's critic network compute over all agents' actions and the entire environment state, it scales poorly to larger numbers of agents. To address MADDPG's neglect of the temporal information in the data, this article proposes a new algorithm, MADDPG-LSTMactor, which combines MADDPG with Long Short-Term Memory (LSTM). By feeding an agent's observations over consecutive timesteps into its policy network, it allows the LSTM layer to process the hidden temporal information. Experimental results demonstrate that this algorithm performs better in scenarios where the number of agents is small. Furthermore, to overcome MADDPG's inefficiency in scenarios with many agents, this article puts forward a lightweight MADDPG (MADDPG-L) algorithm, which simplifies the input of the critic network. Experiments show that this algorithm outperforms MADDPG when the number of agents is large.
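To make the MADDPG-LSTMactor idea concrete, the sketch below shows one plausible shape of an actor whose input is a short window of consecutive observations processed by an LSTM layer. The layer sizes, window length, and output activation are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Policy network that consumes a window of consecutive observations
    so an LSTM layer can exploit their temporal structure (a minimal
    sketch; hidden sizes and window length are assumed, not from the paper)."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=obs_dim, hidden_size=hidden_dim,
                            batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),  # continuous actions in [-1, 1]
        )

    def forward(self, obs_seq):
        # obs_seq: (batch, window, obs_dim) -- the agent's last `window` observations
        _, (h_n, _) = self.lstm(obs_seq)
        return self.head(h_n[-1])  # act on the final hidden state

# Usage example: a window of 3 consecutive observations per agent
actor = LSTMActor(obs_dim=18, act_dim=5)
obs_window = torch.randn(32, 3, 18)   # batch of 32 stacked observation windows
actions = actor(obs_window)           # -> (32, 5)
```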
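For MADDPG-L, the abstract states only that the critic's input is simplified so that it no longer grows with the full joint observation-action vector. The sketch below illustrates one way such a lightweight centralized critic could look; the mean-pooled summary of the other agents is purely an assumption used to keep the input size fixed, not the paper's stated design.

```python
import torch
import torch.nn as nn

class LightweightCritic(nn.Module):
    """Centralized critic with a reduced input: the agent's own
    observation/action plus a compact summary of the other agents,
    instead of the full concatenation used by vanilla MADDPG.
    The mean-pooling summary is an illustrative assumption."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        in_dim = 2 * (obs_dim + act_dim)  # own vector + neighbour summary
        self.q = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, own_obs, own_act, other_obs, other_act):
        # other_obs: (batch, n_others, obs_dim), other_act: (batch, n_others, act_dim)
        others = torch.cat([other_obs, other_act], dim=-1).mean(dim=1)
        own = torch.cat([own_obs, own_act], dim=-1)
        return self.q(torch.cat([own, others], dim=-1))

# Usage example: 8 agents, yet the critic's input size stays fixed
critic = LightweightCritic(obs_dim=18, act_dim=5)
q_value = critic(torch.randn(32, 18), torch.randn(32, 5),
                 torch.randn(32, 7, 18), torch.randn(32, 7, 5))  # -> (32, 1)
```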