Paper Title

Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts

Paper Authors

Riad Akrour, Davide Tateo, Jan Peters

Paper Abstract

Reinforcement learning (RL) has demonstrated its ability to solve high dimensional tasks by leveraging non-linear function approximators. However, these successes are mostly achieved by 'black-box' policies in simulated domains. When deploying RL to the real world, several concerns regarding the use of a 'black-box' policy might be raised. In order to make the learned policies more transparent, we propose in this paper a policy iteration scheme that retains a complex function approximator for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure, based on a mixture of interpretable experts. Each expert selects a primitive action according to a distance to a prototypical state. A key design decision to keep such experts interpretable is to select the prototypical states from trajectory data. The main technical contribution of the paper is to address the challenges introduced by this non-differentiable prototypical state selection procedure. Experimentally, we show that our proposed algorithm can learn compelling policies on continuous action deep RL benchmarks, matching the performance of neural network based policies, but returning policies that are more amenable to human inspection than neural network or linear-in-feature policies.
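
To make the policy structure described in the abstract concrete, below is a minimal, illustrative sketch of a mixture-of-interpretable-experts policy: each expert holds a prototypical state (in the paper these are selected from trajectory data) and a primitive action, and experts are weighted by their distance to the current state. The softmax gating over squared distances, the temperature parameter, and the weighted averaging of primitive actions are assumptions made for this sketch, not the authors' exact formulation.

```python
import numpy as np

class MixtureOfInterpretableExperts:
    """Illustrative sketch of a policy built from interpretable experts.

    Each expert i is a pair (prototype state c_i, primitive action a_i).
    The action for a state is a distance-weighted combination of the
    experts' primitive actions. This is a toy reconstruction from the
    abstract, not the paper's implementation.
    """

    def __init__(self, prototype_states, primitive_actions, temperature=1.0):
        # prototype_states: (K, state_dim) array of states taken from trajectories
        # primitive_actions: (K, action_dim) array, one primitive action per expert
        self.prototypes = np.asarray(prototype_states, dtype=float)
        self.actions = np.asarray(primitive_actions, dtype=float)
        self.temperature = temperature  # hypothetical gating sharpness parameter

    def expert_weights(self, state):
        # Gate experts by a softmax over negative squared distances
        # to their prototypical states (assumed gating rule).
        dists = np.sum((self.prototypes - state) ** 2, axis=1)
        logits = -dists / self.temperature
        logits -= logits.max()  # numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()

    def act(self, state):
        # Continuous action as the weighted average of the experts' primitive actions.
        w = self.expert_weights(np.asarray(state, dtype=float))
        return w @ self.actions


# Usage: three experts in a 2-D state space with 1-D actions (toy numbers).
policy = MixtureOfInterpretableExperts(
    prototype_states=[[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]],
    primitive_actions=[[0.2], [-0.5], [0.8]],
)
print(policy.act([0.1, 0.2]))
```

Because each expert is just a (prototype state, primitive action) pair, the learned policy can be inspected by listing the prototypes and the actions they trigger, which is the transparency property the abstract emphasizes.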
