Title

Safe Reinforcement Learning in Constrained Markov Decision Processes

Authors

Akifumi Wachi, Yanan Sui

Abstract

Safe reinforcement learning has been a promising approach for optimizing the policy of an agent that operates in safety-critical applications. In this paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints. Specifically, we take a stepwise approach for optimizing safety and cumulative reward. In our method, the agent first learns safety constraints by expanding the safe region, and then optimizes the cumulative reward in the certified safe region. We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward under proper regularity assumptions. In our experiments, we demonstrate the effectiveness of SNO-MDP through two experiments: one uses synthetic data in a new, openly available environment named GP-SAFETY-GYM, and the other simulates Mars surface exploration by using real observation data.
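The stepwise idea in the abstract — first expand a certified safe region, then optimize reward only within it — can be sketched on a toy grid. This is a minimal illustration, not the paper's method: SNO-MDP uses Gaussian-process confidence bounds over an unknown safety function, whereas here a simple dictionary plays the role of the safety oracle, queried only for cells adjacent to the certified safe set. All names (`safety`, `reward`, `expand_safe_set`, `optimize_reward`, `THRESHOLD`) are illustrative.

```python
# Toy illustration of a stepwise safe-exploration-then-optimization scheme,
# loosely in the spirit of SNO-MDP. Hypothetical data and names throughout.

safety = {  # "unknown" safety values; queried only next to certified cells
    (0, 0): 1.0, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.2, (2, 0): 0.9,
}
reward = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 2.0, (2, 0): 0.7}
THRESHOLD = 0.5  # cells with safety >= THRESHOLD are certified safe

def neighbors(cell):
    x, y = cell
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

def expand_safe_set(start):
    """Phase 1: grow the certified safe region from a known-safe start."""
    safe, frontier = {start}, [start]
    while frontier:
        cell = frontier.pop()
        for n in neighbors(cell):
            if n in safety and n not in safe and safety[n] >= THRESHOLD:
                safe.add(n)
                frontier.append(n)
    return safe

def optimize_reward(safe):
    """Phase 2: maximize reward restricted to the certified safe region."""
    return max(safe, key=lambda c: reward[c])

safe_region = expand_safe_set((0, 0))
best = optimize_reward(safe_region)
# Note: (1, 1) has the highest reward but is never certified safe,
# so the optimizer settles for the best cell inside the safe region.
```

The two phases mirror the abstract's description: exploration is driven purely by safety certification, and reward enters only once the reachable safe region has been established.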
