Paper Title
Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints
Paper Authors
Paper Abstract
In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the Peak Constrained Markov Decision Process (PCMDP), where the agent chooses a policy to maximize the total reward over a finite horizon while satisfying the constraints at every epoch with probability 1. We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) for the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(\epsilon, p)$-PAC policy when the number of episodes $K \geq \Omega\left(\frac{I^2 H^6 S A \ell}{\epsilon^2}\right)$, where $S$ and $A$ are the numbers of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell = \log\left(\frac{SAT}{p}\right)$. We note that this is the first PAC-style analysis for PCMDPs with peak constraints where the transition dynamics are not known a priori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.
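
The abstract does not state the formal PAC definition; one plausible reading consistent with the stated bound is sketched below. The notation is assumed, not taken from the paper: $\hat{\pi}$ is the learned policy, $V^{\pi}_{1}(s_1)$ its value from the initial state, and $g_i$ the $I$ peak constraint functions required to be nonnegative at every epoch.

\[
\Pr\Big[\, V^{*}_{1}(s_1) - V^{\hat{\pi}}_{1}(s_1) \le \epsilon
\ \ \text{and}\ \ g_i(s_h, a_h) \ge 0 \ \ \forall\, i \in [I],\ h \in [H] \,\Big] \ge 1 - p .
\]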
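The abstract likewise does not spell out how the constrained problem is converted to an unconstrained one. Below is a minimal Python sketch of one standard reduction (replacing the reward with a large penalty whenever a peak constraint is violated, then running optimistic episodic Q-learning); this is an illustrative assumption, not the paper's exact algorithm. The interface env_step(s, a, h) -> (reward, next_state), the constraint callables, and the problem sizes are all hypothetical.

import numpy as np

# Hypothetical sizes: S states, A actions, H epochs per episode, K episodes.
S, A, H, K = 10, 4, 20, 5000

def violates(s, a, constraints):
    """True if any assumed peak constraint g_i(s, a) >= 0 is broken."""
    return any(g(s, a) < 0 for g in constraints)

def q_learning_pcmdp(env_step, constraints, penalty=-H):
    """Sketch: fold peak constraints into the reward via a penalty,
    then run standard optimistic episodic Q-learning."""
    Q = np.full((H, S, A), float(H))      # optimistic initialization
    counts = np.zeros((H, S, A), dtype=int)
    for _ in range(K):
        s = 0                             # assumed fixed initial state
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            r, s_next = env_step(s, a, h)
            if violates(s, a, constraints):
                r = penalty               # the unconstrained conversion step
            counts[h, s, a] += 1
            lr = (H + 1) / (H + counts[h, s, a])  # common episodic learning rate
            target = r + (np.max(Q[h + 1, s_next]) if h + 1 < H else 0.0)
            Q[h, s, a] += lr * (target - Q[h, s, a])
            s = s_next
    return Q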