Paper title
Maximum entropy exploration in contextual bandits with neural networks and energy based models
Paper authors
Paper abstract
Contextual bandits can solve a huge range of real-world problems. However, current popular algorithms to solve them either rely on linear models, or unreliable uncertainty estimation in non-linear models, which are required to deal with the exploration-exploitation trade-off. Inspired by theories of human cognition, we introduce novel techniques that use maximum entropy exploration, relying on neural networks to find optimal policies in settings with both continuous and discrete action spaces. We present two classes of models, one with neural networks as reward estimators, and the other with energy based models, which model the probability of obtaining an optimal reward given an action. We evaluate the performance of these models in static and dynamic contextual bandit simulation environments. We show that both techniques outperform well-known standard algorithms, where energy based models have the best overall performance. This provides practitioners with new techniques that perform well in static and dynamic settings, and are particularly well suited to non-linear scenarios with continuous action spaces.
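As a rough illustration of the idea described in the abstract (not the paper's actual implementation), maximum entropy exploration over a discrete action space can be sketched as sampling actions from a softmax (Boltzmann) distribution over estimated rewards, where a temperature parameter trades off exploration against exploitation. All function and variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temperature=1.0):
    # Numerically stable softmax; the temperature controls how
    # close the policy is to greedy (low T) vs. uniform (high T).
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def select_action(reward_estimates, temperature=1.0):
    # Maximum-entropy (Boltzmann) exploration: sample an action with
    # probability proportional to exp(estimated reward / temperature).
    # In the paper's setting, reward_estimates would come from a neural
    # network or energy-based model conditioned on the context.
    probs = softmax(np.asarray(reward_estimates, dtype=float), temperature)
    action = rng.choice(len(probs), p=probs)
    return action, probs

# Example: estimated rewards for three discrete actions given a context.
estimates = [1.0, 2.0, 0.5]
action, probs = select_action(estimates, temperature=0.5)
```

Lowering the temperature concentrates probability on the highest-estimate action, while raising it flattens the policy toward uniform, increasing its entropy.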