Full text
Peer-reviewed
  • Accurate policy detection a...
    Chen, Hao; Liu, Quan; Fu, Ke; Huang, Jian; Wang, Chang; Gong, Jianxing

    Knowledge-Based Systems, 04/2022, Volume 242
    Journal Article

    In Markov games, how an agent can respond quickly and optimally to opponents that follow changing policies is an open problem. Most state-of-the-art algorithms assume that players change their policies only at the end of an episode, so the agent can obtain the same optimal episodic rewards by accurately detecting the opponent policy. However, the opponent may change its policy within an episode, or switch to an unknown policy. In addition, the agent's optimal return typically differs across opponent policies, which makes policy detection even more challenging. To overcome these challenges, this paper proposes an algorithm that achieves accurate opponent policy detection and efficient knowledge reuse. Within an episode, an inter-episode belief and an intra-episode belief are jointly used to continuously infer the opponent's identity, taking episodic rewards and opponent models into account; the agent can then directly reuse the best response policy. After each episode, we also detect whether the opponent has adopted an unknown policy, based on performance models. For a detected unknown opponent type, we model the previously learned policies as corresponding options for indirect knowledge reuse. Moreover, an option-based knowledge reuse (OKR) network is introduced to guide the learning of a new response policy by adaptively reusing useful knowledge from the existing learned policies. We demonstrate the advantages of the proposed algorithm over several state-of-the-art algorithms in three competitive scenarios.

    • An intra-episode belief continuously guides policy selection.
    • Episodic rewards and opponent models are used to infer the opponent policy.
    • Our approach can track an opponent that switches its policy within an episode.
    • Opponent policy switch frequencies do not degrade the agent's performance.
    • Previously learned knowledge is used against an unknown opponent type.
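
    The intra-episode belief update described in the abstract amounts to Bayesian filtering over known opponent types. The following is a minimal Python sketch of that idea, not the authors' implementation: it assumes each opponent model exposes action likelihoods, and the class name OpponentModel and the uniform-mixing rate rho are illustrative assumptions (the paper's exact update rule and its use of episodic rewards are not reproduced here).

    import numpy as np

    class OpponentModel:
        """Hypothetical opponent model: P(opponent action | state) for one known type."""
        def __init__(self, action_probs):
            # action_probs: dict mapping state -> probability vector over opponent actions
            self.action_probs = action_probs

        def likelihood(self, state, action):
            return self.action_probs[state][action]

    def update_intra_episode_belief(belief, models, state, action, rho=0.05):
        """One Bayesian step: P(type | obs) is proportional to P(obs | type) * P(type).

        rho mixes in a uniform distribution so the belief can keep tracking an
        opponent that switches policies mid-episode (an assumed leak rate, not
        the paper's exact mechanism).
        """
        likelihoods = np.array([m.likelihood(state, action) for m in models])
        posterior = belief * likelihoods
        if posterior.sum() == 0:  # observation impossible under all known types
            posterior = np.ones_like(belief)
        posterior /= posterior.sum()
        # leaky mixing keeps every type reachable after a mid-episode switch
        return (1 - rho) * posterior + rho / len(belief)

    # Toy usage: two known opponent types over a single dummy state.
    models = [
        OpponentModel({"s0": np.array([0.9, 0.1])}),  # type 0 mostly plays action 0
        OpponentModel({"s0": np.array([0.2, 0.8])}),  # type 1 mostly plays action 1
    ]
    belief = np.array([0.5, 0.5])          # inter-episode prior over known types
    for obs_action in [1, 1, 1]:           # observed opponent actions this episode
        belief = update_intra_episode_belief(belief, models, "s0", obs_action)
    print(belief)  # mass shifts toward type 1; the agent would reuse its best response

    Under this reading, the inter-episode belief supplies the prior at the start of each episode, and a persistently low likelihood under every known model would signal the unknown-opponent case that the paper handles with performance models and the OKR network.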