How do people adapt to others in adversarial settings? Prior work has shown that people often violate rational models of adversarial decision-making in repeated interactions. In particular, in mixed-strategy equilibrium (MSE) games, where optimal action selection entails choosing moves randomly, people often do not play randomly, but instead try to outwit their opponents. However, little is known about the adaptive reasoning that underlies these deviations from random behavior. Here, we examine strategic decision-making across repeated rounds of rock, paper, scissors, a well-known MSE game. In experiment 1, participants were paired with bot opponents that exhibited distinct stable move patterns, allowing us to identify the bounds of the complexity of opponent behavior that people can detect and adapt to. In experiment 2, bot opponents instead exploited stable patterns in the human participants’ moves, providing a symmetrical bound on the complexity of patterns people can revise in their own behavior. Across both experiments, people exhibited a robust and flexible attention to transition patterns from one move to the next, exploiting these patterns in opponents and modifying them strategically in their own moves. However, their adaptive reasoning showed strong limitations with respect to more sophisticated patterns. Together, these results provide a precise and consistent account of the surprisingly limited scope of people’s adaptive decision-making in this setting.
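To make the notion of a move-to-move transition pattern concrete, here is a minimal sketch, in the spirit of the exploiting bots of experiment 2 (this is our illustration, not the study's code; all names are ours): the bot tracks the opponent's transition counts and plays the counter to the opponent's most likely next move.

```python
# Illustrative sketch: a bot that exploits first-order transition patterns
# in an opponent's rock-paper-scissors play.
import random
from collections import defaultdict

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # value beats key

class TransitionBot:
    def __init__(self):
        # counts[prev][nxt] = how often the opponent played `nxt` after `prev`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def observe(self, opponent_move):
        """Call after each round with the opponent's revealed move."""
        if self.prev is not None:
            self.counts[self.prev][opponent_move] += 1
        self.prev = opponent_move

    def act(self):
        if self.prev is None or not self.counts[self.prev]:
            return random.choice(MOVES)  # no data yet: play the MSE (uniform random)
        row = self.counts[self.prev]
        predicted = max(row, key=row.get)  # opponent's most likely next move
        return BEATS[predicted]            # play its counter
```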
This letter studies the problem of dynamic jamming power allocation (JPA) with incomplete sensing information in dynamic and unknown environments. Most existing studies assume that jammers have perfect sensing information, without considering the reliability of sensors, leading to poor performance when facing incomplete sensing information. In response, we propose a robust intelligent JPA scheme and adopt a two-stage "offline training and online deployment" algorithm to guide the training and deployment process. The approach uses the observed values from spatially distributed sensor units as inputs and employs deep reinforcement learning (DRL) for jamming power decision-making. To handle missing observation data, we design a data completion module based on the generative adversarial network (GAN) framework. In addition, we introduce a priority experience replay mechanism (PERM) and opponent modeling (OM) into the decision model to enhance the learning efficiency and decision accuracy of the network. Simulation results show that the proposed approach achieves efficient jamming with incomplete information and outperforms conventional DRL-based approaches.
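A minimal sketch of what a GAN-based completion module for missing sensor readings could look like (assumed architecture and dimensions, in the style of GAIN-type imputers; not the paper's exact module): the generator fills missing entries given the observed values and a mask, and the discriminator tries to tell imputed entries from real ones.

```python
# Sketch of a GAN-style imputer for partially observed sensor readings.
import torch
import torch.nn as nn

N_SENSORS = 8  # hypothetical number of distributed sensor units

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * N_SENSORS, 64), nn.ReLU(),
            nn.Linear(64, N_SENSORS))
    def forward(self, x, mask):
        filled = self.net(torch.cat([x * mask, mask], dim=-1))
        return x * mask + filled * (1 - mask)  # keep observed entries as-is

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_SENSORS, 64), nn.ReLU(),
            nn.Linear(64, N_SENSORS), nn.Sigmoid())  # per-entry "is real" score
    def forward(self, x):
        return self.net(x)

# One adversarial step on a batch of partially observed readings:
x = torch.randn(32, N_SENSORS)                    # ground-truth readings
mask = (torch.rand(32, N_SENSORS) > 0.3).float()  # 1 = observed, 0 = missing
G, D = Generator(), Discriminator()
completed = G(x, mask)
d_loss = nn.functional.binary_cross_entropy(D(completed.detach()), mask)
g_loss = nn.functional.binary_cross_entropy(D(completed), torch.ones_like(mask))
```

The completed observation vector would then serve as the input to the DRL decision network.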
Multi-agent reinforcement learning (MARL) is an abstract framework modeling a dynamic environment that involves multiple learning and decision-making agents, each of which tries to maximize her cumulative reward. In MARL, each agent discovers a strategy alongside others and adapts her policy in response to the behavioural changes of others. A fundamental difficulty faced by MARL is that every agent is dynamically learning and changing to improve her reward, making the whole system unstable and agents’ policies difficult to converge. In this paper, we introduce an entropy regularizer into the Bellman equation and use a Lagrange approach to optimize it. We then propose a MARL algorithm based on the maximum entropy principle and the actor-critic method, called Multi-Agent Deep Soft Policy Gradient (MADSPG); it follows the policy gradient approach and uses a policy network and a value network. Using the Lagrange approach and dynamic minimax optimization, we further propose AUTO-MADSPG, a variant with an automatically adjusted entropy regularizer. These algorithms make multi-agent learning more stable while guaranteeing sufficient exploration. Finally, we incorporate MADSPG and a recently proposed opponent modeling component into an integrated framework, which outperforms many state-of-the-art MARL algorithms in conventional cooperative and competitive game settings.
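For reference, the standard single-agent maximum-entropy backup and the Lagrangian temperature objective (as popularized by soft actor-critic) take the following form; the multi-agent variants in the paper additionally condition on the other agents' policies, and the exact formulation shown here is our illustration, not the authors' equations.

```latex
% Entropy-regularized (soft) Bellman backup:
\[
Q(s,a) \;\leftarrow\; r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[
  \mathbb{E}_{a' \sim \pi}\big[ Q(s',a') - \alpha \log \pi(a' \mid s') \big] \right]
\]
% Lagrangian objective for the automatically adjusted temperature \alpha,
% given a target entropy \bar{\mathcal{H}}:
\[
J(\alpha) = \mathbb{E}_{a \sim \pi}\big[ -\alpha \log \pi(a \mid s) - \alpha\, \bar{\mathcal{H}} \big]
\]
```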
Urban land-use planning decisions generally require negotiation between multiple stakeholders to reach an agreement on a specific plan. Computer-aided tools such as group decision support systems can facilitate this complicated process for the actors involved. In the context of these systems, using software agents enhances the effectiveness and efficiency of group decision support. The software agents can perform computational and analytical tasks on behalf of the stakeholders. In more advanced cases, the agents can also learn stakeholders’ preferences and behavior to help them make good decisions. This paper proposes an intelligent web-based spatial group decision support system to investigate the role of opponent modeling in urban land-use planning using a multi-agent system approach. For this purpose, two successive meetings are held in which the system is used: in the first meeting, the stakeholders revise the existing plans and respond to other stakeholders’ requests. During this meeting, the software agents attempt to model the behavior of the stakeholders they are associated with, using a Bayesian learning method combined with social value orientation theory to describe stakeholders’ decision behavior in a group context. In the second meeting, the software agents assist the stakeholders during plan revision by providing them with the information obtained. In an application, a comparison of the results of the two meetings showed that the information provided about the opponents reduced negotiation time and contributed to reaching a better spatial configuration of land uses, as measured by a criterion derived from social value orientation theory.
•It provides a web-based intelligent Group Decision Support System for land-use planning using multi-agent systems.
•It investigates the influence of providing information about the other participants on the results of a group meeting.
•It utilizes Social Value Orientation (SVO) theory and Bayesian learning to model the participating stakeholders.
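As an illustration of Bayesian learning over SVO types as described above, here is a minimal sketch (the type set, the logistic response model, and all names are our simplifying assumptions, not the paper's implementation): the agent maintains a belief over a stakeholder's SVO type and updates it from observed accept/reject responses.

```python
# Minimal sketch: Bayesian update of a belief over SVO types.
import math

SVO_TYPES = ["individualistic", "prosocial", "competitive"]

def accept_likelihood(svo, own_gain, others_gain):
    """Assumed logistic response model for accepting a plan revision."""
    if svo == "individualistic":
        utility = own_gain
    elif svo == "prosocial":
        utility = 0.5 * own_gain + 0.5 * others_gain
    else:  # competitive: values relative advantage
        utility = own_gain - others_gain
    return 1.0 / (1.0 + math.exp(-utility))

def update_belief(belief, accepted, own_gain, others_gain):
    """One Bayesian update from an observed accept/reject decision."""
    posterior = {}
    for svo, prob in belief.items():
        like = accept_likelihood(svo, own_gain, others_gain)
        posterior[svo] = prob * (like if accepted else 1.0 - like)
    z = sum(posterior.values())
    return {svo: p / z for svo, p in posterior.items()}

belief = {svo: 1.0 / len(SVO_TYPES) for svo in SVO_TYPES}
belief = update_belief(belief, accepted=True, own_gain=-0.2, others_gain=1.0)
# Accepting a personally costly but socially beneficial plan shifts the
# belief toward the prosocial type.
```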
The non-stationarity of the environment is a crucial challenge for competitive Multi-Agent Reinforcement Learning (MARL) due to the constantly changing opponent policy. Existing schemes struggle to make the protagonist agent respond agilely to the opponent’s changes and the resulting non-stationarity, which may limit their applicability. To address the dynamic opponent policy and adapt continuously to the non-stationary environment, we propose a Temporal Convolutional Network (TCN) model for modeling and predicting opponent behaviors, called OM-TCN, and apply it to the widely used Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm of competitive MARL. In this work, we collect the opponent’s behavior data observed by the protagonist agent and serialize it at the granularity of episodes. We then input the time-series data into OM-TCN for sequence modeling. OM-TCN learns the historical behaviors of the opponent instead of overfitting to a specific opponent policy, and can predict the opponent’s future actions. Finally, we use the predicted opponent actions in place of the history sampled from the replay buffer, and apply the OM-TCN model to the MADDPG framework for decentralized training. We use the competitive scenario of the Multi-agent Particle Environment (MPE) to evaluate the proposed method. Simulation results show that the protagonist agent learns a more efficient and stable policy and converges more easily than other baselines.
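A minimal sketch of the underlying idea (assumed dimensions and layer sizes; not the authors' network): a causal temporal convolution over the opponent's recent action sequence outputs logits for its next action, never letting any timestep see the future.

```python
# Sketch: causal TCN that predicts an opponent's next action from history.
import torch
import torch.nn as nn

N_ACTIONS, HIST_LEN = 5, 16  # hypothetical action space and window size

class CausalConv1d(nn.Conv1d):
    """1-D convolution left-padded so outputs never see future timesteps."""
    def __init__(self, c_in, c_out, k, dilation=1):
        super().__init__(c_in, c_out, k, dilation=dilation)
        self.left_pad = (k - 1) * dilation
    def forward(self, x):
        return super().forward(nn.functional.pad(x, (self.left_pad, 0)))

class OpponentTCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.tcn = nn.Sequential(
            CausalConv1d(N_ACTIONS, 32, k=3, dilation=1), nn.ReLU(),
            CausalConv1d(32, 32, k=3, dilation=2), nn.ReLU())
        self.head = nn.Linear(32, N_ACTIONS)
    def forward(self, one_hot_actions):   # (batch, N_ACTIONS, HIST_LEN)
        h = self.tcn(one_hot_actions)     # (batch, 32, HIST_LEN)
        return self.head(h[:, :, -1])     # logits for the next action

history = torch.zeros(1, N_ACTIONS, HIST_LEN)  # one-hot action history
next_action_logits = OpponentTCN()(history)
```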
In multi-agent reinforcement learning, multiple agents learn simultaneously while interacting with a common environment and each other. Since the agents adapt their policies during learning, not only does the behavior of a single agent become non-stationary, but so does the environment as perceived by each agent. This makes policy improvement particularly challenging. In this paper, we propose to exploit the fact that the agents seek to improve their expected cumulative reward and introduce a novel Time Dynamical Opponent Model (TDOM) to encode the knowledge that opponent policies tend to improve over time. We motivate TDOM theoretically by deriving a lower bound on the log objective of an individual agent and further propose Multi-Agent Actor-Critic with Time Dynamical Opponent Model (TDOM-AC). We evaluate the proposed TDOM-AC on a differential game and the Multi-agent Particle Environment. We show empirically that TDOM achieves superior opponent behavior prediction during test time. The proposed TDOM-AC methodology outperforms state-of-the-art actor-critic methods on the performed tasks in cooperative and especially in mixed cooperative-competitive environments, with more stable training and faster convergence. Our code is available at https://github.com/Yuantian013/TDOM-AC.
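The paper's lower bound is not reproduced here, but one crude way to operationalize "opponent policies tend to improve over time" is to let recent episodes dominate the opponent model's fit. The sketch below is our simplification, not TDOM itself; all names and the recency-weighting scheme are hypothetical.

```python
# Sketch: opponent-model imitation loss that down-weights old episodes.
import torch
import torch.nn as nn

def weighted_opponent_loss(model, obs, acts, episode_idx, decay=0.95):
    """obs: (N, obs_dim) observations; acts: (N,) opponent action labels;
    episode_idx: (N,) episode each sample came from (0 = oldest)."""
    logits = model(obs)
    per_sample = nn.functional.cross_entropy(logits, acts, reduction="none")
    age = (episode_idx.max() - episode_idx).float()  # 0 for the newest episode
    return (decay ** age * per_sample).mean()

# Hypothetical usage with a small classifier as the opponent model:
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
obs = torch.randn(10, 4)
acts = torch.randint(0, 3, (10,))
episode_idx = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
loss = weighted_opponent_loss(model, obs, acts, episode_idx)
```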
A negotiation between agents is typically an incomplete information game, where the agents initially do not know their opponent’s preferences or strategy. This poses a challenge, as efficient and effective negotiation requires the bidding agent to take the other’s wishes and future behavior into account when deciding on a proposal. Therefore, in order to reach better and earlier agreements, an agent can apply learning techniques to construct a model of the opponent. There is a mature body of research in negotiation that focuses on modeling the opponent, but there exists no recent survey of commonly used opponent modeling techniques. This work aims to advance and integrate knowledge of the field by providing a comprehensive survey of currently existing opponent models in a bilateral negotiation setting. We discuss the ways in which opponent modeling has been used to benefit agents so far, and we introduce a taxonomy of currently existing opponent models based on their underlying learning techniques. We also present techniques to measure the success of opponent models and provide guidelines for deciding on the appropriate performance measures for every opponent model type in our taxonomy.
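To give a concrete flavor of the techniques such a survey covers, here is a minimal sketch of a simple frequency-style opponent model (our illustration, not any particular surveyed agent): issues whose values the opponent rarely changes across successive bids are assumed to matter more to it.

```python
# Sketch: frequency-based estimate of an opponent's issue weights.
from collections import Counter

def estimate_issue_weights(bids):
    """bids: list of dicts mapping issue name -> chosen value, in bid order."""
    issues = bids[0].keys()
    unchanged = Counter()
    for prev, cur in zip(bids, bids[1:]):
        for issue in issues:
            if prev[issue] == cur[issue]:
                unchanged[issue] += 1  # stable issues suggest high importance
    total = sum(unchanged.values()) or 1
    return {issue: unchanged[issue] / total for issue in issues}

bids = [{"price": "high", "delivery": "fast"},
        {"price": "high", "delivery": "slow"},
        {"price": "high", "delivery": "fast"}]
weights = estimate_issue_weights(bids)  # price never moved: likely the key issue
```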
Existing investigations of opponent modeling and intention inference cannot clearly describe or practically explain the opponent's behaviors and intentions, which may limit their applicability. In this work, we propose a novel approach for explaining the opponent's policy and inferring its intention based on a behavioral portrait of the opponent. Specifically, we use the multi-agent deep deterministic policy gradient (MADDPG) algorithm to train the agent and opponent in a competitive environment, and collect the opponent's behavioral data from the agent's observations. We then perform pattern segmentation and extract the opponent's behavior events via the Toeplitz inverse covariance-based clustering (TICC) algorithm, so that the opponent's behavior data can be encoded into a knowledge graph, named the opponent behavior knowledge graph (OKG). Based on this, we build a question-answering (QA) system to query and match the opponent's historical information in the OKG, so that the agent can obtain additional experience and gradually infer the opponent's intention over successive episodes. We evaluate the proposed method on the competitive scenario in the multi-agent particle environment (MPE). Simulation results show that the agents are able to learn better policies with the opponent portrait in competitive settings.
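A minimal sketch of storing segmented behavior events as a graph and querying them (the schema, event names, and QA lookup are our assumptions, not the paper's OKG):

```python
# Sketch: a toy opponent behavior knowledge graph with a simple query.
import networkx as nx

okg = nx.MultiDiGraph()
# Hypothetical events produced by a segmentation step such as TICC:
events = [
    ("episode_1", "chase",   {"t_start": 0,  "t_end": 40}),
    ("episode_1", "retreat", {"t_start": 41, "t_end": 60}),
    ("episode_2", "chase",   {"t_start": 0,  "t_end": 55}),
]
for episode, behavior, attrs in events:
    okg.add_edge(episode, behavior, **attrs)  # edge: episode -> behavior event

def query_behavior(graph, behavior):
    """Toy QA lookup: which episodes contain this behavior, and when?"""
    return [(u, d) for u, v, d in graph.edges(data=True) if v == behavior]

print(query_behavior(okg, "chase"))
```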
In Markov games, how an agent should respond quickly and optimally to opponents that follow changing policies is an open problem. Most state-of-the-art algorithms assume that players only change their policies at the end of an episode, so that the agent can obtain the same optimal episodic rewards by accurately detecting the opponent policy. However, the opponent may change its policy within an episode, or switch to an unknown policy. Moreover, different opponent policies may yield different optimal returns for the agent, which brings greater challenges to policy detection. To overcome these challenges, this paper proposes an algorithm that achieves accurate opponent policy detection and efficient knowledge reuse. Within an episode, an inter-episode belief and an intra-episode belief are jointly used to continuously infer the opponent’s identity, taking into account the episodic rewards and opponent models; the agent can then directly reuse the best response policy. After each episode, we also detect whether the opponent has adopted an unknown policy, based on performance models. For a detected unknown opponent type, we model the previously learned policies as corresponding options for indirect knowledge reuse. Moreover, an option-based knowledge reuse (OKR) network is introduced to guide new response policy learning by adaptively reusing useful knowledge from the existing learned policies. We demonstrate the advantages of the proposed algorithm over several state-of-the-art algorithms in three competitive scenarios.
•An intra-episode belief continuously guides policy selection.
•Episodic rewards and opponent models are used to infer the opponent policy.
•Our approach can track the opponent who switches its policy within an episode.
•Opponent policy switch frequencies do not degrade the agent’s performance.
•Previously learned knowledge is used against an unknown opponent type.
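A minimal sketch of the belief mechanism behind these points (our simplification; the candidate models and names are hypothetical): at each step, the belief over known opponent policies is reweighted by how well each candidate predicted the opponent's observed action.

```python
# Sketch: step-wise Bayesian belief update over known opponent policies.

def update_belief(belief, opponent_models, state, action, eps=1e-6):
    """belief: {policy_id: prob}; opponent_models[pid](state) returns a
    dict of action probabilities under candidate policy pid."""
    posterior = {}
    for pid, prob in belief.items():
        likelihood = opponent_models[pid](state).get(action, 0.0) + eps
        posterior[pid] = prob * likelihood
    z = sum(posterior.values())
    return {pid: p / z for pid, p in posterior.items()}

# The agent then reuses the response policy matching the belief's argmax;
# a persistently low maximum belief can flag an unknown opponent type.
```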
Opponent modeling is necessary for autonomous agents to capture the intents of others during strategic interactions. Most previous works assume that the agent can access enough interaction history to build the model. However, this assumption may not be realistic. To address this problem, we present a novel rationality-consistent opponent modeling (ROM) method for games with imperfect information. In our approach, a game-theoretic notion of rationality consistency is proposed to exploit a characteristic of imperfect-information sequential games: rational behavior at disjoint information sets is correlated through the opponent's anticipated behavior. Using the correlation between different information sets, agents can infer the opponent's strategies at information sets correlated with observed behavior. To exploit this correlation, ROM reasons from the opponent's perspective and rationalizes its past behavior. In this way, ROM better adapts to different opponents and achieves a more accurate opponent model with insufficient observation history, as verified by experiments in different settings. A heuristic adaptation approach is also applied in ROM, which updates the opponent model in an online manner and significantly reduces the computation cost. We evaluate ROM in both a grid-world game and a poker game. Compared with other opponent modeling methods, ROM shows better performance and makes more accurate predictions in both games against different types of opponents with limited interaction. Experimental results also show that ROM's time cost is significantly reduced through heuristic adaptation.
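A heavily simplified sketch of the rationalization idea (our illustration; ROM itself reasons over information sets of an imperfect-information game, and all names here are hypothetical): discard candidate opponent strategies under which the observed actions were clearly suboptimal, and predict with the survivors.

```python
# Sketch: filter candidate opponent strategies by rationality of past play.

def consistent_candidates(candidates, history, tolerance=0.05):
    """candidates: {name: strategy}; each strategy maps an information set
    to {action: expected_value}. history: [(info_set, observed_action)]."""
    surviving = {}
    for name, strategy in candidates.items():
        rational = True
        for info_set, action in history:
            values = strategy(info_set)
            if values.get(action, float("-inf")) < max(values.values()) - tolerance:
                rational = False  # the observed action was clearly suboptimal
                break
        if rational:
            surviving[name] = strategy
    return surviving
```

Because rational play at one information set constrains play at correlated ones, even a short history can eliminate many candidates.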