Multiagent reinforcement learning (MARL) has been used extensively in game environments. One of the main challenges in MARL is that the environment of the agent system is dynamic and the other agents are updating their strategies at the same time. Therefore, modeling the opponents' learning process and adopting specific strategies to shape that learning is an effective way to obtain better training results. Previous studies such as DRON, LOLA and SOS approximated the opponent's learning process and demonstrated effective applications. However, these studies modeled only transient changes in opponent strategies and lacked stability in the improvement of equilibrium efficiency. In this article, we design the MOL (modeling opponent learning) method based on the Stackelberg game. We use best response theory to approximate the opponents' preferences for different actions and to explore stable equilibria with higher rewards. We find that MOL achieves better results in several games with classical structures (the Prisoner's Dilemma, the Stackelberg Leader game and Stag Hunt with 3 players), as well as in randomly generated bimatrix games. MOL performs well in competitive games played against different opponents and converges to stable points that score above the Nash equilibrium in repeated game environments. The results may provide a reference for the definition of equilibrium in multiagent reinforcement learning systems, and contribute to the design of learning objectives in MARL that avoid disadvantageous local equilibria and improve general efficiency.
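To make the Stackelberg framing concrete, the following minimal sketch (our illustration in Python, not the authors' MOL implementation) computes a pure-strategy leader commitment in a one-shot bimatrix game: the leader picks the row whose payoff is best once the follower best-responds.

import numpy as np

def stackelberg_best_response(leader_payoff, follower_payoff):
    # Illustrative, assuming pure strategies; leader_payoff and
    # follower_payoff are (n_rows, n_cols) payoff matrices.
    best_value, best_pair = -np.inf, None
    for row in range(leader_payoff.shape[0]):
        col = int(np.argmax(follower_payoff[row]))  # follower's best response
        if leader_payoff[row, col] > best_value:
            best_value, best_pair = leader_payoff[row, col], (row, col)
    return best_pair, best_value

# Prisoner's Dilemma: defection dominates for the follower, so the
# pure-commitment outcome coincides with mutual defection here.
L = np.array([[3, 0], [5, 1]])
F = np.array([[3, 5], [0, 1]])
print(stackelberg_best_response(L, F))  # ((1, 1), 1)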
Theory of mind refers to the ability to reason explicitly about unobservable mental content of others, such as beliefs, goals, and intentions. People often use this ability to understand the behavior of others as well as to predict future behavior. People even take this ability a step further and use higher-order theory of mind by reasoning about the way others make use of theory of mind, in turn attributing mental states to different agents. One of the possible explanations for the emergence of the cognitively demanding ability of higher-order theory of mind suggests that it is needed to deal with mixed-motive situations. Such mixed-motive situations involve partially overlapping goals, so that both cooperation and competition play a role. In this paper, we consider a particular mixed-motive situation known as Colored Trails, in which computational agents negotiate using alternating offers with incomplete information about the preferences of their trading partner. In this setting, we determine to what extent higher-order theory of mind is beneficial to computational agents. Our results show limited effectiveness of first-order theory of mind, while second-order theory of mind turns out to benefit agents greatly by allowing them to reason about the way they can communicate their interests. Additionally, we let human participants negotiate with computational agents of different orders of theory of mind. These experiments show that people spontaneously make use of second-order theory of mind in negotiations when their trading partner is capable of second-order theory of mind as well.
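As a rough illustration of the recursion behind orders of theory of mind (a hypothetical level-k sketch in a one-shot bimatrix game, not the Colored Trails negotiation itself): an order-k agent predicts its opponent's move by simulating an order-(k-1) agent on the opponent's payoffs.

import numpy as np

def tom_action(k, my_payoff, opp_payoff):
    # my_payoff[i, j]: my payoff if I play row i and the opponent plays
    # column j; opp_payoff holds the opponent's payoffs in the same layout.
    if k == 0:
        # Order 0: ignore the opponent's reasoning; maximize average payoff.
        return int(np.argmax(my_payoff.mean(axis=1)))
    # Simulate the opponent as an order-(k-1) reasoner (matrices transposed
    # so the opponent becomes the row player), then best-respond to it.
    predicted = tom_action(k - 1, opp_payoff.T, my_payoff.T)
    return int(np.argmax(my_payoff[:, predicted]))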
In automated bilateral multi-issue negotiations, two intelligent automated agents negotiate on behalf of their owners over many issues in order to reach an agreement. Modeling the opponent can substantially boost the performance of the agents and increase the quality of the negotiation outcome. State-of-the-art models accomplish this by making assumptions about the opponent that restrict the applicability of the models in real scenarios. In this study, a less restricted technique, POPPONENT, in which perceptron units are applied to model the preferences of the opponent, is proposed. This model adopts the Multi Bipartite version of the Standard Gradient Descent search algorithm (MBGD) to find the best hypothesis, which is the best preference profile. In order to evaluate the accuracy and performance of the proposed opponent model, it is compared with the state-of-the-art models available in the Genius repository. The devised experimental setting confirms the higher accuracy of POPPONENT compared with the most accurate state-of-the-art model. Evaluating the model in real-world negotiation scenarios in the Genius framework also confirms its high accuracy relative to the state-of-the-art models in estimating the utility of offers. The findings indicate that the proposed model is individually and socially efficient. The proposed MBGD method could also be adopted in other practical areas of Artificial Intelligence.
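A minimal sketch of the core idea, under our own simplifying assumptions (a single sigmoid unit over one-hot issue values, plain gradient descent rather than the paper's MBGD variant, and the common proxy of labeling the opponent's own early bids with high utility):

import numpy as np

class PerceptronOpponentModel:
    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, offer):
        # offer: one-hot encoding of the issue values in a bid.
        return 1.0 / (1.0 + np.exp(-offer @ self.w))  # utility in (0, 1)

    def update(self, offer, target_utility):
        # Gradient descent on squared error through the sigmoid.
        p = self.predict(offer)
        self.w -= self.lr * (p - target_utility) * p * (1.0 - p) * offer

# Label the opponent's own bids with a utility that decays with normalized
# time t, a concession-based proxy (our assumption, not the paper's rule).
model = PerceptronOpponentModel(n_features=6)
model.update(np.array([1, 0, 0, 1, 0, 0], float), target_utility=1.0 - 0.3 * 0.1)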
In recent years, agreement technologies have garnered interest in the field of multi-agent systems. Automated negotiation is one of these agreement technologies, in which agents negotiate with each other to reach an agreement that resolves the conflicts between their preferences. Although most agents keep their own preferences private, it is necessary to estimate the opponent's preferences to obtain a better agreement. Therefore, opponent modeling is one of the most important elements of an automated negotiating strategy. A frequency model is widely used for opponent modeling because of its robustness against various types of strategy and its ease of implementation. However, existing frequency models do not consider the opponent's proposal speed or the transition of offers. This study proposes a novel frequency model that captures the opponent's behavior using two main elements: the offer ratio and the weighting function. The offer ratio stabilizes the model against changes in the opponent's offering speed, whereas the weighting function takes the opponent's concession into account. The two experiments conducted herein show that our proposed model is more accurate than other frequency models. Additionally, we find that an agent equipped with the proposed model achieves a significantly higher utility value in negotiations.
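The following sketch shows the general shape of such a frequency model; a simple decay factor stands in for the paper's offer ratio and weighting function, so treat the details as our assumptions rather than the proposed model itself.

from collections import defaultdict

class WeightedFrequencyModel:
    def __init__(self, decay=0.95):
        self.counts = defaultdict(lambda: defaultdict(float))
        self.weight = 1.0   # decays each observation: early bids count more
        self.decay = decay

    def observe(self, bid):
        # bid: mapping from issue name to the value offered for that issue.
        for issue, value in bid.items():
            self.counts[issue][value] += self.weight
        self.weight *= self.decay

    def estimated_utility(self, bid):
        # Average, over issues, of how often the bid's value was offered,
        # normalized by the most frequent value seen for that issue.
        score = 0.0
        for issue, value in bid.items():
            issue_counts = self.counts[issue]
            norm = max(issue_counts.values(), default=1.0)
            score += issue_counts.get(value, 0.0) / norm
        return score / len(bid) if bid else 0.0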
With the development of deep reinforcement learning (DRL), much progress has been achieved in various perfect- and imperfect-information games. Among these games, DouDizhu, a popular card game in China, poses great challenges because of its imperfect information, large state and action spaces, and cooperation issues. In this paper, we put forward an AI system for this game that adopts opponent modeling and coach-guided training to help agents make better decisions when playing cards. In addition, we take into consideration the bidding phase of DouDizhu, which is usually ignored by existing works, and train a bidding network using Monte-Carlo simulation. As a result, we obtain a full version of our AI system that is applicable to real-world competitions. We conduct extensive experiments to evaluate the effectiveness of the three techniques adopted in our method and demonstrate the superior performance of our AI over the state-of-the-art DouDizhu AI, DouZero. We uploaded two versions of our AI system, one bidding-free and the other equipped with the bidding network, to the Botzone platform; they rank first among over 400 and 250 AI programs on the two corresponding leaderboards, respectively. Our code is available at https://github.com/submit-paper/Doudizhu_plus .
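For the bidding phase, the label-generation idea can be sketched as plain Monte-Carlo estimation (a hypothetical simplification: playout is an assumed rollout policy, and the paper trains a network on such estimates rather than running simulations online):

import random

def estimate_bid_value(my_hand, unseen_cards, playout, n_sims=1000):
    # my_hand: our 17 cards (a list) before bidding; unseen_cards: the
    # remaining 37 (two opposing hands plus the 3 bottom cards).
    wins = 0
    for _ in range(n_sims):
        deck = list(unseen_cards)
        random.shuffle(deck)
        opp1, opp2, bottom = deck[:17], deck[17:34], deck[34:]
        # The landlord takes the 3 bottom cards; playout should return 1
        # if the landlord wins the simulated game, else 0.
        wins += playout(my_hand + bottom, (opp1, opp2))
    return wins / n_sims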
Determining an effective strategy for intelligent agents in multilateral negotiations is a more complicated problem than in bilateral negotiations. In order to achieve an optimal and beneficial agreement, the agent needs to consider the behavior and desired utility of more than one opponent, determine a concession tactic based on a smaller agreement space, and use a computationally efficient mechanism for generating optimal offers. A mere extension of bilateral negotiation strategies cannot be effective in multilateral negotiations, because most bilateral negotiation strategies are by nature based on interaction with a single opponent and on tracking a single behavior during the negotiation process. In this paper, we propose an adaptive approach based on a multi-party perspective to determine a multilateral negotiation strategy. The proposed approach applies the BOA framework (Bidding, Opponent model, and Acceptance) and dynamically models the opponents' preference profiles. In order to estimate the utility obtainable from opponents and help find a good offer, the agent uses an ensemble of individual frequency-based opponent models with a different level of attention to each party's behavior. The proposed approach also implements a bidding strategy that applies the opponents' desired utility to adapt the agent's concession tactic and produce appropriate offers. Experimental evaluations on various negotiation scenarios against state-of-the-art multilateral negotiation strategies show that our proposed strategy provides superior performance in both individual utility and social welfare and leads to more optimal and fairer agreements.
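A sketch of the multi-party ensemble idea (the attention weights and the per-opponent model here are our illustration; WeightedFrequencyModel refers to the sketch after the frequency-model abstract above):

class EnsembleOpponentModel:
    def __init__(self, opponent_ids, model_factory):
        self.models = {oid: model_factory() for oid in opponent_ids}
        # Uniform attention to start; a real agent would adapt these weights
        # to each party's behavior over the session.
        self.attention = {oid: 1.0 / len(opponent_ids) for oid in opponent_ids}

    def observe(self, opponent_id, bid):
        self.models[opponent_id].observe(bid)

    def score(self, bid):
        # Attention-weighted sum of each party's estimated utility for a bid.
        return sum(a * self.models[oid].estimated_utility(bid)
                   for oid, a in self.attention.items())

# Usage: model = EnsembleOpponentModel(["B", "C"], WeightedFrequencyModel)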
• Investigated opponent modeling in automated negotiation.
• Fuzzified the stakeholders' evaluation models based on weighted preference limits.
• Proposed a recursive learning approach to learn the parameters of these models.
• Applied a probabilistic model that uses the learned criteria to find a proposal.
Automated negotiation is a toolset for modeling human interactions during a negotiation process, with the aim of improving the efficiency and quality of decision-making using advanced information analytics. During the negotiation, the participants share their viewpoints and concerns about the negotiation issues. In reality, however, they usually do not reveal the details of their preferences to one another. Therefore, modeling and learning opponents' behavior is a crucial component of automated negotiation. In this paper, we propose an estimation technique based on recursive Bayesian filtering to facilitate opponent modeling and learning in the context of multi-participant, multi-issue negotiations. In the proposed technique, opponents' preference profiles are modeled using fuzzy functions, which are very close to the way humans evaluate alternatives. As the negotiation progresses, the agents can recursively learn the parameters of these models in real time. The only information required for this learning process is the feedback and the arguments the participants may provide in support of their decisions. At each round, a probabilistic graphical model is also implemented that utilizes the learned preference limits of the participants to offer a new proposal with a high probability of satisfying the participants and reaching an agreement. The proposed methodology is examined in two different negotiation contexts: energy-system development and real estate service. The experiments show that the proposed opponent modeling/learning approach increases the efficiency of the negotiation by up to 85% and facilitates reaching an agreement in fewer rounds of negotiation, without requiring any prior understanding of the negotiation participants.
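A stripped-down sketch of the recursive update (our illustration: a discrete posterior over candidate preference limits driven by accept/reject feedback, without the paper's fuzzy formulation or graphical model):

import numpy as np

def bayes_update(prior, hypotheses, proposal, accepted, noise=0.1):
    # hypotheses: candidate reservation utilities for one participant; a
    # participant is assumed to accept proposals at or above their limit,
    # with a small noise term for imperfect behavior.
    accepts = (hypotheses <= proposal).astype(float)
    like = accepts * (1.0 - noise) + (1.0 - accepts) * noise
    if not accepted:
        like = 1.0 - like
    posterior = prior * like
    return posterior / posterior.sum()

hyps = np.linspace(0.0, 1.0, 11)       # candidate limits
belief = np.full(11, 1.0 / 11)         # uniform prior
belief = bayes_update(belief, hyps, proposal=0.6, accepted=False)
print(hyps[np.argmax(belief)])         # most likely limit after one rejection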
In automated negotiation, intelligent agents try to reach the best deal possible on behalf of their owners. In previous studies, the opponent model of a negotiator agent has been used to tune the final bid out of a group of bids chosen by the agent's strategy. In this research, a time-based bidding strategy is introduced that uses the opponent model to concede more adaptively to the opponents, thereby achieving improved utility, social welfare, and fairness for the agent. By modeling the preference profile of the opponent during the negotiation session, this strategy sets its concession factor in proportion to the model. Experiments show that, compared with state-of-the-art agents, this agent makes better agreements in terms of individual utility and social welfare in small and medium-sized domains and can, in some cases, increase performance by up to 10%. The proposed agent gets the deal up to 37% closer to the best social bids in terms of distance to the Pareto frontier and the Nash point. An implementation of the proposed strategy was used in an agent called AgreeableAgent, which participated in the international ANAC 2018 competition and won first place in the individual utility rankings.
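The time-based core of such a strategy is commonly a concession curve of the following shape (a Faratin-style sketch; the coupling of the concession factor to the opponent model is our illustration, not the agent's exact rule):

def target_utility(t, u_min=0.5, u_max=1.0, e=0.2):
    # Target utility decays from u_max to u_min over normalized time
    # t in [0, 1]; small e concedes late (Boulware), large e concedes early.
    return u_min + (u_max - u_min) * (1.0 - t ** (1.0 / e))

def adaptive_e(opp_utility, base_e=0.2):
    # Illustrative adaptation: concede faster when the opponent model says
    # our recent bids are worth little to them (opp_utility in [0, 1]).
    return base_e + (1.0 - opp_utility) * 0.8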
One challenging problem in multiagent systems is to cooperate or compete with non-stationary agents that change behavior from time to time. An agent in such a non-stationary environment is usually expected to quickly detect the other agents' policy during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose a rectified belief model that takes advantage of the opponent model to infer the other agents' policy from both reward signals and behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library, and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
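The Bayesian policy reuse loop underneath BPR+ and DPN-BPR+ can be sketched as follows (performance models and the distilled network are abstracted away; the signatures here are our assumptions, not the paper's API):

import numpy as np

def bpr_step(belief, reward, perf_models, response_values):
    # belief: probability per known opponent policy; perf_models[i](r) is the
    # likelihood of episodic reward r under opponent policy i (e.g. a
    # Gaussian pdf); response_values[i][j]: expected return of our policy j
    # against opponent policy i.
    like = np.array([m(reward) for m in perf_models])
    belief = belief * like
    belief = belief / belief.sum()
    # Reuse the stored policy maximizing expected return under the new belief.
    best_policy = int(np.argmax(belief @ np.asarray(response_values)))
    return belief, best_policy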