This paper presents MiniZero, a zero-knowledge learning framework that supports four state-of-the-art algorithms: AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have demonstrated super-human performance in many games, it remains unclear which among them is most suitable or efficient for specific tasks. Through MiniZero, we systematically evaluate the performance of each algorithm on two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games. For the two board games, using more simulations generally results in higher performance; however, the choice between AlphaZero and MuZero may differ based on game properties. For Atari games, both MuZero and Gumbel MuZero are worth considering: since each game has unique characteristics, different algorithms and simulation counts yield varying results. In addition, we introduce an approach, called progressive simulation, which progressively increases the simulation budget during training to allocate computation more efficiently. Our empirical results demonstrate that progressive simulation achieves significantly superior performance on the two board games. By making our framework and trained models publicly available, this paper contributes a benchmark for future research on zero-knowledge learning algorithms, assisting researchers in algorithm selection and comparison against these zero-knowledge learning baselines. Our code and data are available at https://rlg.iis.sinica.edu.tw/papers/minizero .
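As a rough illustration, a progressive simulation schedule might look like the following minimal sketch; the linear schedule, function name, and budget range are illustrative assumptions, not the paper's exact rule:

```python
def simulation_budget(iteration: int, total_iterations: int,
                      min_sims: int = 16, max_sims: int = 400) -> int:
    """Linearly interpolate the per-move MCTS simulation count over
    training (illustrative schedule, not the paper's exact rule)."""
    frac = iteration / max(1, total_iterations - 1)
    return int(min_sims + frac * (max_sims - min_sims))

# Early iterations search cheaply; later iterations search deeply.
for it in (0, 150, 299):
    print(it, simulation_budget(it, 300))
```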
This article aims to trace the evolution of intelligent agents capable of playing board games. It gives a brief historical review of the agents developed for various games and describes AlphaZero, the agent created by DeepMind, which is currently the most advanced agent in this area and is capable of defeating human champions at the game of Go, considered the most complex board game in existence, even more so than chess. It also discusses the natural next step for game-playing agents, given the success of AlphaZero and the emergence of the General Game Playing paradigm, which seeks to create agents capable of playing any board game without any human intervention.
Natural disasters such as storms usually cause significant damage to distribution grids. This paper investigates the optimal routing of utility vehicles to restore outages in the distribution grid as fast as possible after a storm. First, the post-storm repair crew dispatch task with multiple utility vehicles is formulated as a sequential stochastic optimization problem. In the formulated optimization model, the belief state of the power grid is updated according to phone calls from customers and information collected by utility vehicles. Second, an AlphaZero-based utility vehicle routing (AlphaZero-UVR) approach is developed to achieve real-time dispatching of the repair crews. The proposed AlphaZero-UVR approach combines stochastic Monte-Carlo tree search (MCTS) with deep neural networks to make lookahead search decisions, learning to navigate repair crews without human guidance. Simulation results show that the proposed approach can efficiently navigate crews to repair all outages.
• Formulating a post-storm repair crew dispatch model with multiple vehicles (a minimal belief-update sketch follows this list).
• Proposing an AlphaZero-based post-storm utility vehicle routing strategy.
• Modifying the original AlphaZero algorithm by combining a DNN with stochastic MCTS.
• The proposed DRL-based method outperforms traditional MCTS methods.
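To make the belief-state update concrete, here is a minimal sketch of a Bayesian per-node outage-belief update from customer phone calls; the likelihood values and the independence assumption across nodes are illustrative, not the paper's exact model:

```python
import numpy as np

def update_belief(prior: np.ndarray, call_received: np.ndarray,
                  p_call_given_outage: float = 0.7,
                  p_call_given_ok: float = 0.01) -> np.ndarray:
    """Bayesian update of per-node outage probabilities from customer
    calls (likelihoods are illustrative assumptions)."""
    like_out = np.where(call_received, p_call_given_outage, 1 - p_call_given_outage)
    like_ok = np.where(call_received, p_call_given_ok, 1 - p_call_given_ok)
    return like_out * prior / (like_out * prior + like_ok * (1 - prior))

belief = np.full(5, 0.3)            # prior outage probability per node
calls = np.array([1, 0, 1, 0, 0])   # which customers have called in
print(update_belief(belief, calls)) # posterior outage probabilities
```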
To address the issue that the quantum approximate optimization algorithm frequently encounters local minima, and the cost of parameter optimization within complex non-convex energy landscapes, we consider a warm-start method. This approach leverages a characteristic of transition states in the enhanced optimizer, namely descending along unique negative-curvature directions, to find smaller local minima. Our results indicate that, with the assistance of an enhanced pre-training structure based on the AlphaZero AI model, the initialization generalization ability of the new optimizer is significantly enhanced across various test sets. We train on 2-SAT training sets with clause densities between α ≈ 2.6 and α ≈ 2.89, and transfer to more complex test sets. The average residual energy density in transfer learning consistently remains below 0.01, even achieving a transfer success probability of 98% on hard instances with α ≈ 3.7. The search efficiency, pre-trained by ensemble learning, was significantly enhanced, while requiring only simple interpolation of a few transition points to transfer to the global optimal solutions at higher sample clause densities.
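For reference, the residual energy density quoted above is conventionally defined as the excess of the obtained energy over the ground-state energy, normalized by the problem size; the paper's exact normalization is an assumption here:

```latex
% Standard definition (the paper's normalization may differ):
\epsilon_{\mathrm{res}} = \frac{\langle E \rangle - E_{\min}}{N}
```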
Endgame studies have long served as a tool for testing human creativity and intelligence. We find that they can serve as a tool for testing machine ability as well. Two of the leading chess engines, Stockfish and Leela Chess Zero (LCZero), employ significantly different methods during play. We use Plaskett's Puzzle, a famous endgame study from the late 1970s, to compare the two engines. Our experiments show that Stockfish outperforms LCZero on the puzzle. We examine the algorithmic differences between the engines and use our observations as a basis for carefully interpreting the test results. Drawing inspiration from how humans solve chess problems, we ask whether machines can possess a form of imagination. On the theoretical side, we describe how Bellman's equation may be applied to optimize the probability of winning. To conclude, we discuss the implications of our work on artificial intelligence (AI) and artificial general intelligence (AGI), suggesting possible avenues for future research.
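The Bellman formulation the authors mention can be sketched as a standard win-probability recursion; this is a generic form, not necessarily the paper's exact notation:

```latex
% A(s) is the set of legal moves in position s, and s' the position
% reached after playing a. V(s) is the probability of eventually winning:
V(s) = \max_{a \in A(s)} \mathbb{E}\left[\, V(s') \mid s, a \,\right],
\qquad V(s) = 1 \text{ if } s \text{ is won},
\quad V(s) = 0 \text{ if } s \text{ is drawn or lost}.
```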
The purpose of this paper is to propose and develop a new conceptual framework for approximate Dynamic Programming (DP) and Reinforcement Learning (RL). This framework centers around two algorithms, which are designed largely independently of each other and operate in synergy through the powerful mechanism of Newton’s method. We call these the off-line training and the on-line play algorithms; the names are borrowed from some of the major successes of RL involving games. Primary examples are the recent (2017) AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon). In these game contexts, the off-line training algorithm is the method used to teach the program how to evaluate positions and to generate good moves at any given position, while the on-line play algorithm is the method used to play in real time against human or computer opponents.
Both AlphaZero and TD-Gammon were trained off-line extensively using neural networks and an approximate version of the fundamental DP algorithm of policy iteration. Yet the AlphaZero player that was obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used to select moves, based on multistep lookahead minimization and a terminal position evaluator that was trained using experience with the off-line player. The on-line player performs a form of policy improvement, which is not degraded by neural network approximations. As a result, it greatly improves the performance of the off-line player.
Similarly, TD-Gammon performs on-line a policy improvement step using one-step or two-step lookahead minimization, which is not degraded by neural network approximations. To this end it uses an off-line neural network-trained terminal position evaluator, and importantly it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).
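A minimal sketch of this lookahead-plus-rollout mechanism appears below; the function names are illustrative, and the alternation of opponents and dice rolls is omitted for brevity:

```python
def rollout_value(state, evaluator, successors, depth: int = 10):
    """Simulate a few steps with the greedy one-step-lookahead policy
    induced by the trained evaluator, then score the end state."""
    for _ in range(depth):
        moves = successors(state)
        if not moves:
            break
        # Greedy one-step lookahead: follow the evaluator's favorite successor.
        state = max(moves, key=evaluator)
    return evaluator(state)

def select_move(state, evaluator, successors):
    """One-step lookahead extended by rollout, in the spirit of TD-Gammon."""
    return max(successors(state),
               key=lambda s: rollout_value(s, evaluator, successors))
```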
An important lesson from AlphaZero and TD-Gammon is that the performance of an off-line trained policy can be greatly improved by on-line approximation in value space, with long lookahead (involving minimization or rollout with the off-line policy, or both), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is grounded in algorithmic mathematics and is the focal point of this work:
(a) Approximation in value space with one-step lookahead minimization amounts to a step of Newton’s method for solving Bellman’s equation.
(b) The starting point for the Newton step is based on the results of off-line training, and may be enhanced by longer lookahead minimization and on-line rollout. Indeed the major determinant of the quality of the on-line policy is the Newton step that is performed on-line, while off-line training plays a secondary role by comparison.
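In the notation of the author's books, point (a) can be stated compactly; the following is a compressed sketch of the construction:

```latex
% Bellman's equation as a fixed-point problem, with T the Bellman operator:
J^* = T J^*, \qquad
(TJ)(x) = \min_{u \in U(x)} \mathbb{E}\big[\, g(x,u,w) + \alpha J\big(f(x,u,w)\big) \big].
% One-step lookahead from an off-line approximation \tilde J selects the
% policy \tilde\mu attaining the minimum:
T_{\tilde\mu} \tilde J = T \tilde J.
% The cost J_{\tilde\mu} of the lookahead policy is the fixed point of the
% linear operator T_{\tilde\mu}, i.e., the iterate produced by one Newton
% step for solving J = TJ, linearized at \tilde J:
J_{\tilde\mu} = T_{\tilde\mu} J_{\tilde\mu}.
```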
Significantly, the synergy between off-line training and on-line play also underlies Model Predictive Control (MPC), a major control system design methodology that has been extensively developed since the 1980s. This synergy can be understood in terms of abstract models of infinite horizon DP and simple geometrical constructions, and helps to explain the all-important stability issues within the MPC context. In this work we aim to provide insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. In the process, we will bring out the strong connections between the artificial intelligence view of RL, and the control theory views of MPC and adaptive control. While we will deemphasize mathematical proofs, there is considerable related analysis, which supports our conclusions and can be found in the author’s recent RL books (Bertsekas, 2019; Bertsekas, 2020), and the abstract DP monograph (Bertsekas, 2022).
One of our principal aims is to show, through the algorithmic ideas of Newton’s method and the unifying principles of abstract DP, that the AlphaZero/TD-Gammon methodology of approximation in value space and rollout applies very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces, as well as finite and infinite horizon.
On the Value of Chess Squares. Gupta, Aditya; Maharaj, Shiva; Polson, Nicholas. Entropy, Vol. 25, No. 10, 09/2023. Journal article, peer-reviewed, open access.
We propose a neural network-based approach to calculate the value of a chess square–piece combination. Our model takes a triplet (color, piece, square) as the input and calculates a value that measures the advantage/disadvantage of having this piece on this square. Our methods build on recent advances in chess AI, and can accurately assess the worth of positions in a game of chess. The conventional approach assigns fixed values to pieces (K = ∞, Q = 9, R = 5, B = 3, N = 3, P = 1). We enhance this analysis by introducing marginal valuations. We use deep Q-learning to estimate the parameters of our model. We demonstrate our method by examining the positioning of knights and bishops, and also provide valuable insights into the valuation of pawns. Finally, we conclude by suggesting potential avenues for future research.
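As a rough illustration of such a triplet-to-value mapping, here is a minimal sketch; the embedding sizes, layer widths, and class name are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SquareValueNet(nn.Module):
    """Illustrative network mapping a (color, piece, square) triplet
    to a scalar value (sizes are assumptions, not the paper's)."""
    def __init__(self):
        super().__init__()
        self.color = nn.Embedding(2, 4)     # white / black
        self.piece = nn.Embedding(6, 8)     # K, Q, R, B, N, P
        self.square = nn.Embedding(64, 16)  # a1 .. h8
        self.head = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, color, piece, square):
        x = torch.cat([self.color(color), self.piece(piece), self.square(square)], dim=-1)
        return self.head(x).squeeze(-1)

net = SquareValueNet()
# Example query: a white knight (index 4) on d4 (square index 27).
value = net(torch.tensor([0]), torch.tensor([4]), torch.tensor([27]))
print(value.item())
```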
The Cross-Taiwan Strait Railway (CTSR) is a significant construction project, and automatic driving of high-speed trains (HSTs) on the CTSR is a crucial technology for its operation. Before the CTSR opens, it is forward-looking to conduct theoretical research. Inspired by AlphaZero, this paper combines expert experience and Newtonian mechanics to propose a method for automatically generating massive numbers of virtual driving curves for HSTs on the CTSR. To compress the solution space from infinite to finite, we set the driving acceleration and speed-limit boundary of HSTs from the perspective of expert experience. The model then combines big-data technology to generate massive numbers of virtual driving curves automatically and uses statistical analysis to select high-performance ones. We found that: 1) numerous virtual driving curves can be automatically generated according to different restricted speeds and line conditions, showing solid adaptability; 2) the driving curves cover a specific running-time range and reach all running times in that range, showing good ergodicity; 3) the running time of each driving curve is tracked in real time to within 0.1 s, showing high accuracy; and 4) many driving curves are available for each running time, allowing selection of the best-performing ones, showing good selectivity. In conclusion, this method can automatically generate virtual driving curves for HSTs on the CTSR, providing high-quality virtual data for future research on the automatic driving of HSTs on the CTSR.
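A minimal sketch of how a single constrained driving curve could be generated under such acceleration and speed-limit boundaries follows; the constants and the accelerate/cruise/brake structure are illustrative assumptions, while the paper generates massive numbers of such curves and filters them statistically:

```python
def driving_curve(speed_limit_kmh: float, accel: float, decel: float,
                  distance_m: float, dt: float = 0.1):
    """Generate one accelerate/cruise/brake speed profile under an
    acceleration bound and a speed limit, sampled every 0.1 s."""
    v_max = speed_limit_kmh / 3.6  # convert km/h to m/s
    t, x, v, profile = 0.0, 0.0, 0.0, []
    while x < distance_m:
        braking_dist = v ** 2 / (2 * decel)   # distance needed to stop from v
        if distance_m - x <= braking_dist:
            v = max(0.0, v - decel * dt)      # brake so the train arrives at rest
        else:
            v = min(v_max, v + accel * dt)    # accelerate up to the speed limit
        x += v * dt
        t += dt
        profile.append((round(t, 1), round(v, 3)))
    return profile

# Example: a 5 km section with a 300 km/h limit and 0.5 m/s^2 bounds.
curve = driving_curve(300.0, 0.5, 0.5, 5000.0)
print(f"{len(curve)} samples, arrival at t = {curve[-1][0]} s")
```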
Computer games have long been regarded as an important field of artificial intelligence (AI). The AlphaZero structure has been successful in the game of Go, beating top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties on AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to NoGo, a game similar to Go. NoGoZero+ employs several innovative features to improve training speed and performance, and most of its improvement strategies can be transferred to other, nonspecific areas. This paper compares it with the original AlphaZero process; results show that NoGoZero+ increases training speed to about six times that of the original AlphaZero process. Moreover, in our experiments, our agent beat the original AlphaZero agent with a score of 81:19 after being trained on only 20,000 self-play games (a small quantity compared with the 120,000 self-play games consumed by the original AlphaZero). The NoGo program based on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with limited resources, defeating many AlphaZero-based programs. Our code, pretrained models, and self-play datasets are publicly available. The ultimate goal of this paper is to provide exploratory insights and mature auxiliary tools to enable AI researchers and computer-game communities to study, test, and improve these promising state-of-the-art methods at a much lower cost in computing resources.