The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.
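To illustrate the self-play idea in miniature, the sketch below learns the toy game of Nim purely from self-play and the rules, starting from random play. It is an assumption-laden illustration, not AlphaZero itself, which couples a deep neural network with Monte Carlo tree search; all names and parameters here (legal_moves, choose, self_play, the learning rate) are hypothetical.

# Toy illustration of learning from self-play given only the rules, in the
# spirit of (but much simpler than) AlphaZero: tabular Monte Carlo value
# learning for the game of Nim. All names and parameters are hypothetical.
import random
from collections import defaultdict

MOVES = (1, 2, 3)                       # rules: remove 1-3 stones; taking the last stone wins

def legal_moves(stones):
    return [m for m in MOVES if m <= stones]

def choose(Q, stones, eps):
    """Epsilon-greedy move selection from the learned action values."""
    if random.random() < eps:
        return random.choice(legal_moves(stones))
    return max(legal_moves(stones), key=lambda m: Q[(stones, m)])

def self_play(episodes=50_000, eps=0.1, alpha=0.1):
    Q = defaultdict(float)              # all values start at zero: random play
    for _ in range(episodes):
        stones, player, trajectory = 15, 0, []
        while stones > 0:
            move = choose(Q, stones, eps)
            trajectory.append((player, stones, move))
            stones -= move
            player ^= 1
        winner = trajectory[-1][0]      # whoever took the last stone
        for p, s, m in trajectory:      # update toward +1 for the winner's moves, -1 otherwise
            target = 1.0 if p == winner else -1.0
            Q[(s, m)] += alpha * (target - Q[(s, m)])
    return Q

Q = self_play()
print(choose(Q, 15, eps=0.0))           # greedy opening move learned purely from self-play

The same structure (play games against the current policy, then update value estimates from the outcomes) is what AlphaZero scales up with deep networks and search.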
Network slicing is a key technology in 5G communication systems. Its purpose is to dynamically and efficiently allocate resources for diversified services with distinct requirements over a common underlying physical infrastructure. Therein, demand-aware resource allocation is of significant importance to network slicing. In this paper, we consider a scenario that contains several slices in a radio access network with base stations that share the same physical resources (e.g., bandwidth or slots). We leverage deep reinforcement learning (DRL) to solve this problem by treating the varying service demands as the environment state and the allocated resources as the environment action. In order to reduce the effects of the randomness and noise embedded in the received service level agreement (SLA) satisfaction ratio (SSR) and spectrum efficiency (SE), we first propose a generative adversarial network-powered deep distributional Q network (GAN-DDQN) that learns the action-value distribution by minimizing the discrepancy between the estimated action-value distribution and the target action-value distribution. We put forward a reward-clipping mechanism to stabilize GAN-DDQN training against the effects of widely spanning utility values. Moreover, we further develop Dueling GAN-DDQN, which uses a specially designed dueling generator, to learn the action-value distribution by estimating the state-value distribution and the action advantage function. Finally, we verify the performance of the proposed GAN-DDQN and Dueling GAN-DDQN algorithms through extensive simulations.
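The following is a minimal sketch of one GAN-DDQN-style update, assuming PyTorch: a generator maps a state plus noise to a sample of the action-value distribution, a discriminator scores (state, value-sample) pairs, and the generator is pushed toward the clipped distributional Bellman target. The layer sizes, clip range, and vanilla GAN loss are illustrative assumptions, not the authors' published architecture (which may, for instance, use a Wasserstein-style objective and a separate target network).

# Hedged sketch of one GAN-DDQN-style update (PyTorch assumed). The
# generator G maps (state, noise) to one return sample per action; the
# discriminator D scores (state, value-sample) pairs. Sizes, the clip
# range, and the vanilla GAN loss are illustrative assumptions.
import torch
import torch.nn as nn

S, A, Z = 8, 4, 16                         # state dim, actions, noise dim
GAMMA, CLIP = 0.99, 1.0

G = nn.Sequential(nn.Linear(S + Z, 64), nn.ReLU(), nn.Linear(64, A))
D = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def samples(state):
    """Draw one action-value sample per action from the generator."""
    z = torch.randn(state.shape[0], Z)
    return G(torch.cat([state, z], dim=-1))

def update(state, action, reward, next_state):
    r = reward.clamp(-CLIP, CLIP)          # reward clipping stabilizes training
    with torch.no_grad():                  # distributional Bellman target for the taken action
        target = samples(state).clone()
        boot = r + GAMMA * samples(next_state).max(dim=1).values
        target.scatter_(1, action.view(-1, 1), boot.view(-1, 1))
    real, fake = target, samples(state)
    ones, zeros = torch.ones(state.shape[0], 1), torch.zeros(state.shape[0], 1)
    # Discriminator: tell Bellman targets ("real") from generator samples ("fake").
    d_loss = (bce(D(torch.cat([state, real], -1)), ones)
              + bce(D(torch.cat([state, fake.detach()], -1)), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: minimize the discrepancy by fooling the discriminator.
    g_loss = bce(D(torch.cat([state, fake], -1)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

b = 32                                     # one illustrative update on random data
update(torch.randn(b, S), torch.randint(0, A, (b,)), torch.randn(b), torch.randn(b, S))

A full agent would add experience replay, a target generator, and, for Dueling GAN-DDQN, a generator split into a state-value distribution head and an action-advantage head.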
Functional communication training (FCT) is an evidence-based practice used to mitigate challenging behavior by increasing functional communication skills. To increase the practicality and feasibility of FCT in natural settings, thinning schedules of reinforcement are typically programmed following FCT. In this review, we meta-analyzed 28 studies that incorporated a thinning schedule procedure following FCT for 51 children with intellectual and developmental disabilities ages 8 and younger. Using Tau-U, the results demonstrated overall moderate effect sizes for both challenging behavior and functional communication responses. Additionally, moderator analyses pertaining to participant characteristics, interventions, and study quality were conducted. Thinning procedures were most effective for children who had stronger communication repertoires. Implications for future research and practice are discussed.
Five experiments used a magazine approach paradigm with rats to investigate whether learning about nonreinforcement is impaired in the presence of a conditioned stimulus (CS) that had been partially reinforced (PRf). Experiment 1 trained rats with a PRf CS and a continuously reinforced (CRf) CS, then extinguished responding to both CSs presented together as a compound. Probe trials of each CS presented alone revealed that extinction was slower for the PRf CS than the CRf CS, despite being extinguished in compound. In Experiment 2, a CRf light was extinguished in compound with either a CRf CS or a PRf CS that had been matched for overall reinforcement rate. Responding to the light extinguished at the same rate regardless of the reinforcement schedule of the other CS. Experiment 3 replicated this result with a PRf light. Thus, we found no evidence that a PRf CS impairs extinction of another CS presented at the same time. Experiments 4 and 5 extended this approach to study the acquisition of conditioned inhibition by training an inhibitor in compound with either a PRf or CRf excitatory CS. The reinforcement schedule of the excitatory CS had no effect on the acquisition of inhibition. In sum, conditioning with a PRf schedule slows subsequent extinction of that CS but does not affect learning about the nonreinforcement of other stimuli presented at the same time. We conclude that the Partial Reinforcement Extinction Effect is not attributable to a decrease in sensitivity to nonreinforcement following presentation of a PRf CS.
In this technical note, an online learning algorithm is developed to solve the linear quadratic tracking (LQT) problem for partially unknown continuous-time systems. It is shown that the value function is quadratic in terms of the state of the system and the command generator. Based on this quadratic form, an LQT Bellman equation and an LQT algebraic Riccati equation (ARE) are derived to solve the LQT problem. The integral reinforcement learning technique is used to find the solution to the LQT ARE online, without requiring knowledge of the system drift dynamics or the command generator dynamics. The convergence of the proposed online algorithm to the optimal control solution is verified. To show the efficiency of the proposed approach, a simulation example is provided.
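To make the quadratic structure concrete, the following is a hedged sketch in standard LQT notation; the symbols (A, B, C, F, Q, R, \gamma) and the discounted cost are assumptions drawn from common formulations of this problem, not taken verbatim from the note. With system dynamics \dot{x} = Ax + Bu, command generator \dot{r} = Fr, and augmented state X = [x^\top \; r^\top]^\top, the discounted value function

V(X(t)) = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)} \left[ (Cx - r)^\top Q (Cx - r) + u^\top R u \right] d\tau = \frac{1}{2} X(t)^\top P X(t)

is quadratic in the joint state of the system and the command generator. Rewriting the cost with Q_1 = [C \;\; -I]^\top Q \, [C \;\; -I] gives an LQT Bellman equation over a measurement interval \Delta t,

X(t)^\top P X(t) = \int_t^{t+\Delta t} e^{-\gamma(\tau - t)} \left[ X^\top Q_1 X + u^\top R u \right] d\tau + e^{-\gamma \Delta t} \, X(t+\Delta t)^\top P X(t+\Delta t),

which integral reinforcement learning can solve for P from measured trajectories alone, since the drift matrices A and F never appear explicitly. The associated LQT ARE and optimal control are

T^\top P + P T - \gamma P + Q_1 - P B_1 R^{-1} B_1^\top P = 0, \qquad u^* = -R^{-1} B_1^\top P X,

where T = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} and B_1 = \begin{bmatrix} B \\ 0 \end{bmatrix}.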
•Roadway deformation with loose and fractured surrounding rock is large and difficult to control.
•The plastic zone distributes in an oval shape, with different forces in the horizontal and vertical directions.
•Optimization of the anchor array pitch was numerically simulated with FLAC3D.
•Whole section anchor–grouting reinforcement technology was successfully applied in field experiments.
Strata control technology for roadways with loose and fractured surrounding rock is one of the most challenging areas in underground roadway support. Based on the case of the serious deformations that occurred in a western main roadway of a Chinese coal mine, this paper analyzes the characteristics and influencing factors of deformation of roadways with loose and fractured surrounding rock. A mechanical model of the roadway’s surrounding rock has been established to study the different forces at play in the plastic zone of the roadway by means of site observation, theoretical analysis, numerical simulation and onsite experiments. Based on the double shell anchor–grouting reinforcement mechanism, whole section anchor–grouting reinforcement technology (WSAGRT) was implemented in the coal mine, with the pertinent support parameters optimized by numerical simulation. The results show that the deformation of the rock surrounding the roadway has been held in check effectively, and thus WSAGRT provides a safer and more efficient mining environment.
This literature review summarises the influence of fibres on the main parameters governing corrosion of conventional reinforcement. The ability of fibres to suppress crack growth has been shown to decrease permeation in cracked concrete, while chloride diffusion in uncracked concrete seems to remain unaffected by the addition of fibres. Steel fibres in concrete are considered to be insulated owing to the high impedance of the passive layer. However, they will become conductive if they are depassivated. Although low carbon steel fibres may suffer severe corrosion when located near the concrete surface or bridging cracks, embedded fibres will remain free of corrosion despite high chloride contents. Published experimental observations indicate that fibres had little influence on the corrosion rate of rebars. Steel fibres improved the corrosion resistance of rebars moderately; this is mainly attributed to a reduced ingress of chlorides due to arrested crack growth.
•Exploration algorithms can be distinguished in terms of the bias and slope of choice functions.
•Two experiments show evidence for both directed and random exploration.
•A hybrid algorithm provides the best quantitative model of the choice data.
The dilemma between information gathering (exploration) and reward seeking (exploitation) is a fundamental problem for reinforcement learning agents. How humans resolve this dilemma is still an open question, because experiments have provided equivocal evidence about the underlying algorithms used by humans. We show that two families of algorithms can be distinguished in terms of how uncertainty affects exploration. Algorithms based on uncertainty bonuses predict a change in response bias as a function of uncertainty, whereas algorithms based on sampling predict a change in response slope. Two experiments provide evidence for both bias and slope changes, and computational modeling confirms that a hybrid model is the best quantitative account of the data.
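The two families contrasted above can be made concrete with a two-armed Gaussian bandit. In the sketch below, an uncertainty-bonus rule (UCB-style, directed exploration) and posterior sampling (Thompson-style, random exploration) are written as choice probabilities; the logistic choice rule, bonus weight, and temperature are illustrative assumptions, not the paper's fitted models.

# Hedged sketch contrasting directed and random exploration in a two-armed
# Gaussian bandit. The logistic choice rule, bonus weight, and temperature
# are illustrative assumptions, not the paper's fitted models.
import math

def p_choose_1_bonus(mu1, mu2, sd1, sd2, bonus=1.0, temp=1.0):
    """Uncertainty bonus (UCB-style, directed): uncertainty shifts the *bias*
    of the choice function by inflating each arm's value by its uncertainty."""
    v1, v2 = mu1 + bonus * sd1, mu2 + bonus * sd2
    return 1.0 / (1.0 + math.exp(-(v1 - v2) / temp))

def p_choose_1_thompson(mu1, mu2, sd1, sd2):
    """Posterior sampling (Thompson-style, random): P(choose 1) equals
    P(sample1 > sample2), so total uncertainty flattens the *slope* of the
    choice function without shifting its bias."""
    return 0.5 * (1.0 + math.erf((mu1 - mu2) / math.sqrt(2 * (sd1 ** 2 + sd2 ** 2))))

# Equal means, arm 1 more uncertain: the bonus rule is biased toward arm 1,
# while posterior sampling remains indifferent.
print(p_choose_1_bonus(0.0, 0.0, 2.0, 0.5))     # > 0.5 (bias shift)
print(p_choose_1_thompson(0.0, 0.0, 2.0, 0.5))  # = 0.5 (slope change only)

With equal means, the bonus rule biases choice toward the more uncertain arm (an intercept shift), whereas posterior sampling stays indifferent but grows less sensitive to mean differences as uncertainty increases (a slope change), matching the bias and slope signatures the experiments test for.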
Learning from successes and failures often improves the quality of subsequent decisions. Past outcomes, however, should not influence purely perceptual decisions after task acquisition is complete, since these are designed so that only sensory evidence determines the correct choice. Yet, numerous studies report that outcomes can bias perceptual decisions, causing spurious changes in choice behavior without improving accuracy. Here we show that the effects of reward on perceptual decisions are principled: past rewards bias future choices specifically when the previous choice was difficult and hence decision confidence was low. We identified this phenomenon in six datasets from four laboratories, across mice, rats, and humans, and sensory modalities from olfaction and audition to vision. We show that this choice-updating strategy can be explained by reinforcement learning models incorporating statistical decision confidence into their teaching signals. Thus, reinforcement learning mechanisms are continually engaged to produce systematic adjustments of choices, even in well-learned perceptual decisions, in order to optimize behavior in an uncertain world.
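Below is a minimal sketch of such a confidence-weighted teaching signal, under assumptions of Gaussian sensory noise and a simple additive choice bias (none of which is taken from the paper's fitted models): the update is the outcome minus the decision confidence, so rewards following difficult, low-confidence choices produce the largest bias shifts.

# Hedged sketch of a confidence-weighted reinforcement learning update of the
# kind described above: the teaching signal is the outcome minus the decision
# confidence. The Gaussian evidence model, additive bias, and learning rate
# are illustrative assumptions, not the paper's fitted models.
import math
import random

def confidence(evidence, sigma):
    """Probability the choice is correct, given the signed evidence."""
    return 0.5 * (1.0 + math.erf(abs(evidence) / (sigma * math.sqrt(2))))

def run_trials(n=10_000, sigma=1.0, alpha=0.2):
    bias = 0.0                            # additive choice bias, updated by RL
    for _ in range(n):
        stimulus = random.uniform(-1, 1)  # signed stimulus strength
        evidence = stimulus + random.gauss(0, sigma) + bias
        choice = 1 if evidence > 0 else -1
        correct = (choice > 0) == (stimulus > 0)
        reward = 1.0 if correct else 0.0
        conf = confidence(evidence, sigma)
        rpe = reward - conf               # teaching signal: outcome minus confidence
        bias += alpha * rpe * choice      # rewarded low-confidence choices shift the bias most
    return bias

print(run_trials())  # residual bias; fluctuates around 0 for symmetric stimuli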
The partial reinforcement extinction effect (PREE) refers to the phenomenon that conditioned responding extinguishes more slowly if subjects had been inconsistently ("partially") reinforced than if they had been reinforced on every trial ("continuously" reinforced). One largely successful account of the PREE, known as sequential theory (Capaldi, 1966), suggests that, when subjects are partially reinforced, they learn that memories of sequences of nonreinforced trials are associated with subsequent reinforcement. This association helps to maintain responding (i.e., delay extinction) when the subjects experience nonreinforced trials during extinction. Sequential theory's explanation of the PREE hinges on subjects learning sequences of nonreinforced trials during acquisition. However, direct evidence for such sequential learning is not available in previous studies of the PREE, in which animals are trained with randomly intermixed sequences of different lengths and therefore cannot anticipate whether a given trial will be reinforced during acquisition. The current study conducted two experiments that trained rats with a single fixed trial sequence, in order to provide evidence of sequential learning during conditioning and then observe its effect on the PREE. Under one condition the rats did learn the fixed sequence but did not subsequently show a PREE, whereas other rats that did show a PREE had not learned the trial sequences during conditioning. Therefore, contrary to sequential theory's prediction, our results suggest that learning about the trial sequence is neither necessary nor sufficient for the PREE. We suggest that the PREE may instead depend on uncertainty about whether the conditioned stimulus will be reinforced.