In deep reinforcement learning, experience replay is usually used to improve data efficiency and alleviate experience forgetting. However, online reinforcement learning is often influenced by the index of experience, which frequently leads to unbalanced sampling. In addition, most experience replay methods ignore the differences among experiences and cannot make full use of all of them. In particular, many "near"-policy experiences that are highly relevant to the current policy are wasted, despite the fact that they are beneficial for improving sample efficiency. This paper theoretically analyzes the influence of various factors on experience sampling, and then proposes a sampling method for experience replay based on frequency and similarity (FSER) to alleviate unbalanced sampling and increase the value of the sampled experiences. FSER prefers experiences that are rarely sampled or highly relevant to the current policy, and thus plays a critical role in balancing the experience forgetting and wasting problems. Finally, FSER is combined with TD3 to achieve state-of-the-art results on multiple tasks.
•This paper formulates the sampling probability and analyzes its monotonic influence.
•Three optional sampling strategies based on sampling frequency are designed.
•A sampling strategy based on similarity is designed.
•The FSER (experience replay based on frequency and similarity) algorithm is proposed.
•FSER outperforms state-of-the-art methods on multiple tasks.
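As a minimal sketch of the sampling rule this abstract describes, assuming a simple list-backed buffer in Python; the inverse-frequency weighting, the similarity scores, and the hyperparameters alpha and beta are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

class FSERBuffer:
    """Replay buffer that prefers rarely sampled or policy-relevant experiences."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []     # stored transitions
        self.counts = []      # how often each transition has been sampled
        self.similarity = []  # relevance of each transition to the current policy

    def add(self, transition, sim_to_policy):
        if len(self.storage) >= self.capacity:
            # Drop the oldest transition when the buffer is full.
            self.storage.pop(0)
            self.counts.pop(0)
            self.similarity.pop(0)
        self.storage.append(transition)
        self.counts.append(0)
        self.similarity.append(sim_to_policy)

    def sample(self, batch_size, alpha=1.0, beta=1.0):
        counts = np.asarray(self.counts, dtype=np.float64)
        sims = np.asarray(self.similarity, dtype=np.float64)
        # Score rises for rarely sampled transitions (low count) and for
        # transitions similar to the current policy (high sim).
        scores = alpha / (1.0 + counts) + beta * sims
        probs = scores / scores.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=probs)
        for i in idx:
            self.counts[i] += 1  # update the sampling frequency
        return [self.storage[i] for i in idx]
```

Under this weighting, a transition's probability grows when it has rarely been drawn or when it is close to the current policy, which is the trade-off between forgetting and wasting that the abstract describes.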
Reinforcement learning, evolutionary algorithms, and imitation learning are three principal methods for continuous control tasks. Reinforcement learning is sample efficient, yet sensitive to hyperparameter settings and in need of efficient exploration; evolutionary algorithms are stable but sample inefficient; imitation learning is both sample efficient and stable, but requires the guidance of expert data. In this paper, we propose the Recruitment-imitation Mechanism (RIM) for evolutionary reinforcement learning, a scalable framework that combines the advantages of the three methods above. The core of this framework is a dual-actor, single-critic reinforcement learning agent. This agent recruits high-fitness actors from the population running the evolutionary algorithm, which instruct it as it learns from the experience replay buffer. At the same time, low-fitness actors in the evolutionary population imitate the behavior patterns of the reinforcement learning agent to promote their fitness. The reinforcement and imitation learners in this framework can be replaced with any off-policy actor-critic reinforcement learner and any data-driven imitation learner. We evaluate RIM on a series of continuous control benchmarks in MuJoCo. The experimental results show that RIM outperforms prior evolutionary and reinforcement learning methods. The performance of RIM's components is significantly better than that of the components of previous evolutionary reinforcement learning algorithms, and recruitment using soft updates enables the reinforcement learning agent to learn faster than recruitment using hard updates.
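A minimal sketch of the two RIM interactions the abstract names, recruitment via soft update and imitation by low-fitness actors, assuming PyTorch actors; the function names and the tau value are placeholders rather than the paper's exact design:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recruit_soft(rl_actor, elite_actor, tau=0.005):
    """Softly pull the RL actor's weights toward a high-fitness evolutionary actor."""
    for p_rl, p_elite in zip(rl_actor.parameters(), elite_actor.parameters()):
        p_rl.mul_(1.0 - tau).add_(tau * p_elite)

def imitation_loss(low_fitness_actor, rl_actor, states):
    """Low-fitness actors imitate the RL agent's actions (behavior cloning)."""
    with torch.no_grad():
        target_actions = rl_actor(states)  # teacher actions, no gradient
    return F.mse_loss(low_fitness_actor(states), target_actions)
```

A hard update would copy the elite weights outright (tau = 1); the abstract reports that the soft variant lets the reinforcement learning agent learn faster.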
To mitigate the distribution difference between the source and target domains, many unsupervised domain adaptation methods achieve class-level alignment by aligning the prototypes of the two domains. Since the labels of the target domain are unobserved, the target prototypes are constructed from pseudo-labels. However, inaccurate pseudo-labels may lead to biased prototypes, which introduce noise into the distribution alignment. Moreover, either a shared feature extractor or two separate feature extractors are used to extract domain-invariant features; the former is limited by a large domain gap, while the latter increases the number of network parameters. To this end, we propose a Softmax-Based Prototype construction and Adaptation (SBPA) method, which constructs prototypes based on the softmax output of the classifier instead of ground-truth labels or pseudo-labels. SBPA performs domain-level alignment through adversarial training and class-level alignment by aligning prototypes of the same class. In addition, SBPA contains a residual block that explicitly models the difference between the source and target domain features extracted by a shared feature extractor. We evaluate our method on four widely used datasets, and the results show that it outperforms recent domain adaptation methods, especially on DomainNet, the hardest domain adaptation dataset by far.
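A minimal sketch of softmax-based prototype construction, assuming PyTorch tensors of features with shape (N, d) and classifier logits with shape (N, C); weighting each feature by its softmax probability instead of a hard pseudo-label follows the abstract's description, though the exact form is an assumption:

```python
import torch

def softmax_prototypes(features, logits):
    """Class prototypes as softmax-probability-weighted means of features."""
    probs = torch.softmax(logits, dim=1)          # (N, C) soft class assignments
    weighted_sum = probs.t() @ features           # (C, d) probability-weighted sums
    weights = probs.sum(dim=0, keepdim=True).t()  # (C, 1) total soft mass per class
    return weighted_sum / weights.clamp(min=1e-8) # (C, d) prototypes
```

Because every sample contributes to every class in proportion to the classifier's confidence, a single mislabeled sample cannot skew a prototype the way a hard pseudo-label can.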
This paper aims to solve sample inefficiency in Asynchronous Advantage Actor-Critic (A3C). First, we design a new off-policy actor-critic algorithm that combines actor-critic with experience replay to improve sample efficiency. Next, we study the sampling method of experience replay for trajectory experiences and propose a familiarity-based replay mechanism, which uses the number of times an experience has been replayed as its sampling probability weight. Finally, we use the GAE-V method to correct the bias caused by off-policy learning. We also achieve better performance by adopting a mechanism that combines off-policy and on-policy learning to update the network. Our results on the Atari and MuJoCo benchmarks show that each of these innovations contributes to improvements in both data efficiency and final performance. Furthermore, our approach keeps a fast convergence speed and the same parallel structure as A3C, and also performs better at exploration.
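One plausible reading of the familiarity-based weight, sketched minimally in NumPy; the abstract says only that the replay count is used as the probability weight, so the inverse-count choice here (preferring less-familiar trajectories) is an assumption, as are all names:

```python
import numpy as np

def sample_trajectory(replay_counts, rng=np.random):
    """Pick one trajectory index, weighting by how rarely it has been replayed."""
    counts = np.asarray(replay_counts, dtype=np.float64)
    weights = 1.0 / (1.0 + counts)      # less-replayed trajectories score higher
    probs = weights / weights.sum()
    i = int(rng.choice(len(counts), p=probs))
    replay_counts[i] += 1               # record this replay for future sampling
    return i
```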
Domain adaptation aims to mitigate the domain gap between the source and target domains so that knowledge can be transferred between them. Two key factors determine adaptation performance: transferability and discriminability. Transferability depends on the similarity of the two domains; with transferability, a model learned on the source domain can be used in the target domain. Discriminability indicates the separability of different classes; with discriminability, the adapted target features can be classified more accurately. Adversarial domain adaptation methods learn domain-invariant feature representations through adversarial learning, and the domain-invariant representation guarantees transferability. However, to obtain domain-invariant features, certain domain-specific information is suppressed, which may cause a loss of discriminability. To this end, we aim to enhance discriminability by enriching the information contained in the domain-invariant features. We propose a Feature Concatenation for adversarial Domain Adaptation (FCDA) method. FCDA learns two feature extractors that generate two different feature views of a sample. The concatenation of these two views is used as the sample's feature representation, which we call the concatenation feature. Distribution alignment is performed on the concatenation features. We find that when the distributions of the concatenation features are aligned, the two feature views within a concatenation feature have different distributions. Thus, the concatenation feature contains more discriminative information, thereby enhancing the discriminative ability of the domain-invariant features. Experiments on four widely used datasets show that FCDA outperforms recent domain adaptation methods.
•The proposed adversarial domain adaptation method enhances the discriminability.
•The proposed method represents a sample by concatenating two different views.
•The consistency and complementarity of two views are guaranteed in both domains.
•The model is optimized in an adversarial way.
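A minimal sketch of the concatenation feature described above, assuming PyTorch; the extractor architectures and dimensions are placeholders, not the paper's networks:

```python
import torch
import torch.nn as nn

class ConcatFeature(nn.Module):
    """Two extractors produce two views of a sample; their concatenation
    is the representation on which distribution alignment is performed."""

    def __init__(self, in_dim=2048, view_dim=256):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(in_dim, view_dim), nn.ReLU())
        self.f2 = nn.Sequential(nn.Linear(in_dim, view_dim), nn.ReLU())

    def forward(self, x):
        v1 = self.f1(x)                    # first feature view
        v2 = self.f2(x)                    # second, complementary feature view
        return torch.cat([v1, v2], dim=1)  # concatenation feature for alignment
```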
Unsupervised domain adaptation (UDA) extends a model trained on well-annotated source data to unlabeled target data. In practice, however, due to privacy and storage issues, we can often obtain only the well-trained source model. In this paper, we focus on this scenario, named source-free domain adaptation (SFDA). Existing nearest-neighbor-based SFDA methods assume that the target features extracted by the source model form clear clusters, and align samples with their neighbors. However, due to the domain discrepancy, adjacent features may belong to different categories. We propose consistency regularization-based mutual alignment (CRMA) to address this problem. First, we randomly augment each target sample. Because of the domain discrepancy, aligning the original and augmented samples directly may lead to negative transfer. Therefore, second, we apply the information maximization loss to all target and augmented samples, improving the quality of the mutual alignment. Finally, we mutually align the original and augmented samples. This strengthens the model and increases the variety of samples, alleviating the incorrect alignment that occurs when aligning samples with their neighbors. CRMA achieves state-of-the-art performance on three popular cross-domain benchmarks, improving over the original method by 0.4% (to 89.4%), 1.9% (to 72.2%), and 1.9% (to 85.9%) on the three datasets, respectively. Finally, we verify the effectiveness of each part of CRMA through ablation experiments and analyze CRMA in detail through a series of experiments.
•We propose the mutual alignment strategy.
•We design the mutual alignment loss based on entropy.
•We leverage information maximization and entropy to enhance predictions.
•We conduct a series of experiments to demonstrate the effectiveness of CRMA.
•Our method outperforms the SOTA methods on three popular benchmarks.
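A minimal sketch of the two losses named in the abstract and highlights, assuming PyTorch; the exact forms (an entropy-based information maximization term and a symmetric KL divergence for mutual alignment) are assumptions consistent with the description, not CRMA's published definitions:

```python
import torch
import torch.nn.functional as F

def info_max_loss(logits):
    """Encourage confident yet diverse predictions over a batch."""
    p = F.softmax(logits, dim=1)
    # Low per-sample entropy: each prediction should be confident.
    ent = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    # High entropy of the mean prediction: classes should stay diverse.
    p_mean = p.mean(dim=0)
    div = -(p_mean * torch.log(p_mean + 1e-8)).sum()
    return ent - div

def mutual_alignment_loss(logits_orig, logits_aug):
    """Symmetrically align each sample's prediction with its augmented view."""
    p = F.softmax(logits_orig, dim=1)
    q = F.softmax(logits_aug, dim=1)
    kl_pq = F.kl_div(torch.log(q + 1e-8), p, reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(torch.log(p + 1e-8), q, reduction="batchmean")  # KL(q || p)
    return kl_pq + kl_qp
```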
This paper proposes a new energy storage system (ESS) design that includes both batteries and ultracapacitors (UCs) for hybrid electric vehicle (HEV) and electric vehicle applications. Conventional designs require a DC-DC converter to interface the UC unit. Herein, the UC can be switched directly across the motor drive DC link during peak power demands. The resulting wide voltage variation due to UC power transfer is handled by a simple modulator introduced in this paper, so that motor drive performance is not disrupted. Based on this new methodology, the paper further introduces two ESS schemes with different topologies, which differ in 1) UC rating and 2) energy flow control. They are applicable to both lightly and heavily hybridized HEVs. Both schemes offer high efficiency (without a DC-DC link) and low cost. Simulation and experimental results validate the new methodology.
Although Ru(II)-based agents are expected to be promising candidates for replacing Pt drugs, their in vivo biomedical applications are still limited by short excitation/emission wavelengths and unsatisfactory therapeutic efficiency. Herein, we rationally design a Ru(II) metallacycle, Ru1085, with excitation at 808 nm and emission beyond 1000 nm, which offers deep optical penetration (up to 6 mm) and enhanced chemo-phototherapy activity. In vitro studies indicate that Ru1085 exhibits prominent cellular uptake and desirable anticancer activity against various cancer cell lines, especially cisplatin-resistant A549 cells. Further studies reveal that Ru1085 induces mitochondria-mediated apoptosis along with S and G2/M phase cell cycle arrest. Finally, Ru1085 enables precise NIR-II fluorescence imaging-guided and long-term monitored chemo-phototherapy of A549 tumors with minimal side effects. We envision that the design of long-wavelength emissive metallacycles will offer new opportunities for metal-based agents in in vivo biomedical applications.
The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most well-known reinforcement learning methods. However, it is inefficient and unstable in practical applications, and the bias and variance of the Q estimate in the target function are sometimes difficult to control. This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm for these problems. We theoretically prove that the learning procedure with RUD makes better use of new data in the replay buffer than the traditional procedure. In addition, the low variance of the Q value in RUD is better suited to the current Clipped Double Q-learning strategy. We design a comparison experiment against previous methods, an ablation experiment against the original DDPG, and other analytical experiments in MuJoCo environments. The experimental results demonstrate the effectiveness and superiority of RUD.
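The abstract does not spell out RUD's update schedule, so the sketch below shows only the Clipped Double Q-learning target it builds on, assuming PyTorch; all names and the discount factor are placeholders:

```python
import torch

@torch.no_grad()
def clipped_double_q_target(reward, not_done, next_state,
                            target_actor, target_q1, target_q2, gamma=0.99):
    """Bootstrap target using the minimum of two target critics."""
    next_action = target_actor(next_state)
    q1 = target_q1(next_state, next_action)
    q2 = target_q2(next_state, next_action)
    # Taking the minimum of the two critics curbs overestimation bias,
    # which is where a low-variance Q estimate pays off.
    return reward + gamma * not_done * torch.min(q1, q2)
```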