The authors designed an accelerator architecture for large-scale neural networks, with an emphasis on the impact of memory on accelerator design, performance, and energy. In this article, they present a concrete design at 65 nm that can perform 496 16-bit fixed-point operations in parallel every 1.02 ns, that is, 452 GOP/s, in a 3.02 mm², 485 mW footprint (excluding main memory accesses).
Regularized system identification of linear time-invariant systems in the presence of outliers is investigated. The finite impulse response (FIR) model and the Gaussian scale mixture are chosen as the system model and the noise model, respectively. Two special cases of the noise model are considered: the well-known Student's t distribution and a proposed G-confluent distribution. Both the FIR model parameters and the latent variables in the noise model are treated as parameters of our statistical model. Moreover, the scale of the noise variance is treated as a hyper-parameter, in addition to the hyper-parameters used to parameterize the priors of the impulse response and the latent variables. A variational expectation–maximization algorithm is then proposed for inference of the parameters and hyper-parameters of the statistical model, and the algorithm is guaranteed to converge to a stationary point. Monte Carlo numerical simulations show that when the relative size of outliers is small, the proposed approach performs comparably to a state-of-the-art method, and when the relative size of outliers and/or the occurrence probability of outliers is large, the proposed approach outperforms the state-of-the-art method.
Reproducing kernel Hilbert spaces (RKHSs) have proved themselves to be key tools for the development of powerful machine learning algorithms, the so-called regularized kernel-based approaches. Recently, they have also inspired the design of new linear system identification techniques able to challenge classical parametric prediction error methods. These facts motivate the study of RKHS theory within the control community. In this note, we focus on the characterization of stable RKHSs, i.e., RKHSs of functions representing stable impulse responses. Related to this, working in an abstract functional analysis framework, Carmeli et al. (2006) have provided conditions for an RKHS to be contained in the classical Lebesgue spaces ℒ^p. In particular, we specialize this analysis to the discrete-time case with p = 1. The necessary and sufficient conditions for the stability of an RKHS are worked out by a quite simple proof, more easily accessible to the control community.
System identification is a mature research area with well-established paradigms, mostly based on classical statistical methods. Recently, there has been considerable interest in so-called kernel-based regularisation methods applied to system identification problems. The recent literature on this is extensive and at times difficult to digest. The purpose of this contribution is to provide an accessible account of the main ideas and results of kernel-based regularisation methods for system identification. The focus is to assess the impact of these new techniques on the field and traditional paradigms.
An Accelerator for High Efficient Vision Processing
Zidong Du; Shaoli Liu; Fasthuber, Robert ...
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Feb. 2017, Volume 36, Issue 2
Journal Article
Peer reviewed
In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these applications are convolutional neural networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses, combined with a careful exploitation of the specific data access patterns within CNNs, allows us to design an accelerator which is highly energy-efficient. We present a single-core implementation down to the layout at 65 nm, with a modest footprint of 5.94 mm² and consuming only 336 mW, but still about 30× faster than high-end GPUs. For visual processing with higher resolution and frame-rate requirements, we further present a multicore implementation with elevated performance.
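As a rough illustration of why weight sharing shrinks the memory footprint, the sketch below counts the weights of one convolutional layer and of a hypothetical layer with the same connectivity but private weights per output neuron. The layer dimensions (6 input maps, 16 output maps, 5×5 kernels, 28×28 outputs) and the 16-bit weight width are assumptions chosen for the example; they are not taken from the paper.

```python
def shared_weight_count(in_maps, out_maps, kernel):
    """Weights of a convolutional layer: one kernel per (input map, output map) pair."""
    return in_maps * out_maps * kernel * kernel


def private_weight_count(in_maps, out_maps, kernel, out_size):
    """Same connectivity, but every output neuron keeps its own copy of the weights."""
    neurons = out_maps * out_size * out_size
    weights_per_neuron = in_maps * kernel * kernel
    return neurons * weights_per_neuron


BYTES_PER_WEIGHT = 2  # assumed 16-bit weights, for illustration only

# Hypothetical layer shape, not from the paper.
shared = shared_weight_count(in_maps=6, out_maps=16, kernel=5)
private = private_weight_count(in_maps=6, out_maps=16, kernel=5, out_size=28)

print("shared weights: ", shared, "->", shared * BYTES_PER_WEIGHT, "bytes")
print("private weights:", private, "->", private * BYTES_PER_WEIGHT, "bytes")
print("reduction factor:", private // shared)
```

In this toy layer the reduction factor equals the number of output positions per map, since each kernel is reused at every position.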
A long-standing problem for kernel-based regularization methods is their high computational complexity O(N³), where N is the number of data points. In this paper, we make a breakthrough for this problem. In particular, we show that it is possible to design general semiseparable kernels through either the system theory perspective or the machine learning perspective, leading to semiseparable simulation-induced kernels or amplitude modulated locally stationary kernels, respectively. Moreover, for many frequently used test input signals in automatic control, and by exploring the semiseparable structure of a kernel and the corresponding output kernel, their computational complexity, without any approximations, can be lowered to O(Nq²) or O(Nq³), where q is the semiseparability rank of the output kernel that only depends on the chosen kernel and the input signal. Numerical simulation shows that the proposed implementation can be 10⁴ times faster than a state-of-the-art implementation.
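To make the O(N³) cost concrete, here is a minimal sketch of a dense kernel-regularized FIR estimate, where the expensive step is solving an N×N linear system. It is only the naive baseline, not the semiseparable algorithm of the paper. The tuned/correlated (TC) kernel, the FIR order, the hyper-parameter values, the synthetic data, and all function names are assumptions made for illustration.

```python
import numpy as np


def tc_kernel(n, c=1.0, lam=0.8):
    """TC kernel P[i, j] = c * lam**max(i, j) for lags 1..n (assumed kernel choice)."""
    idx = np.arange(1, n + 1)
    return c * lam ** np.maximum.outer(idx, idx)


def regularized_fir(Phi, y, P, sigma2):
    """Regularized estimate P Phi^T (Phi P Phi^T + sigma2 I)^{-1} y.

    The N x N dense solve below is the O(N^3) step.
    """
    N = Phi.shape[0]
    S = Phi @ P @ Phi.T + sigma2 * np.eye(N)
    return P @ Phi.T @ np.linalg.solve(S, y)


# Synthetic, noise-free example (all values hypothetical).
rng = np.random.default_rng(0)
n, N = 20, 200
g_true = 0.8 ** np.arange(1, n + 1)          # true impulse response
u = rng.standard_normal(N + n)               # white-noise input
# Row t holds the n most recent past inputs, so y[t] = sum_k g[k] * u[t - k].
Phi = np.array([[u[n + t - 1 - k] for k in range(n)] for t in range(N)])
y = Phi @ g_true

g_hat = regularized_fir(Phi, y, tc_kernel(n), sigma2=1e-4)
print("max abs error:", np.max(np.abs(g_hat - g_true)))
```

Because the solve works on an N×N matrix regardless of the FIR order n, doubling the data length multiplies the cost of this step by roughly eight, which is the scaling the structured approach is meant to avoid.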
In this paper, a new particle filter (PF), which we refer to as the decentralized PF (DPF), is proposed. By first decomposing the state into two parts, the DPF splits the filtering problem into two nested subproblems and then handles the two nested subproblems using PFs. The DPF has the advantage over the regular PF that it can increase the level of parallelism of the PF. In particular, part of the resampling in the DPF bears a parallel structure and can thus be implemented in parallel. The parallel structure of the DPF is created by decomposing the state space, differing from the parallel structure of the distributed PFs, which is created by dividing the sample space. This difference results in a couple of unique features of the DPF in contrast with the existing distributed PFs. Simulation results of two examples indicate that the DPF has the potential to achieve the same level of performance as the regular PF in a shorter execution time.
When applied to the consensus tracking of repetitive leader-follower multiagent systems (MASs), most existing distributed iterative learning control (DILC) methods assume that the dynamics of agents are exactly known or up to the affine form. In this article, we study a more general case where the dynamics of agents are unknown, nonlinear, nonaffine, and heterogeneous, and the communication topologies can be iteration-varying. More specifically, we first apply the controller-based dynamic linearization method in the iteration domain to obtain a parametric learning controller using only the local input-output data collected from neighboring agents in a directed graph, and then propose a data-driven distributed adaptive iterative learning control (DAILC) method through the parameter-adaptive learning methods. We show that, for each time instant, the tracking error is ultimately bounded in the iteration domain for both iteration-invariant and iteration-varying communication topologies. The simulation results show that the proposed DAILC method has faster convergence speed, higher tracking accuracy, and more robust learning and tracking in comparison with a typical DAILC method.
Network-on-Chip (NoC) is a promising replacement for bus architectures due to its better scalability. In state-of-the-art NoCs, each packet contains several fixed-length flits, which facilitates the allocation of network resources but introduces many unused bits. In this paper, we propose a novel technique called Stealth-ACK to effectively address this problem. Stealth-ACK leverages unused bits in head flits of non-ACK packets to carry and stealthily transmit ACK information. Such stealth transmissions of ACK information effectively reduce not only the number of dedicated ACK packets on the NoC, but also the number of unused bits in head flits of non-ACK packets, which significantly reduces the waste of NoC bandwidth. Experimental results show that Stealth-ACK increases the throughput of a 16 × 16 2-D mesh NoC by 11.9% on average, and reduces the NoC latency by 34.8% on average, on application traces of SPLASH-2. Moreover, Stealth-ACK only requires trivial hardware modification to basic router architectures, which incurs negligible power consumption and area cost.
Verifying the execution of a parallel program against a given memory consistency model (memory consistency verification) is a crucial problem in the functional validation of Chip Multiprocessors (CMPs). In the absence of additional information, this problem is known to be NP-hard. By adopting the pending period information, this paper proposes the first linear-time software-based approach to memory consistency verification. Our approach relies on a novel technique called reusable cycle checking, which reuses the previous order information when repeatedly checking cycles at different frontiers. In the context of pending period information, this technique significantly reduces the overall computational costs required by cycle checking, enabling linear-time (in the number of memory operations) memory consistency verification for any given multicore system with a constant number of processors. From a practical perspective, an industrial memory consistency verification tool, named XCHECK, has been developed based on our approach. XCHECK requires neither test program constraints nor dedicated hardware support in postsilicon verification of many multiprocessor systems. Experimental results show that XCHECK is 3-10 times faster than a state-of-the-art software-based approach. XCHECK has been integrated into the verification platforms for an industrial multicore processor, Godson-3B, and has found several bugs in the design.
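For readers unfamiliar with cycle checking, the sketch below shows the general idea in its simplest form: memory operations are nodes, ordering constraints are directed edges, and a cycle means no single global order can explain the observed execution. This is a plain from-scratch topological-sort check, not the reusable cycle checking technique or the pending-period analysis described in the abstract. The message-passing example, the sequential-consistency edge rules, and all names are assumptions for illustration.

```python
from collections import defaultdict, deque


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle.

    Uses Kahn's topological sort: nodes left unvisited lie on or behind a cycle.
    """
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1
        nodes.update((src, dst))

    ready = deque(v for v in nodes if in_degree[v] == 0)
    visited = 0
    while ready:
        v = ready.popleft()
        visited += 1
        for w in successors[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                ready.append(w)
    return visited != len(nodes)


# Hypothetical execution: P0 writes x then y; P1 reads y == 1 then x == 0.
program_order = [("W x=1", "W y=1"), ("R y=1", "R x=0")]
reads_from = [("W y=1", "R y=1")]     # the read of y observed P0's write
from_read = [("R x=0", "W x=1")]      # the read of x saw the value before P0's write

violating = program_order + reads_from + from_read
consistent = program_order + reads_from

print("violation detected:", has_cycle(violating))
print("violation detected:", has_cycle(consistent))
```

Each call rebuilds and traverses the whole graph, which is the repeated work that an incremental scheme would aim to avoid.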