NEBULA Singh, Sonali; Sarma, Anup; Jao, Nicholas ...
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA),
05/2020
Conference Proceeding
Brain-inspired cognitive computing has so far followed two major approaches - one uses multi-layered artificial neural networks (ANNs) to perform pattern-recognition-related tasks, whereas the other uses spiking neural networks (SNNs) to emulate biological neurons in an attempt to be as efficient and fault-tolerant as the brain. While there has been considerable progress in the former area due to a combination of effective training algorithms and acceleration platforms, the latter is still in its infancy due to the lack of both. SNNs have a distinct advantage over their ANN counterparts in that they are capable of operating in an event-driven manner, thus consuming very low power. Several recent efforts have proposed various SNN hardware design alternatives, however, these designs still incur considerable energy overheads.
In this context, this paper proposes a comprehensive design spanning the device, circuit, architecture and algorithm levels to build an ultra-low-power architecture for SNN and ANN inference. For this, we use spintronics-based magnetic tunnel junction (MTJ) devices that have been shown to function as both neuro-synaptic crossbars and thresholding neurons and can operate at ultra-low voltage and current levels. Using this MTJ-based neuron model and synaptic connections, we design a low-power chip that has the flexibility to be deployed for inference of SNNs, ANNs, as well as SNN-ANN hybrid networks - a distinct advantage over prior works. We demonstrate the competitive performance and energy efficiency of the SNNs as well as hybrid models on a suite of workloads. Our evaluations show that the proposed design, NEBULA, is up to 7.9x more energy-efficient than a state-of-the-art design, ISAAC, in the ANN mode. In the SNN mode, our design is about 45x more energy-efficient than a contemporary SNN architecture, INXS. Power comparison between NEBULA ANN and SNN modes indicates that the latter is at least 6.25x more power-efficient for the observed benchmarks.
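The low-power, event-driven operation described above can be illustrated with a minimal leaky integrate-and-fire (LIF) neuron in software. This is a generic sketch of the SNN neuron model, not the paper's MTJ-based thresholding neuron; the time constant and threshold are illustrative assumptions.

```python
# Minimal leaky integrate-and-fire (LIF) neuron sketch illustrating the
# event-driven operation that gives SNNs their low-power advantage.
# Parameters (tau, v_th) are illustrative, not taken from NEBULA.

def lif_step(v, input_current, tau=0.9, v_th=1.0):
    """Advance the membrane potential one timestep; return (new_v, spiked)."""
    v = tau * v + input_current   # leaky integration
    if v >= v_th:                 # threshold crossing -> emit a spike
        return 0.0, True          # reset potential after the spike
    return v, False

# A neuron only produces events (spikes) when its input drives it past
# threshold; with zero input, no downstream work is generated at all.
v, spikes = 0.0, []
for i_in in [0.4, 0.4, 0.4, 0.0, 0.0]:
    v, s = lif_step(v, i_in)
    spikes.append(s)
print(spikes)  # [False, False, True, False, False]
```

Because computation is triggered only by spikes, idle inputs cost nothing, which is the source of the power gap between the SNN and ANN modes reported above.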
We perform a simulation-based analysis of the potential of emerging ferroelectric tunnel junctions (FTJs) as a memory device for crossbar arrays. Though FTJs are promising due to their low-power switching characteristics compared to other emerging technologies, the greatest challenge for FTJs is the tradeoff between integration density and read performance. Our analysis highlights the need to co-optimize the ferroelectric thickness of the FTJ and the read/write voltages to achieve proper functionality at large array sizes. Our analysis shows that an FTJ-based crossbar achieves 93% higher sense margin at an isoread power of 116 nW (per bit), but this FTJ design comes at a cost of 9.28x higher write power at an isowrite time of 250 ns. In response, we study the potential tradeoffs of design points outside the feasible region to understand what device characteristics are desired to overcome such challenges.
Conventional processors suffer from high access latency and power dissipation due to the memory bandwidth demands of data-intensive workloads, such as machine learning and analytics. In-memory computing support across various memory technologies has provided formidable improvements in performance and energy for such workloads, alleviating the repeated accesses and data movement between the CPU and storage. While many processing-in-memory (PIM) works have been proposed to efficiently compute dot products using Kirchhoff's law, such solutions are unsuitable for many analytics workloads where the working data is too large and too sparse to store efficiently in memory. This article focuses closely on the peripheral circuit design for diode-selected crossbars and configures the compute-embedded fabric to efficiently compute sparse matrix-vector multiplication (SpMV). On average, our proposed end-to-end SpMV accelerator achieves 7.7x speedup and 4.9x energy savings compared to the state-of-the-art Fulcrum.
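For reference, the SpMV operation the accelerator above targets can be written in a few lines over the standard compressed sparse row (CSR) format. This is a plain software sketch of the kernel itself, not the crossbar mapping or peripheral circuitry described in the article.

```python
# Reference sparse matrix-vector multiply (SpMV) over a CSR matrix --
# the kernel the accelerator computes in-memory. Only stored nonzeros
# are touched, which is why sparsity defeats dense dot-product PIM.

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a CSR matrix A given as (values, col_idx, row_ptr)."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # one MAC per stored nonzero
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
values, col_idx, row_ptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The work per row is proportional to its nonzero count, so a dense in-memory dot product over mostly-zero rows wastes both storage and energy; that asymmetry motivates the sparse-aware peripheral design.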
We present an extensive analysis of functional-oxide-based selector devices for cross-point memories, from materials through arrays. We describe the design constraints required for proper functionality of a cross-point array and translate these constraints into figures of merit for the selector materials. The proposed figures of merit, related to the resistivities of the functional oxide in the metallic and insulating states and the critical current densities for insulator-metal transitions, determine whether or not a functional oxide is suitable as a selector for a given memory technology. Our analysis shows the importance of co-optimizing the selector length with the read/write voltages and establishes the range of these parameters for proper functionality. We also perform an extensive material-space analysis for the selector, relating the selector properties to the achievable array metrics. For instance, we show that an optimized memory array with a single-crystal VO2-based selector and a spin-memory element achieves ~25 μA sense margin with ~30% read disturb margin and 40 ns write time. The leakage in the half-accessed cell can be as low as 15 μW. The design principles established in this work will provide guidelines for future exploration of functional oxides for selector applications as well as for the optimization of cross-point arrays.
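The idea of translating material properties into a pass/fail figure of merit can be sketched as a simple screening function. The criterion below (insulating-to-metallic resistivity ratio compared against the number of half-selected cells, with an assumed 10x margin) is a deliberately crude stand-in for the full array-level analysis; the function name, threshold, and numbers are all illustrative.

```python
# Illustrative screening test for a functional-oxide selector: does the
# ratio of insulating- to metallic-state resistivity suppress sneak
# currents for a given array size? The 10x margin is an assumption for
# this sketch, not a figure of merit from the paper.

def selector_ok(rho_insulating, rho_metallic, array_rows, array_cols):
    """Crude suitability check: resistivity ratio vs. the number of
    half-selected cells sharing the accessed row and column."""
    selectivity = rho_insulating / rho_metallic
    half_selected = (array_rows - 1) + (array_cols - 1)
    return selectivity > 10 * half_selected  # assumed safety margin

print(selector_ok(1e2, 1e-3, 256, 256))  # high-selectivity oxide -> True
print(selector_ok(1e0, 1e-1, 256, 256))  # insufficient ratio -> False
```

A real analysis, as the abstract notes, must additionally account for the critical current densities of the insulator-metal transition and co-optimize selector length with read/write voltages.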
There is an ongoing trend to increasingly offload inference tasks, such as CNNs, to edge devices in many IoT scenarios. As energy harvesting is an attractive IoT power source, recent ReRAM-based CNN accelerators have been designed for operation on harvested energy. When addressing the instability problems of harvested energy, prior optimization techniques often assume that the load is fixed, overlooking the close interactions among input power, computational load, and circuit efficiency, or adapt the dynamic load to match the just-in-time incoming power under a simple harvesting architecture with no intermediate energy storage. Targeting a more efficient harvesting architecture equipped with both energy storage and energy delivery modules, this paper is the first effort to target whole-system, end-to-end efficiency for an energy harvesting ReRAM-based accelerator. First, we model the relationships among ReRAM load power, DC-DC converter efficiency, and power failure overhead. Then, a maximum computation progress tracking scheme (MaxTracker) is proposed to achieve a joint optimization of the whole system by tuning the load power of the ReRAM-based accelerator. Specifically, MaxTracker accommodates both continuous and intermittent computing schemes and provides dynamic ReRAM load according to harvesting scenarios. We evaluate MaxTracker over four input power scenarios, and the experimental results show average speedups of 38.4%/40.3% (up to 51.3%/84.4%) over a full activation scheme (with energy storage), and order-of-magnitude speedups over the recently proposed (energy storage-less) ResiRCA technique. Furthermore, we also explore MaxTracker in combination with the Capybara reconfigurable capacitor approach to offer more flexible tuners and thus further boost the system performance.
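The core interaction being modeled, that the power drawn from the harvester depends on both the load and the load-dependent converter efficiency, can be shown with a toy optimizer. The efficiency curve, candidate load levels, and the "highest sustainable load" policy below are illustrative assumptions, not the MaxTracker scheme itself.

```python
# Toy sketch of the joint-optimization idea: pick the ReRAM load power
# that maximizes computation progress while accounting for a
# load-dependent DC-DC converter efficiency. All curves and numbers
# are illustrative assumptions, not MaxTracker.

def converter_eff(p_load_mw):
    """Assumed efficiency curve that peaks at a mid-range load."""
    return max(0.5, 0.9 - 0.002 * abs(p_load_mw - 100))

def best_load(p_harvest_mw, candidates):
    """Choose the highest load sustainable by harvested income:
    the input power drawn is p_load / efficiency(p_load)."""
    feasible = [p for p in candidates
                if p / converter_eff(p) <= p_harvest_mw]
    return max(feasible, default=0)

loads = [25, 50, 75, 100, 125, 150]
print(best_load(120, loads))  # 100: 125 mW would draw 147 mW of input
```

Note that a naive scheme matching load directly to income would pick 120 mW and brown out, because converter losses make the drawn power exceed the harvested power; modeling the efficiency curve is what makes the choice safe.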
Recently, Memory Augmented Neural Networks (MANNs), a class of Deep Neural Networks (DNNs), have become prominent owing to their ability to capture long-term dependencies effectively in several Natural Language Processing (NLP) tasks. These networks augment conventional DNNs by incorporating memory and attention mechanisms external to the network to capture relevant information. Several MANN architectures have shown particular benefits in NLP tasks by augmenting an underlying Recurrent Neural Network (RNN) with external memory using attention mechanisms. Unlike conventional DNNs, whose computational time is dominated by MAC operations, MANNs have more diverse behavior. In addition to MACs, the attention mechanisms of MANNs also consist of operations such as similarity measures, sorting, weighted memory access, and pair-wise arithmetic. Due to this greater diversity of operations, MANNs are not trivially accelerated by the same techniques used in existing DNN accelerators. In this work, we present an end-to-end hardware accelerator architecture, FARM, for the inference of RNNs and several variants of MANNs, such as the Differential Neural Computer (DNC), Neural Turing Machine (NTM) and Meta-learning model. FARM achieves an average speedup of 30x-190x and 80x-100x over CPU and GPU implementations, respectively. To address remaining memory bottlenecks in FARM, we then propose the FARM-PIM architecture, which augments FARM with in-memory compute support for MAC and content-similarity operations in order to reduce data traversal costs. FARM-PIM offers an additional speedup of 1.5x compared to FARM. Additionally, we consider an efficiency-oriented version of the PIM implementation, FARM-PIM-LP, that trades a 20% performance reduction relative to FARM for a 4x average power consumption reduction.
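The content-similarity operation mentioned above, central to NTM/DNC-style addressing, combines a similarity measure, a softmax, and a weighted memory read. The pure-Python sketch below shows the generic form of such a content-based read; the memory contents, key, and key-strength value are illustrative, and this is not FARM's hardware datapath.

```python
# Sketch of a content-based memory read of the kind used by MANN
# attention: cosine similarity against every memory row, a softmax over
# the scaled similarities, then a weighted sum over memory rows.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def content_read(memory, key, beta=5.0):
    """memory: list of rows; key: query vector; beta: key strength."""
    sims = [beta * cosine(row, key) for row in memory]      # similarity
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]                  # stable softmax
    w = [e / sum(exps) for e in exps]                       # read weights
    return [sum(wi * row[j] for wi, row in zip(w, memory))  # weighted read
            for j in range(len(memory[0]))]

memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
r = content_read(memory, [1.0, 0.1])
print(r)  # dominated by the first row, which best matches the key
```

Each read touches every memory row, which is why the abstract singles out content similarity (alongside MACs) as a candidate for in-memory support in FARM-PIM.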
Many recent works have shown substantial efficiency boosts from performing inference tasks on Internet of Things (IoT) nodes rather than merely transmitting raw sensor data. However, such tasks, e.g., convolutional neural networks (CNNs), are very compute intensive. They are therefore challenging to complete at sensing-matched latencies in ultra-low-power and energy-harvesting IoT nodes. ReRAM crossbar-based accelerators (RCAs) are an ideal candidate to perform the dominant multiplication-and-accumulation (MAC) operations in CNNs efficiently, but conventional, performance-oriented RCAs, while energy-efficient, are power hungry and ill-optimized for the intermittent and unstable power supply of energy-harvesting IoT nodes. This paper presents the ResiRCA architecture that integrates a new, lightweight, and configurable RCA suitable for energy harvesting environments as an opportunistically executing augmentation to a baseline sense-and-transmit battery-powered IoT node. To maximize ResiRCA throughput under different power levels, we develop the ResiSchedule approach for dynamic RCA reconfiguration. The proposed approach uses loop tiling-based computation decomposition, model duplication within the RCA, and inter-layer pipelining to reduce RCA activation thresholds and more closely track execution costs with dynamic power income. Experimental results show that ResiRCA together with ResiSchedule achieve average speedups and energy efficiency improvements of 8x and 14x respectively compared to a baseline RCA with intermittency-unaware scheduling.
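The effect of lowering activation thresholds via tiling can be shown with a toy scheduler: instead of waiting for enough power to run the whole array, the accelerator activates however many tiles the current budget sustains. The per-tile power figure, tile count, and power trace below are illustrative assumptions, not measurements from ResiRCA.

```python
# Toy illustration of intermittency-aware scheduling: the crossbar is
# decomposed into tiles (via loop tiling), and only as many tiles run
# as the instantaneous power budget can sustain. Numbers are assumed.

TILE_POWER_MW = 5.0   # assumed power to keep one tile active

def tiles_to_activate(available_power_mw, total_tiles):
    """Run the largest tile count the current power budget sustains."""
    return min(total_tiles, int(available_power_mw // TILE_POWER_MW))

# As harvested power fluctuates, the active-tile count tracks it rather
# than stalling until full-array power (8 tiles = 40 mW) is available.
trace = [3.0, 12.0, 27.0, 55.0, 9.0]
print([tiles_to_activate(p, 8) for p in trace])  # [0, 2, 5, 8, 1]
```

An intermittency-unaware baseline would make progress only during the one interval with >= 40 mW; the tiled schedule converts most of the other intervals into partial progress as well.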
With data volume growing exponentially in today's era, modern computing systems are increasingly bottlenecked and consistently burdened by the costs of data movement. Driven by the development of emerging non-volatile memory (NVM) technologies and by the increasing demand for high throughput in big data applications, considerable research effort has gone into embedding computing in memory and exploiting parallelism in data-intensive workloads to address the "memory wall" bottleneck. In this work, we propose a non-volatile memory design which leverages run-time reconfigurability of peripheral circuits to perform various in-memory computations like that of a field-programmable gate array (FPGA). Our architecture allows this intelligent storage system to operate as both a main memory and an accelerator for memory-intensive applications such as matrix multiplication, database query and artificial neural networks.
While appealing for its high integration density, the cross-point architecture suffers from leakage through sneak paths across the array. The leakage current flowing through half-accessed and, in some cases, unaccessed cells (and the corresponding leakage power) is an important determinant of array performance. Proper estimation of these components is computationally challenging and often demands rigorous simulation efforts. This paper presents a computationally efficient compact model to assess the leakage in cross-point arrays employing threshold-switch selectors. We provide the closed-form mathematical expressions that govern our model and explain the derivation methodologies. We analyze and verify the validity of the model by cross-checking against results from conventional rigorous array simulations. The model shows excellent matching (~99% accuracy) with rigorous simulations for different array sizes (16×16 through 256×256). The model has been tested over wide ranges of selector OFF resistance (0.1 MΩ to 1 GΩ), interconnect resistance (1 mΩ/□ to 10 Ω/□) and access voltage (0.2 V to 1 V). The test results from the model agree closely with those obtained from intensive array simulations.
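To see where half-selected leakage comes from, consider the zeroth-order estimate below: the 2(n-1) cells sharing the accessed row or column each see roughly half the access voltage across a selector that remains OFF. This back-of-the-envelope sketch ignores interconnect resistance and the unselected-cell paths that the paper's compact model does capture, so it only bounds the simplest component.

```python
# Back-of-the-envelope estimate of half-selected-cell leakage in an
# n x n cross-point array with threshold-switch selectors. Cells on the
# accessed row or column see ~V/2 across an OFF selector. Interconnect
# resistance and unselected-cell sneak paths are ignored here.

def half_selected_leakage_w(n, v_access, r_selector_off):
    """Total leakage power of the 2*(n-1) half-selected cells, in watts."""
    v_half = v_access / 2.0
    per_cell = v_half ** 2 / r_selector_off     # P = V^2 / R per cell
    return 2 * (n - 1) * per_cell

# Example: 256 x 256 array, 1 V access, 100 MOhm OFF-state selector
p = half_selected_leakage_w(256, 1.0, 1e8)
print(f"{p * 1e6:.3g} uW")
```

The quadratic dependence on the half-select voltage and the linear dependence on array size show why leakage, not cell count alone, limits practical array dimensions, and why a closed-form model is valuable for sweeping OFF resistance and access voltage over the wide ranges quoted above.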
Monolithic 3D Enabled Processing-in-SRAM Memory Narayanan, Vijaykrishnan; Challapalle, Nagadastagiri; Okafor, Ikenna ...
2020 China Semiconductor Technology International Conference (CSTIC),
2020-June-26
Conference Proceeding
This work will provide an overview of recent advances in enabling SRAM-based compute fabrics leveraging monolithic 3D (M3D) integration. It will highlight that the fine-grained connectivity enabled by M3D makes it possible to embed computation close to the memory cells, significantly reducing data transfer costs. The application-level benefits to emerging workloads will also be presented.