Silicon-based static random access memories (SRAM) and digital Boolean logic have been the workhorses of state-of-the-art computing platforms. Despite tremendous strides in scaling the ubiquitous metal-oxide-semiconductor transistor, the underlying von-Neumann computing architecture has remained unchanged. The limited throughput and energy efficiency of state-of-the-art computing systems, to a large extent, result from the well-known von-Neumann bottleneck. The energy and throughput inefficiency of von-Neumann machines has been accentuated in recent times by the present emphasis on data-intensive applications such as artificial intelligence, machine learning, and cryptography. A possible approach toward mitigating the overhead associated with the von-Neumann bottleneck is to enable in-memory Boolean computations. In this paper, we present an augmented version of the conventional SRAM bit-cell, called the X-SRAM, with the ability to perform in-memory, vector Boolean computations in addition to the usual memory storage operations. We propose at least six different schemes for enabling in-memory vector computations, including NAND, NOR, IMP (implication), and XOR logic gates, with respect to different bit-cell topologies: the 8T cell and the 8+T differential cell. In addition, we present a novel 'read-compute-store' scheme, wherein the computed Boolean function can be directly stored in the memory without the need for latching the data and carrying out a subsequent write operation. The feasibility of the proposed schemes has been verified using predictive transistor models and detailed Monte-Carlo variation analysis. As an illustration, we also present the efficacy of the proposed in-memory computations by implementing the advanced encryption standard algorithm on a modified von-Neumann machine wherein the conventional SRAM is replaced by X-SRAM.
Our simulations indicated that up to 75% of memory accesses can be saved using the proposed techniques.
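As a behavioral illustration, the vector Boolean operations named above (NAND, NOR, IMP, XOR) can be sketched in software. This is a minimal sketch that models only the logical result of combining two stored rows, not the bit-cell circuits or sensing schemes; the 8-bit word width and function names are illustrative assumptions.

```python
WORD_MASK = 0xFF  # hypothetical 8-bit word width

def vector_nand(row_a: int, row_b: int) -> int:
    # Bitwise NAND across two rows of the array
    return ~(row_a & row_b) & WORD_MASK

def vector_nor(row_a: int, row_b: int) -> int:
    # Bitwise NOR across two rows
    return ~(row_a | row_b) & WORD_MASK

def vector_imp(row_a: int, row_b: int) -> int:
    # IMP(a, b) = (NOT a) OR b, computed bitwise
    return (~row_a | row_b) & WORD_MASK

def vector_xor(row_a: int, row_b: int) -> int:
    # Bitwise XOR across two rows
    return (row_a ^ row_b) & WORD_MASK
```

In the 'read-compute-store' scheme described above, the result of such an operation would be written back to a third row without an intermediate latch-and-write sequence.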
Deep neural networks are a biologically inspired class of algorithms that have recently demonstrated state-of-the-art accuracy in large-scale classification and recognition tasks. Hardware acceleration of deep networks is of paramount importance to ensure their ubiquitous presence in future computing platforms. Indeed, a major enabler of efficient hardware accelerators for deep networks is the recent advance from the machine learning community demonstrating the viability of aggressively scaled deep binary networks. In this paper, we demonstrate how deep binary networks can be accelerated in modified von Neumann machines by enabling binary convolutions within the static random access memory (SRAM) arrays. In general, binary convolutions consist of bit-wise exclusive-NOR (XNOR) operations followed by a population count (popcount). We present two proposals: one based on a charge-sharing approach to perform vector XNOR and approximate popcount, and another based on bit-wise XNOR followed by a digital bit-tree adder for accurate popcount. We highlight the various trade-offs in terms of circuit complexity, speed-up, and classification accuracy for both approaches. Key techniques presented in this manuscript include the use of a low-precision, low-overhead analog-to-digital converter (ADC) to achieve a fairly accurate popcount in the charge-sharing scheme, and sectioning of the SRAM array by adding switches onto the read-bitlines, thereby achieving improved parallelism. Our results on the benchmark image classification datasets CIFAR-10 and SVHN for a binarized neural network architecture show energy improvements of up to 6.1× and 2.3× for the two proposals, compared to conventional SRAM banks.
In terms of latency, improvements of up to 15.8× and 8.1× were achieved for the two respective proposals.
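The XNOR-plus-popcount primitive described above can be sketched in a few lines, assuming weights and activations in a ±1 encoding are packed into integer bit masks (bit 1 for +1, bit 0 for −1). This models only the arithmetic; the charge-sharing circuits, ADC, and bit-tree adder of the two proposals are not represented.

```python
def xnor_popcount(weights: int, activations: int, width: int) -> int:
    # Bitwise XNOR marks the positions where the two binary vectors
    # agree; popcount then counts those agreements.
    matches = ~(weights ^ activations) & ((1 << width) - 1)
    return bin(matches).count("1")

def binary_dot(weights: int, activations: int, width: int) -> int:
    # For a +/-1 encoding, the dot product equals
    # (number of agreements) - (number of disagreements).
    pop = xnor_popcount(weights, activations, width)
    return 2 * pop - width
```

For example, `binary_dot(0b1011, 0b1001, 4)` corresponds to the dot product of [+1, −1, +1, +1] and [+1, −1, −1, +1], which is 2.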
Recently, the exponential increase in compute requirements demanded by emerging applications such as artificial intelligence and the Internet of things has rendered state-of-the-art von-Neumann machines inefficient in terms of energy and throughput, owing to the well-known von-Neumann bottleneck. A promising approach to mitigating the bottleneck is to perform computations as close to the memory units as possible. One extreme possibility is in-situ Boolean logic computation using stateful devices, i.e., devices that act simultaneously as compute engine and storage. We propose such stateful, vector, in-memory operations using the voltage-controlled magnetic anisotropy (VCMA) effect in magnetic tunnel junctions (MTJs). Our proposal is based on the well-known, manufacturable 1-transistor-1-MTJ bit-cell and requires no modifications to the bit-cell circuit or the magnetic device. Instead, we leverage the very physics of the VCMA effect to enable stateful computations. Specifically, we exploit the voltage asymmetry of the VCMA effect to construct a stateful IMP (implication) gate and use the precessional switching dynamics of VCMA devices to propose a massively parallel NOT operation. Further, we show that other gates such as AND, OR, NAND, NOR, and NIMP (complement of implication) can be implemented using multi-cycle operations.
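At the truth-table level, the multi-cycle gates mentioned above compose from the two stateful primitives, IMP and NOT. The sketch below checks only the Boolean logic of such compositions; it does not model the VCMA switching dynamics, cycle scheduling, or device state, and the particular gate decompositions shown are illustrative.

```python
def NOT(a: bool) -> bool:
    return not a

def IMP(a: bool, b: bool) -> bool:
    # Material implication: a -> b, i.e., (NOT a) OR b
    return (not a) or b

# Multi-cycle compositions built from IMP and NOT:
def NAND(a, b): return IMP(a, NOT(b))   # NAND(a,b) = IMP(a, NOT b)
def AND(a, b):  return NOT(NAND(a, b))
def OR(a, b):   return IMP(NOT(a), b)
def NOR(a, b):  return NOT(OR(a, b))
def NIMP(a, b): return NOT(IMP(a, b))   # a AND (NOT b)
```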
Achieving multi-level devices is crucial to efficiently emulate key bio-plausible functionalities such as synaptic plasticity and neuronal activity, and has become an important aspect of neuromorphic hardware development. In this review article, we focus on various ferromagnetic (FM) and ferroelectric (FE) devices capable of representing multiple states, and discuss the use of such multi-level devices for implementing neuromorphic functionalities. We elaborate on how the analog-like resistive states in ferromagnetic or ferroelectric thin films arise from non-coherent multi-domain switching dynamics, which is fundamentally different from most memristive materials involving electroforming processes or significant ion motion. Both the device fundamentals underlying multi-level states and exemplary implementations of neural functionalities built on various device structures are highlighted. In light of the non-destructive nature and relatively simple physical process of multi-domain switching, we envision that ferroic-based multi-state devices provide an alternative pathway toward energy-efficient implementation of neuro-inspired computing hardware, with potential advantages of high endurance and controllability.
In this era of nanoscale technologies, the inherent characteristics of some nonvolatile devices, such as resistive random access memory (ReRAM), phase-change material (PCM), and spintronics, can emulate stochastic functionalities. Traditionally, these devices have been engineered to suppress stochastic switching behavior, as it poses reliability concerns for memory storage and logic applications. However, leveraging the stochasticity in such devices has led to a renewed interest in hardware-software codesign of stochastic algorithms, since CMOS-based implementations of stochastic algorithms involve cumbersome circuitry to generate "stochastic bits." In this article, we consider two classes of problems: deep neural networks (DNNs) and combinatorial optimization. The rapidly growing demands of artificial intelligence (AI) have sparked an interest in energy-efficient implementations of large DNNs with binary representations of synaptic weights and neuronal activities. Stochasticity plays an important role in leveraging the benefits of these binary representations, leading to model compression and optimization during training. In combinatorial optimization, such as graph coloring or traveling salesman problems, stochastic algorithms such as the Ising computing model have been shown to be effective. These problems require exhaustive computational procedures, and the Ising model uses a natural annealing agent to achieve near-optimal solutions in a reasonable timescale without getting stuck in "local minima." We present a broad review of stochastic computing utilizing the stochastic switching characteristics of devices based on nanoscale nonvolatile technologies, and show how codesign of devices and algorithms can enable optimal solutions for both combinatorial problems and binary neural networks for local learning and inference.
Directly mapping the nonvolatile device characteristics to the stochastic algorithms without the need for storing the bits in a separate memory leads to efficient use of hardware.
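The Ising-model annealing mentioned above can be sketched in software. In this minimal sketch the stochastic switching that the nonvolatile devices would supply natively is emulated by a pseudorandom number generator, and the linear annealing schedule, step count, and coupling matrix are illustrative assumptions.

```python
import math
import random

def ising_anneal(J, steps=5000, T0=2.0):
    # Minimize E(s) = -sum_{i<j} J[i][j] * s_i * s_j over spins
    # s_i in {-1, +1} via stochastic single-spin flips.
    n = len(J)
    s = [random.choice((-1, 1)) for _ in range(n)]
    for t in range(steps):
        T = T0 * (1 - t / steps) + 1e-3      # anneal toward low temperature
        i = random.randrange(n)
        # Energy change from flipping spin i
        dE = 2 * s[i] * sum(J[i][j] * s[j] for j in range(n) if j != i)
        if dE < 0 or random.random() < math.exp(-dE / T):
            s[i] = -s[i]                     # accept the flip
    return s
```

With a ferromagnetic coupling matrix (all positive J), the annealer settles into the fully aligned ground state; in a device-based Ising machine, the thermal noise of the bit-cells would play the role of the software RNG here.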
Large-scale digital computing almost exclusively relies on the von Neumann architecture, which comprises separate units for storage and computation. The energy-expensive transfer of data from the memory units to the computing cores results in the well-known von Neumann bottleneck. Various approaches aimed at bypassing this bottleneck are being extensively explored in the literature. These include in-memory computing based on CMOS and beyond-CMOS technologies, wherein, by modifying the memory array, vector computations can be carried out as close to the memory units as possible. In-memory techniques based on CMOS technology are of special importance due to the ubiquitous presence of field-effect transistors and the resultant ease of large-scale manufacturing and commercialization. On the other hand, perhaps the most important computation required for applications such as machine learning is the dot-product operation. Emerging nonvolatile memristive technologies have been shown to be very efficient at computing analog dot products in an in situ fashion. Memristive analog computation of the dot product is much faster than digital vector in-memory bitwise Boolean computation. However, challenges with respect to large-scale manufacturing, coupled with the limited endurance of memristors, have hindered rapid commercialization of memristive computing solutions. In this paper, we show that the standard 8-transistor (8T) digital SRAM array can be configured as an analog-like in-memory multibit dot-product engine (DPE). By applying appropriate analog voltages to the read ports of the 8T SRAM array and sensing the output current, an approximate analog-digital DPE can be implemented. We present two different configurations for enabling multibit dot-product computations in the 8T SRAM cell array, without modifying the standard bit-cell structure.
We also demonstrate the robustness of the proposal in the presence of nonidealities such as line resistances and transistor threshold-voltage variations. Since our proposal preserves the standard 8T-SRAM array structure, it can be used as a storage element with standard read-write instructions and also as an on-demand analog-like dot-product accelerator.
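The approximate analog-digital dot product described above can be modeled behaviorally: column currents sum contributions proportional to the applied input and the stored weight, and a low-precision ADC digitizes the result. This is an idealized sketch (linear transconductance, no line resistance, uniform quantization); the parameter names and 4-bit ADC resolution are illustrative assumptions.

```python
def sram_dot_product(weights, inputs, adc_levels=16, full_scale=None):
    # Analog accumulation: each bit-cell contributes (input * weight)
    # current onto a shared line, summed implicitly by Kirchhoff's law.
    analog = sum(w * x for w, x in zip(weights, inputs))
    if full_scale is None:
        full_scale = len(weights)  # assumes weights, inputs in [0, 1]
    # Low-precision ADC: quantize the accumulated value to adc_levels codes.
    code = round(analog / full_scale * (adc_levels - 1))
    return min(max(code, 0), adc_levels - 1)
```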
'In-memory computing' is being widely explored as a novel computing paradigm to mitigate the well-known memory bottleneck. This emerging paradigm aims at embedding some aspects of computation inside the memory array, thereby avoiding frequent and expensive movement of data between the compute unit and the storage memory. In-memory computing with silicon memories has been widely explored for various memory bit-cells. Embedding computation inside the 6-transistor (6T) SRAM array is of special interest, since it is the most widely used on-chip memory. In this paper, we present a novel in-memory multiply-and-accumulate operation capable of performing parallel dot products within the 6T SRAM array without any changes to the standard bit-cell. We further study the effect of circuit non-idealities and process variations on the accuracy of the LeNet-5 and VGG neural network architectures on the MNIST and CIFAR-10 datasets, respectively. The proposed in-memory dot-product mechanism achieves 88.8% and 99% accuracy on CIFAR-10 and MNIST, respectively. Compared to a standard von Neumann system, the proposed system is 6.24× better in energy consumption and 9.42× better in delay.
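One common way to study the effect of circuit non-idealities and process variations on network accuracy, as done above, is to perturb each in-memory multiply-and-accumulate with noise during evaluation. The sketch below lumps all non-idealities into additive Gaussian noise; this is a simplifying assumption for illustration, not the paper's exact error model, and the sigma value is arbitrary.

```python
import random

def noisy_mac(weights, inputs, sigma=0.05):
    # Ideal multiply-and-accumulate perturbed by additive Gaussian noise,
    # standing in for bitline, sense-amplifier, and variation effects.
    ideal = sum(w * x for w, x in zip(weights, inputs))
    return ideal + random.gauss(0.0, sigma * max(abs(ideal), 1.0))
```

Running a trained network with `noisy_mac` in place of exact dot products gives a quick estimate of how much accuracy degrades under a given noise level.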
From the little we know about the human brain, its inherent cognitive mechanism is very different from that of de facto state-of-the-art computing platforms. The human brain uses distributed yet integrated memory and computation units, unlike the physically separate memory and computation cores in typical von Neumann architectures. Despite the huge success of artificial intelligence, hardware systems running these algorithms consume orders of magnitude more energy than the human brain, mainly due to heavy data movement between the memory unit and the computation cores. Spiking neural networks (SNNs), built using bio-plausible neuron and synaptic models, have emerged as a power-efficient choice for designing cognitive applications. These algorithms involve several lookup-table (LUT) based function evaluations, such as high-order polynomials and transcendental functions, for solving complex neuro-synaptic models, which typically require additional storage and thus bigger memories. To that effect, we propose 'SPARE', an in-memory, distributed processing architecture built on ROM-embedded RAM technology, for accelerating SNNs. ROM-embedded RAMs allow storage of LUTs (for neuro-synaptic models) within a typical memory array, without additional area overhead. Our proposed architecture consists of a 2-D array of processing elements (PEs), wherein each PE has its own ROM-embedded RAM structure and executes part of the SNN computation. Since most of the computations (including multiple math-table evaluations) are done locally within each PE, unnecessary data transfers are restricted, thereby alleviating the problems arising from a physically separate, remote memory unit and computation core. SPARE thus leverages both the hardware benefits of distributed, in-memory processing and the algorithmic benefits of SNNs.
We evaluate SPARE for two different ROM-embedded RAM structures: CMOS-based ROM-embedded SRAMs (R-SRAMs) and STT-MRAM-based ROM-embedded MRAMs (R-MRAMs). We analyze trade-offs in terms of energy, area, and performance for the two technologies on a range of image classification benchmarks. Furthermore, we leverage the additional storage density to implement complex neuro-synaptic functionalities, enhancing the utility of the proposed architecture by provisioning implementation of any neuron or synaptic behavior as necessitated by the application. Our results show up to ∼1.75×, ∼1.95×, and ∼1.95× improvements in energy, iso-storage area, and iso-area performance, respectively, for neural network accelerators built on ROM-embedded RAM primitives.
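The LUT-based function evaluations described above can be sketched as a uniformly sampled table with linear interpolation, the kind of math table a ROM-embedded RAM would hold. The table size, input range, and the choice of tanh as the stored transcendental function are illustrative assumptions.

```python
import math

TABLE_SIZE = 256                 # hypothetical ROM capacity per function
X_MIN, X_MAX = -8.0, 8.0         # illustrative input range
STEP = (X_MAX - X_MIN) / (TABLE_SIZE - 1)
# Precomputed samples, conceptually embedded as ROM within the RAM array
TANH_TABLE = [math.tanh(X_MIN + i * STEP) for i in range(TABLE_SIZE)]

def lut_tanh(x: float) -> float:
    # Clamp to the tabulated range, then linearly interpolate between
    # the two nearest stored samples.
    x = min(max(x, X_MIN), X_MAX)
    idx = (x - X_MIN) / STEP
    i = min(int(idx), TABLE_SIZE - 2)
    frac = idx - i
    return TANH_TABLE[i] * (1 - frac) + TANH_TABLE[i + 1] * frac
```

Each PE would evaluate such tables locally, so neuro-synaptic model evaluations never leave the memory array.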
Although an average human brain might not be able to compete with modern-day computers in performing arithmetic operations, when it comes to recognition and classification tasks, biological systems are clear winners in terms of performance and energy efficiency. The building blocks of all such biological systems are neurons and synapses, and in order to exploit the benefits of such systems, novel devices are being explored to mimic their behavior. We propose a leaky-integrate-fire (LIF) neuron using the physics of automotion of magnetic domain walls (DWs). Due to the shape anisotropy in a high-aspect-ratio magnet, a DW has a tendency to move on its own, without any external driving force. This property can be exploited to mimic the realistic dynamics of spiking neurons without any extra energy penalty. We analyze the dynamics of a DW under automotion and show that they can be approximated to mimic LIF neuronal dynamics. We propose a compact, energy-efficient magnetic neuron that can be directly cascaded to a memristive crossbar array of synapses, thereby evading additional interfacing circuitry. Furthermore, we develop a device-to-system-level behavioral model to underscore the applicability of the proposal in a typical handwritten-digit recognition application.
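The LIF dynamics being mimicked above can be summarized in a discrete-time software sketch. This models only the abstract neuron behavior, not the domain-wall automotion physics; the threshold, leak rate, and reset-to-zero behavior are illustrative assumptions.

```python
def lif_neuron(input_current, v_th=1.0, leak=0.1, dt=1.0):
    # Returns the spike train produced by a stream of input currents:
    # the membrane potential integrates the input, decays with a leak
    # term, and resets after crossing the firing threshold.
    v, spikes = 0.0, []
    for i in input_current:
        v = v + dt * (i - leak * v)   # leaky integration
        if v >= v_th:
            spikes.append(1)          # fire
            v = 0.0                   # reset membrane potential
        else:
            spikes.append(0)
    return spikes
```

In the proposal, the DW position plays the role of the membrane potential `v`, and automotion supplies the leak without a dedicated circuit.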
Machine learning applications, especially deep neural networks (DNNs), have seen ubiquitous use in computer vision, speech recognition, and robotics. However, the growing complexity of DNN models has necessitated efficient hardware implementations. The key compute primitive of DNNs is the matrix-vector multiplication, which leads to significant data movement between memory and processing units in today's von Neumann systems. A promising alternative is colocating memory and processing elements, which can be further extended to performing computations inside the memory itself. We believe in-memory computing is a propitious candidate for future DNN accelerators, since it mitigates the memory-wall bottleneck. In this article, we discuss various in-memory computing primitives in both CMOS and emerging nonvolatile memory (NVM) technologies. Subsequently, we describe how such primitives can be incorporated in standalone machine learning accelerator architectures. Finally, we analyze the challenges associated with designing such in-memory computing accelerators and explore future opportunities.