Convolutional neural networks (CNNs) play a key role in deep learning applications. However, the large storage overhead and substantial computational cost of CNNs are problematic in hardware accelerators. The computing-in-memory (CIM) architecture has demonstrated great potential for efficiently computing large-scale matrix-vector multiplication. However, the intensive multiply-and-accumulate (MAC) operations executed on CIM macros remain a bottleneck for further improvement of energy efficiency and throughput. To reduce computational costs, model compression is a widely studied method of shrinking model size. For implementation on a static random access memory (SRAM) CIM-based accelerator, the model compression algorithm must account for the hardware limitations of CIM macros. In this study, a software-hardware co-design approach is proposed, comprising MARS, an SRAM CIM-based CNN accelerator that utilizes multiple SRAM CIM macros as processing units and supports sparse CNNs, and an SRAM CIM-aware model compression algorithm that accounts for the CIM architecture to reduce the number of network parameters. With the proposed co-design method, MARS reaches over 700 and 400 FPS on CIFAR-10 and CIFAR-100, respectively, and achieves 52.3 and 88.2 TOPs/W on VGG16 and ResNet18, respectively.
Nonvolatile computing-in-memory (nvCIM) exhibits high potential for neuromorphic computing involving massively parallel computation and for achieving high energy efficiency. nvCIM is especially suitable for deep neural networks, which must perform large numbers of matrix-vector multiplications. However, a comprehensive quantization algorithm has yet to be developed that overcomes the hardware limitations of resistive random access memory (ReRAM)-based nvCIM, such as the number of I/Os, word lines (WLs), and ADC output levels. In this article, we propose a quantization training method for compressing deep models. The method comprises three steps: input and weight quantization, ReRAM convolution (ReConv), and ADC quantization. ADC quantization addresses the error-sampling problem by using the Gumbel-softmax trick. With a 4-bit ADC in the nvCIM macro, accuracy decreases by only 0.05% and 1.31% on MNIST and CIFAR-10, respectively, compared with the accuracies obtained under an ideal ADC. The experimental results indicate that the proposed method effectively compensates for the hardware limitations of nvCIM macros.
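The Gumbel-softmax trick mentioned above makes the discrete choice of an ADC output level differentiable during training. A minimal sketch of the idea follows; the distance-based logits, temperature, and 4-bit level grid are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def gumbel_softmax_quantize(x, levels, tau=0.5, rng=None):
    """Soft (differentiable) selection of an ADC output level via the
    Gumbel-softmax trick. x is the analog MAC value; levels are the
    candidate ADC codes. Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    # Logits: levels closer to x get higher scores (assumed encoding).
    logits = -np.abs(levels - x)
    # Gumbel noise turns the soft argmax into a reparameterized sample.
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=levels.shape)))
    z = (logits + g) / tau
    y = np.exp(z - z.max())        # numerically stable softmax
    y /= y.sum()
    # Soft quantized output: expectation over levels (hard pick at inference).
    return float(y @ levels)

levels = np.linspace(0.0, 15.0, 16)          # 4-bit ADC codes
q = gumbel_softmax_quantize(7.3, levels, tau=0.1)
```

At low temperature the soft output concentrates on the nearest ADC code, while gradients still flow to the logits during training.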
Advanced AI edge chips require multibit input (IN), weight (W), and output (OUT) precision for CNN multiply-and-accumulate (MAC) operations to achieve inference accuracy sufficient for practical applications. Computing-in-memory (CIM) is an attractive approach to improving the energy efficiency of MAC operations (EF_MAC) under the memory-wall constraint. Previous SRAM-CIM macros demonstrated a binary MAC [4], an in-array 8b W-merging scheme with near-memory computing (NMC) using 6T SRAM cells (limited output precision) [5], a 7bIN-1bW MAC using a 10T SRAM cell (large area) [3], a 4bIN-5bW MAC with a twin-8T (T8T) SRAM cell [1], and an 8bIN-1bW NMC scheme with 8T SRAM cells (long MAC latency, T_MAC) [2]. However, previous works have not achieved high IN/W/OUT precision together with fast T_MAC, compact area, high EF_MAC, and robust readout against process variation, due to (1) the small sensing margin in word-wise multibit MAC operations, (2) the tradeoff between read accuracy and area overhead under process variation, and (3) limited EF_MAC caused by decoupled software and hardware development.
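The IN/W precision notation above (e.g., 4bIN-5bW) refers to a multibit dot product decomposed into binary partial products, each weighted by its combined bit significance. The following sketch models that decomposition with ideal digital accumulation; real macros accumulate the partial sums in the analog domain on the bit lines, and signed-operand handling is omitted:

```python
import numpy as np

def bitserial_mac(inputs, weights, in_bits=4, w_bits=5):
    """Dot product built from binary partial products, modeling how a
    multibit CIM MAC is typically decomposed (behavioral sketch, not a
    circuit model). Operands are unsigned integers within the bit widths."""
    acc = 0
    for i in range(in_bits):               # input bits applied serially
        in_bit = (inputs >> i) & 1
        for j in range(w_bits):            # weight bits stored column-wise
            w_bit = (weights >> j) & 1
            # Binary partial sum, shifted by the combined bit significance.
            acc += int(np.sum(in_bit & w_bit)) << (i + j)
    return acc

x = np.array([3, 12, 7])    # 4-bit inputs
w = np.array([5, 1, 30])    # 5-bit weights
result = bitserial_mac(x, w)    # equals np.dot(x, w) = 237
```

The in_bits * w_bits binary partial sums are exactly the quantities a binary-cell CIM array can compute in parallel before shift-and-add recombination.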
Convolutional neural networks (CNNs) play a key role in deep learning applications. However, the large storage overhead and substantial computation cost of CNNs are problematic in hardware accelerators. The computing-in-memory (CIM) architecture has demonstrated great potential for efficiently computing large-scale matrix-vector multiplication. However, the intensive multiply-and-accumulate (MAC) operations executed at the crossbar array and the limited capacity of CIM macros remain bottlenecks for further improvement of energy efficiency and throughput. To reduce computation costs, network pruning and quantization are two widely studied compression methods for shrinking model size. However, most model compression algorithms can only be implemented in digital CNN accelerators. For implementation on a static random access memory (SRAM) CIM-based accelerator, the model compression algorithm must account for the hardware limitations of CIM macros, such as the number of word lines and bit lines that can be turned on simultaneously, and for how the weights are mapped onto the SRAM CIM macro. In this study, a software-hardware co-design approach is proposed to design an SRAM CIM-based CNN accelerator and an SRAM CIM-aware model compression algorithm. To avoid the high-precision MAC operations required by batch normalization (BN), a quantization algorithm that fuses BN into the weights is proposed. Furthermore, to reduce the number of network parameters, a sparsity algorithm that accounts for the CIM architecture is proposed. Finally, MARS, a CIM-based CNN accelerator that utilizes multiple SRAM CIM macros as processing units and supports sparse neural networks, is proposed.
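The BN-fusion step rests on the standard folding identity, in which the per-channel BN scale and shift are absorbed into the preceding layer's weights and bias. A minimal sketch of that identity follows; the paper's quantization-aware details are not reproduced here:

```python
import numpy as np

def fuse_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch normalization into the preceding conv layer.
    weight: (out_ch, in_ch, kh, kw); BN statistics are per output channel.
    Standard BN folding; the CIM-aware quantization variant is omitted."""
    scale = gamma / np.sqrt(var + eps)             # per-channel BN scale
    fused_w = weight * scale[:, None, None, None]  # scale each filter
    fused_b = (bias - mean) * scale + beta         # absorb shift into bias
    return fused_w, fused_b

# Tiny check with a 1x1 conv (reduces to a matrix multiply):
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3, 1, 1)); b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.1
x = rng.standard_normal(3)
fused_w, fused_b = fuse_bn(W, b, gamma, beta, mean, var)
```

After fusion, inference needs only the fused weights and bias, so no separate high-precision BN arithmetic remains in the MAC path.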
This paper proposes a miniature electronic nose (e-nose) for breath analysis. The proposed e-nose is composed of a sensor array, signal processing, and artificial intelligence. The conductive gas sensor is a micro-heater-based sensing device coated with deposited nanomaterials. The sensor signal is processed by an AI edge accelerator based on the computing-in-memory (CIM) architecture, achieving a system energy efficiency of 18.42 TOPs/W.
Energy-efficient always-on motion-detection (MD) sensors are in high demand and are widely used in machine vision applications. To achieve real-time and continuous motion monitoring, high-speed low-power temporal-difference imagers with corresponding processing architectures have been widely investigated [1-6]. Event-based dynamic vision sensors (DVS) have been reported to reduce redundant data and power through asynchronous timestamped event-address readout [1], [2]. However, a DVS needs special data processing to collect enough events for information extraction, and it suffers from noise and dynamic effects, which limit the advantages of low-latency pixel event reporting. Furthermore, low sensitivity (no integration) and the lack of static information are additional drawbacks of DVS. Frame-based MD rolling-shutter sensors [3], [4] have been reported to reduce data bandwidth and power through sub-sampling, at the cost of low resolution and motion blur. Global-shutter MD sensors using in-pixel analog memory for reference-image storage have also been reported [5], [6]. However, such sensors require a special process technology to implement low off-state-current devices. In a frame-based MD sensor, the analog processing circuit and the two successive frames required for the temporal-difference operation come at a cost in power, area, and speed. To address these drawbacks, we present a frame-based MD vision sensor featuring three operation modes: image capture (IC), frame difference (FD) with on/off event detection, and saliency detection (SD). Using a low-voltage ping-pong PWM pixel and multi-mode operation, it achieves high-speed low-power full-resolution MD, consecutive event-frame reporting, and image capture. Moreover, saliency detection by counting block-level event numbers is implemented for efficient optic-flow extraction by a companion processing chip using simple neuromorphic circuits.
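The FD and SD modes can be modeled behaviorally in software: an on/off event map thresholded from two successive frames, and a per-block event count for saliency. The following sketch uses an illustrative threshold and block size and does not model the pixel-level circuit:

```python
import numpy as np

def frame_difference_events(prev, curr, thresh=16):
    """On/off event maps from two successive frames (FD mode sketch;
    the threshold value is illustrative)."""
    diff = curr.astype(np.int16) - prev.astype(np.int16)
    on = diff > thresh       # brightness increased
    off = diff < -thresh     # brightness decreased
    return on, off

def block_saliency(on, off, block=8):
    """Event count per block x block tile (SD mode sketch)."""
    ev = (on | off).astype(np.int32)
    h, w = ev.shape
    return ev[:h - h % block, :w - w % block] \
        .reshape(h // block, block, w // block, block).sum(axis=(1, 3))

prev = np.zeros((32, 32), dtype=np.uint8)
curr = prev.copy()
curr[8:16, 8:16] = 200           # a bright patch appears
on, off = frame_difference_events(prev, curr)
sal = block_saliency(on, off)    # 4x4 saliency map; one block is active
```

A companion processor can then restrict optic-flow extraction to blocks with nonzero saliency counts instead of scanning the full frame.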