Computation-in-memory (CIM) is a promising candidate to improve the energy efficiency of multiply-and-accumulate (MAC) operations in artificial intelligence (AI) chips. This work presents a static random access memory (SRAM) CIM unit-macro using: 1) compact-rule-compatible twin-8T (T8T) cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation; 2) an even-odd dual-channel (EODC) input mapping scheme to extend input bandwidth; 3) a two's complement weight mapping (C2WM) scheme to enable MAC operations using positive and negative weights within a cell array in order to reduce area overhead and computational latency; and 4) a configurable global-local reference voltage generation (CGLRVG) scheme for kernels of various sizes and bit precision. A 64 × 60 b T8T unit-macro with 1-, 2-, 4-b inputs, 1-, 2-, 5-b weights, and up to 7-b MAC-value (MACV) outputs was fabricated as a test chip using a foundry 55-nm process. The proposed SRAM-CIM unit-macro achieved an access time of 5 ns and energy efficiency of 37.5-45.36 TOPS/W under 5-b MACV output.
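The arithmetic behind a two's-complement weight mapping can be illustrated in software. The sketch below is a functional model only (names and bit widths are illustrative, not from the paper): each weight bit column contributes its place value 2^j to the accumulation, except the MSB column, which contributes -2^(w_bits-1), so positive and negative weights share one cell array without a separate sign path.

```python
def twos_complement_mac(inputs, weights, w_bits=5):
    """Functional model of a MAC over signed weights stored as
    w_bits two's-complement codes (illustrative of a C2WM-style mapping)."""
    acc = 0
    for x, w in zip(inputs, weights):
        code = w & ((1 << w_bits) - 1)  # two's-complement encoding of w
        for j in range(w_bits):
            bit = (code >> j) & 1
            # MSB column carries negative place value; others are positive
            place = -(1 << j) if j == w_bits - 1 else (1 << j)
            acc += x * bit * place
    return acc

# 3*(-2) + 1*5 = -1, matching an ordinary signed dot product
assert twos_complement_mac([3, 1], [-2, 5]) == -1
```

The check at the end confirms that the bit-column view agrees with the usual signed dot product, which is the property that lets one array handle both weight signs.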
Previous SRAM-based computing-in-memory (SRAM-CIM) macros suffer from small read margins for high-precision operations, large cell-array area overhead, and limited compatibility with many input and weight configurations. This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using: 1) a hybrid structure combining 6T-SRAM-based in-memory binary product-sum (PS) operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead; 2) column-based place-value-grouped weight mapping and a serial-bit input (SBIN) mapping scheme to facilitate reconfiguration and increase array efficiency under various input and weight configurations; 3) a self-reference multilevel reader (SRMLR) to reduce read-out energy and achieve a sensing margin 2<inline-formula> <tex-math notation="LaTeX">\times </tex-math></inline-formula> that of the mid-point reference scheme; and 4) an input-aware bitline voltage compensation scheme to ensure successful read operations across various input-weight patterns. A 4-Kb configurable 6T-SRAM CIM unit-macro was fabricated using a 55-nm CMOS process with foundry 6T-SRAM cells. The resulting macro achieved an access time of 3.5 ns per cycle (pipelined) and energy efficiency of 0.6-40.2 TOPS/W under binary to 8-b input/8-b weight precision.
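The split between in-array binary product-sums and digital shift-add accumulation can be sketched as follows. This is a minimal functional model, not the macro's circuit behavior: each cycle the array evaluates one binary partial sum over an input bit-plane, and the near-memory digital stage shift-adds the partial sums into the multibit MAC result.

```python
def bit_serial_mac(inputs, weights, in_bits=8):
    """Functional model of serial-bit-input MAC accumulation:
    in_bits cycles of binary product-sums, combined by shift-add."""
    mac = 0
    for t in range(in_bits):  # one cycle per input bit-plane, LSB first
        # in-array step: binary partial sum for bit-plane t
        ps = sum(((x >> t) & 1) * w for x, w in zip(inputs, weights))
        # near-memory step: shift-add the partial sum into the result
        mac += ps << t
    return mac

# 3*2 + 5*7 = 41, recovered from three binary partial sums
assert bit_serial_mac([3, 5], [2, 7]) == 41
```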
This article presents a computing-in-memory (CIM) structure aimed at improving the energy efficiency of edge devices running multi-bit multiply-and-accumulate (MAC) operations. The proposed scheme includes a 6T SRAM-based CIM (SRAM-CIM) macro capable of: 1) weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations; 2) a compact 6T local computing cell to perform multiplication with suppressed sensitivity to process variation; 3) an algorithm-adaptive low MAC-aware readout scheme to improve energy efficiency; 4) a bitline header selection scheme to enlarge the signal margin; and 5) a small-offset margin-enhanced sense amplifier for robust read operations against process variation. A fabricated 28-nm 64-kb SRAM-CIM macro achieved access times of 4.1-8.4 ns with energy efficiency of 11.5-68.4 TOPS/W, while performing MAC operations with 4- or 8-b input and weight precision.
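The weight-bitwise decomposition differs from input-bit-serial schemes: the MAC is computed one weight bit-place at a time, so each readout sees only a small partial sum and the sensing margin stays large. A minimal functional model (names illustrative, unsigned weights assumed for brevity):

```python
def weight_bitwise_mac(inputs, weights, w_bits=8):
    """Functional model of a weight-bitwise MAC (WbwMAC-style):
    each weight bit-place j produces a partial sum over the inputs
    whose weight bit j is 1; partial sums are combined by shift-add."""
    mac = 0
    for j in range(w_bits):
        ps = sum(x for x, w in zip(inputs, weights) if (w >> j) & 1)
        mac += ps << j
    return mac

# 3*5 + 1*9 + 2*0 = 24, reassembled from per-bit-place partial sums
assert weight_bitwise_mac([3, 1, 2], [5, 9, 0]) == 24
```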
Computing-in-memory (CIM) based on SRAM is a promising approach to achieving energy-efficient multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices; however, existing SRAM-CIM chips support only DNN inference. The flow of training data requires that CIM arrays perform convolutional computation using transposed weight matrices. This article presents a two-way transpose (TWT) multiply cell with high resistance to process variation and a novel read scheme that uses input-aware zone prediction of maximum partial MAC values to enhance the signal margin for robust readout. A 28-nm 64-kb TWT CIM macro fabricated using foundry-provided compact 6T-SRAM cells achieved <inline-formula> <tex-math notation="LaTeX">T_{\text {AC}} </tex-math></inline-formula> of 3.8-21 ns and energy efficiency of 7-61.1 TOPS/W in performing MAC operations using 2-8-b inputs, 4-8-b weights, and 10-20-b outputs.
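The need for transposed-weight MACs in training can be made concrete with a small numerical sketch (NumPy notation, shapes illustrative): forward propagation computes y = W x, while backpropagating the error through the same layer computes dx = Wᵀ dy, so a two-way transpose array must drive MACs along both the rows and the columns of the stored weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(4, 6))   # 4-b signed weights (illustrative)
x = rng.integers(0, 4, size=6)         # 2-b inputs (illustrative)

y = W @ x                              # forward MAC: row-wise access of W
dy = rng.integers(-2, 3, size=4)       # error propagated from the next layer
dx = W.T @ dy                          # backward MAC: column-wise access of W

# Same stored weights, two access directions; a CIM array that only
# reads along rows cannot produce dx without rewriting the weights.
assert dx.shape == (6,)
```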
The emerging edge intelligence requires low-cost, energy-efficient neural network (NN) processors. Supporting various types of edge NN models leads to extra circuit overhead, making the design of a unified NN processor with high energy/area efficiency challenging. This work presents a frequency-domain-accelerated unified NN processor, named STICKER-T, which combines algorithm-, architecture-, and circuit-level optimization to achieve high energy/area efficiency. By utilizing the block-circulant NN (CirCNN) algorithm, this work supports frequency-domain acceleration and a unified workflow for convolutional, fully connected, and recurrent NNs (CNN/FC/RNN). Three key innovations are proposed. First, a block-circulant-accelerated chip architecture is implemented to support the unified CNN/FC/RNN workflow. Second, a multi-bit 8-128-point global-parallel local-bit-serial fast Fourier transform (FFT) module is designed for efficient high-throughput FFT/inverse FFT (IFFT) operation. Third, by utilizing a 6T hierarchical-bitline-switching transpose-SRAM (HBST-TRAM), 2-D data reuse is enabled in the proposed multi-bit frequency-domain multiply-accumulate (MAC) array. STICKER-T was fabricated in a 65-nm CMOS technology. It operates at 0.54-1.15 V and 25-200 MHz with 13.3-339-mW power consumption, and its peak energy efficiency reaches 140.3 TOPS/W. It shows 8.1<inline-formula> <tex-math notation="LaTeX">\times </tex-math></inline-formula> higher area efficiency and 4.2<inline-formula> <tex-math notation="LaTeX">\times </tex-math></inline-formula> higher energy efficiency at 4-bit precision compared with the state-of-the-art reconfigurable NN processor.
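The block-circulant trick that makes frequency-domain acceleration possible can be shown in a few lines. This is a mathematical sketch, not the chip's datapath: a circulant block is fully described by its first column c, and the matrix-vector product C x reduces to IFFT(FFT(c) · FFT(x)), which is why an efficient FFT/IFFT module yields a unified datapath for CNN/FC/RNN layers.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with x,
    using the circular-convolution theorem instead of O(n^2) MACs."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

c = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([1.0, 0.5, -1.0, 2.0])

# Explicit circulant matrix for comparison: column k is c rolled down by k.
C = np.column_stack([np.roll(c, k) for k in range(4)])
assert np.allclose(C @ x, circulant_matvec_fft(c, x))
```

Storing only c instead of the full block is also where the CirCNN algorithm gets its weight compression.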
Computation-in-memory (CIM) is a promising avenue to improve the energy efficiency of multiply-and-accumulate (MAC) operations in AI chips, and multi-bit CNNs are required for high inference accuracy in many applications [1]-[5]. SRAM-based CIM faces several challenges and tradeoffs: (1) a tradeoff between signal margin, cell stability, and area overhead; (2) process variation in the high-weighted bits dominates the end-result error rate; and (3) a tradeoff between input bandwidth, speed, and area. Previous SRAM CIM macros were limited to binary MAC operations for fully connected networks [1], or they used CIM for multiplication [2] or weight-combination operations [3] with additional large-area near-memory computing (NMC) logic for summation or MAC operations.
In-memory computing has emerged as a promising solution to address the logic-memory performance gap. We propose design techniques using monolithic-3D integration to achieve reliable multirow activation, which in turn enables computation as part of data readout. Our design is 1.8x faster than existing techniques for Boolean computations. We quantitatively show that cell stability is unaffected when multiple rows are activated, thereby requiring no extra hardware to maintain cell stability during computations. An in-memory digital-to-analog conversion technique is proposed using a 3D-CAM primitive; the design effectively utilizes the relatively low-strength layer-2 transistors and provides 7x power savings compared with a specialized in-memory converter. Lastly, we present a linear classifier system that makes use of the above-mentioned techniques and computes vector-matrix multiplication 47x faster than a dedicated hardware engine.
This paper presents the first monolithic 3D two-layer reconfigurable SRAM macro capable of executing multiple Compute-in-Memory (CiM) tasks as part of data readout. Fabricated using a low-cost FinFET-based 3D+-IC process, the SRAM offers concurrent data reads from both layers and writes from layer 2 at a minimum supply voltage (Vdd,min) of 0.4 V. A 12.8x improvement in computation latency is achieved compared with near-memory computation of successive Boolean operations.
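The Boolean functions typically recovered from such read-based CiM can be modeled at the logic level. The sketch below is only a functional assumption about what multirow-activation readout yields (commonly, wordwise AND/OR of the activated rows, with complements from the complementary bitline side), not the circuit described in the paper:

```python
def multirow_boolean(word_a, word_b, width=8):
    """Logic-level model (assumed) of Boolean results obtainable by
    activating two SRAM rows and sensing the shared bitline pair."""
    mask = (1 << width) - 1
    return {
        "AND": word_a & word_b,            # both cells must pull the bitline
        "OR": word_a | word_b,             # either cell pulls the bitline
        "NOR": ~(word_a | word_b) & mask,  # complementary-bitline side
    }

r = multirow_boolean(0b11001010, 0b10100110)
```

Successive operations of this kind are what the 12.8x latency comparison above refers to: each Boolean result is produced in a single read cycle instead of a read-compute-writeback loop.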
Advanced AI edge chips require multibit input (IN), weight (W), and output (OUT) for CNN multiply-and-accumulate (MAC) operations to achieve an inference accuracy that is sufficient for practical applications. Computing-in-memory (CIM) is an attractive approach to improve the energy efficiency (EF_MAC) of MAC operations under a memory-wall constraint. Previous SRAM-CIM macros demonstrated a binary MAC [4], an in-array 8b W-merging with near-memory computing (NMC) using 6T SRAM cells (limited output precision) [5], a 7b-IN 1b-W MAC using a 10T SRAM cell (large area) [3], a 4b-IN 5b-W MAC with a T8T SRAM cell [1], and an 8b-IN 1b-W NMC with 8T SRAM (long MAC latency (TAC)) [2]. However, previous works have not achieved high IN/W/OUT precision with fast TAC, compact area, high EF_MAC, and robust readout against process variation, due to (1) a small sensing margin in word-wise multiple-bit MAC operations, (2) a tradeoff between read accuracy and area overhead under process variation, and (3) limited EF_MAC due to the decoupling of software and hardware development.
Many AI edge devices require local intelligence to achieve fast computing time (tAC), high energy efficiency (EF), and privacy. The transfer-learning approach is a popular solution for AI edge chips, wherein an AI model trained in the cloud is fine-tuned (re-trained) in a few of its neural layers on the edge device. This enables the dynamic incorporation of data from in-situ environments or private information. Computing-in-memory (CIM) is a promising approach to improve EF for AI edge chips; however, existing CIM schemes [1]-[5] support only inference with forward (FWD) propagation, and not training, which requires both FWD and backward (BWD) propagation, due to differences in weight-access flow between FWD and BWD propagation. As Fig. 15.2.1 shows, efforts to increase the precision of the input (IN), weight (W), and/or output (OUT) tend to degrade tAC and EF for training operations irrespective of scheme: digital FWD and BWD (DF-DB) or CIM-FWD-digital-BWD (CiMF-DB). This work develops a two-way transpose (TWT) SRAM-CIM macro supporting multibit MAC operations for FWD and BWD propagation with fast tAC and high EF within a compact area. The proposed scheme features (1) a TWT multiply cell (TWT-MC) with high resistance to process variation; and (2) a small-offset gain-enhancement sense amplifier (SOGE-SA) to tolerate a small read margin. A 28nm 64Kb TWT SRAM-CIM macro was fabricated using a foundry-provided compact 6T-SRAM cell, making it the first SRAM-CIM device to support both inference and training operations. This macro also demonstrates the fastest tAC (3.8-21ns) and highest EF (7-61.1 TOPS/W) for MAC operations using 2-8b inputs, 4-8b weights, and 12-20b outputs.