With advances in deep-neural-network applications, increasingly large data movement through memory channels is becoming inevitable: in particular, RNN and MLP applications are memory bound, and memory is the performance bottleneck [1]. DRAM featuring processing in memory (PIM) significantly reduces data movement [1]-[4], and system performance is enhanced by the large internal parallel bank bandwidth. Among DRAM-based PIM proposals, [3] is near commercialization, but the required HBM technology may prevent it from being applied to other applications due to its high cost [5]. In this situation, an accelerator-in-memory (AiM) based on GDDR6 may be applicable: it has a relatively low cost, is compatible with the GDDR6 interface, and is designed to accelerate deep-learning (DL) applications. AiM offers a peak throughput of 1 TFLOPS with processing units (PUs) running at 1 GHz, exploiting the characteristics of 16-Gb/s GDDR6. It can also support many applications because it provides various activation functions. This paper first describes the AiM architecture and the command set supported for DL operations. Next, the DL operations in the PU and the supported activation functions are described. Finally, we present evaluation results of the DL behavior of AiM at the package and system levels.
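As a back-of-envelope check of the 1-TFLOPS figure, the sketch below assumes a configuration of two GDDR6 channels with 16 banks each, one PU per bank, and a 16-wide multiply-accumulate per 1-GHz cycle; these parameters are illustrative assumptions, not the paper's exact PU organization.

```python
# Hypothetical per-device configuration; only the 1-GHz PU clock and the
# ~1-TFLOPS target are taken from the text, the rest are assumptions.
CHANNELS = 2           # channels per GDDR6 device
BANKS_PER_CHANNEL = 16
MACS_PER_PU = 16       # parallel multipliers per processing unit (assumed)
FLOPS_PER_MAC = 2      # one multiply + one add
CLOCK_HZ = 1e9         # 1-GHz PU clock

peak_flops = CHANNELS * BANKS_PER_CHANNEL * MACS_PER_PU * FLOPS_PER_MAC * CLOCK_HZ
print(f"peak throughput ~ {peak_flops / 1e12:.3f} TFLOPS")  # ~1.024 TFLOPS
```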
In this article, a 1.25-V 8-Gb 16-Gb/s/pin GDDR6-based accelerator-in-memory (AiM) is presented. A dedicated command (CMD) set for deep learning (DL) is introduced to minimize latency when switching operation modes, and a bank-wide mantissa shift (BWMS) scheme is adopted to minimize calculation delay, current consumption, and circuit area during multiply-accumulate (MAC) operation. By storing a lookup table (LUT) in a reserved word line of the dynamic random access memory (DRAM) bank cell array, various activation functions (AFs) can be supported, such as the Gaussian error linear unit (GELU), sigmoid, and tanh as well as the rectified linear unit (ReLU) and leaky ReLU. Performance evaluation was conducted by measuring the fabricated chip on automated test equipment (ATE) and in a self-manufactured field-programmable gate array (FPGA)-based system. In the ATE-level evaluation, the chip operates at 16 Gb/s down to a supply voltage of 1.10 V. When evaluated with general matrix-vector multiplication (GEMV) and MNIST workloads in the FPGA-based system, performance gains of 7.5-10.5 times were confirmed compared with HBM2- or GDDR6-based systems.
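To make the LUT-based AF idea concrete, here is a minimal sketch of evaluating GELU by indexing into a precomputed table, as the chip would index a table held in a reserved DRAM word line; the table size, input range, and linear interpolation are illustrative assumptions, not the chip's actual microarchitecture.

```python
import numpy as np

# Sample the activation function into a table (stands in for the reserved
# word line); size and range are assumptions for illustration.
LUT_SIZE = 256
X_MIN, X_MAX = -8.0, 8.0
grid = np.linspace(X_MIN, X_MAX, LUT_SIZE)
# tanh approximation of GELU
gelu_lut = 0.5 * grid * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (grid + 0.044715 * grid**3)))

def lut_activation(x: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Evaluate an activation by table lookup with linear interpolation."""
    x = np.clip(x, X_MIN, X_MAX)
    pos = (x - X_MIN) / (X_MAX - X_MIN) * (LUT_SIZE - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, LUT_SIZE - 1)
    frac = pos - lo
    return (1 - frac) * lut[lo] + frac * lut[hi]

print(lut_activation(np.array([-1.0, 0.0, 1.0]), gelu_lut))
```

Swapping in a different activation (sigmoid, tanh, leaky ReLU) only changes the table contents, not the datapath, which is what makes the scheme flexible.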
With the emergence of large language models (LLMs) and generative AI, which require an enormous number of model parameters, the memory bandwidth and capacity required by high-end systems are increasing at an unprecedented rate. To meet this need, we present an extended version of high-bandwidth memory-3 (HBM3) DRAM, HBM3E, which achieves a bandwidth of 1,280 GB/s with a cube density of 48 GB. New design schemes and features, such as an all-around power through-silicon via (TSV) scheme, a 6-phase read-data-strobe (RDQS) scheme, a byte-mapping swap scheme, and a voltage-drift compensator for the write data strobe (WDQS), are implemented to achieve the extended bandwidth and capacity with enhanced reliability. The overall architecture and specifications, such as the bump-map footprint, the number of channels and I/Os, and the operating voltage, are identical to those of the latest HBM3 [1], [2]; therefore, backward compatibility is provided, avoiding system modification.
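As a sanity check on the quoted cube bandwidth, the sketch below assumes the standard HBM3 interface width of 1,024 data pins (16 channels of 64 DQs each) and derives the implied per-pin rate; the per-pin figure is an inference from the quoted numbers, not a specification citation.

```python
# Derive the per-pin data rate implied by the quoted 1,280-GB/s bandwidth,
# assuming the standard 1,024-bit HBM3 data interface.
DATA_PINS = 16 * 64        # 16 channels x 64 DQ pins each
TARGET_BW_GBPS = 1280      # GB/s, as quoted for HBM3E

per_pin_gbps = TARGET_BW_GBPS * 8 / DATA_PINS  # Gb/s per pin
print(f"implied per-pin data rate: {per_pin_gbps:.1f} Gb/s")  # 10.0 Gb/s
```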
This poster presents the system architecture, software stack, and performance analysis for SK hynix's first GDDR6-based processing-in-memory (PIM) product sample, called Accelerator-in-Memory (AiM). AiM is designed for in-memory acceleration of matrix-vector product operations, which are common in machine-learning applications. The strength of AiM comes primarily from two design factors: 1) all-bank operation support and 2) an extended DRAM command set. All-bank operations allow AiM to fully utilize the abundant internal DRAM bandwidth, making it an attractive solution for memory-bound applications. The extended command set allows the host to issue these new operations efficiently and provides a clean separation of concerns between the AiM architecture and its software-stack design. We present a dedicated FPGA-based reference platform with a software stack, which is used to validate the AiM design and evaluate its system-level performance. We also demonstrate FMC-based AiM extension cards that are compatible with off-the-shelf FPGA boards and serve as an open research platform, allowing potential collaborators and academic institutions to access our hardware and software systems.
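To illustrate why all-bank operation suits matrix-vector products, the sketch below stripes matrix rows across banks so each bank's slice can be reduced independently; the 16-bank split and row-striped layout are assumptions for illustration, not AiM's documented data mapping.

```python
import numpy as np

NUM_BANKS = 16  # assumed number of banks participating in an all-bank operation

def all_bank_gemv(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Each 'bank' holds a contiguous slice of matrix rows and computes its
    partial dot products independently; the host concatenates the results."""
    rows_per_bank = -(-W.shape[0] // NUM_BANKS)  # ceiling division
    partials = []
    for b in range(NUM_BANKS):
        rows = W[b * rows_per_bank:(b + 1) * rows_per_bank]
        partials.append(rows @ x)  # stands in for one bank's MAC burst
    return np.concatenate(partials)

W = np.random.randn(512, 1024).astype(np.float32)
x = np.random.randn(1024).astype(np.float32)
assert np.allclose(all_bank_gemv(W, x), W @ x, atol=1e-3)
```

Because every bank works on its own rows with no cross-bank communication until the final gather, this layout lets the internal bank bandwidth scale the GEMV throughput, which is the memory-bound case the poster targets.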