The demand for higher bandwidth with reduced power consumption in the mobile market is driving mobile DRAM toward advanced design techniques. The LPDDR4 proposed in this paper achieves over a 39% improvement in power efficiency and over a 4.3-Gb/s data rate with a 1.1-V supply voltage. These are challenging targets compared with those of LPDDR3. This work describes the design schemes employed in the LPDDR4 to satisfy these requirements, such as a multi-channel-per-die architecture, multiple training modes, a low-swing interface, DQS and clock frequency dividing, and internal references for data and command-address signals. The chip was fabricated in a 3-metal 2y-nm DRAM CMOS process.
With advances in deep-neural-network applications, the increasingly large data movement through memory channels is becoming inevitable: specifically, RNN and MLP applications are memory bound, and memory is the performance bottleneck [1]. DRAM featuring processing in memory (PIM) significantly reduces data movement [1]-[4], and system performance is enhanced by the large internal parallel bank bandwidth. Among DRAM-based PIM proposals, [3] is near commercialization, but the required HBM technology may prevent it from being applied to other applications due to its high cost [5]. In this situation, an accelerator-in-memory (AiM) based on GDDR6 may be applicable: it has a relatively low cost, is compatible with the GDDR6 interface, and is designed to accelerate deep-learning (DL) applications. AiM offers a peak throughput of 1 TFLOPS with processing units (PUs) running at 1 GHz, utilizing the characteristics of 16-Gb/s GDDR6. It can also support many applications, as it provides various activation functions. This paper first looks at the AiM architecture and the supported command set for DL operations. Next, the DL operations in the PU and the supported activation functions are described. Finally, we present evaluation results of the DL behavior of AiM at the package and system level.
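The 1-TFLOPS figure can be sanity-checked with simple back-of-the-envelope arithmetic. A minimal sketch follows, assuming one PU per bank and a hypothetical MAC-lane count per PU; only the 1-GHz PU clock and the ~1-TFLOPS target come from the abstract, the other parameters are illustrative assumptions:

```python
# Back-of-the-envelope peak-throughput estimate for a per-bank PIM design.
# NUM_PUS and MACS_PER_PU are illustrative assumptions, not figures from
# the paper; only the 1-GHz PU clock is stated in the text.

PU_CLOCK_HZ = 1e9          # PU speed stated in the text: 1 GHz
NUM_PUS = 16               # assumption: one PU per bank, 16 banks
MACS_PER_PU = 32           # assumption: parallel MAC lanes per PU
FLOPS_PER_MAC = 2          # one multiply plus one accumulate

peak_flops = PU_CLOCK_HZ * NUM_PUS * MACS_PER_PU * FLOPS_PER_MAC
print(f"peak throughput: {peak_flops / 1e12:.3f} TFLOPS")
# prints "peak throughput: 1.024 TFLOPS"
```

With these assumed parameters the product lands almost exactly on the quoted 1-TFLOPS figure, which illustrates why per-bank parallelism is the key lever in such designs.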
This article introduces a 192-Gb 896-GB/s 12-high stacked third-generation high-bandwidth memory (HBM3 DRAM) with low power consumption and high-reliability traits. New design schemes and features, including internal low-voltage signaling, center strobe calibration, through-silicon via (TSV) auto-calibration, a symbol-correcting in-DRAM ECC, and machine-learning-based layout optimization, allow large amounts of data transfers among the vertically stacked base and core dies with limited delay mismatch or SI degradation, as well as reduced power consumption from low-voltage swings. Experimental results confirm 896-GB/s bandwidth operations at 1.0-V voltage conditions with up to 15% improved power efficiency.
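The 896-GB/s headline bandwidth can be cross-checked against the standard HBM3 interface width. A quick sketch, assuming the HBM3 configuration of 16 channels with 64 data I/Os each; the 7-Gb/s per-pin rate is inferred from the stated aggregate bandwidth, not quoted in the text:

```python
# HBM3 bandwidth sanity check. The 1024-DQ interface width (16 channels
# x 64 data I/Os) follows the HBM3 standard; the 7-Gb/s per-pin data
# rate is inferred from the 896-GB/s aggregate figure, not stated above.
NUM_DQ = 16 * 64               # total data pins across the stack
PIN_RATE_GBPS = 7              # inferred per-pin data rate, Gb/s
bandwidth_GBps = NUM_DQ * PIN_RATE_GBPS / 8   # bits -> bytes
print(bandwidth_GBps)          # prints 896.0
```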
The demand for higher bandwidth with reduced power consumption in the mobile market is driving mobile DRAM to adopt advanced design techniques. The LPDDR4 proposed in this paper achieves over 30% improved power efficiency and over a 4.3-Gb/s data rate with a 1.1-V supply voltage. These are challenging targets compared with those of LPDDR3. This work includes various techniques, such as a multi-channel-per-die architecture, multiple training modes, a low-swing interface, DQS and clock frequency dividing, and internal reference voltages for data and command-address signals. The chip was fabricated in a 3-metal 2y-nm DRAM CMOS process.
In this article, a 1.25-V, 8-Gb, 16-Gb/s/pin GDDR6-based accelerator-in-memory (AiM) is presented. A dedicated command (CMD) set for deep learning (DL) is introduced to minimize latency when switching operation modes, and a bank-wide mantissa shift (BWMS) scheme is adopted to minimize calculation delay, current consumption, and circuit area during multiply-accumulate (MAC) operation. By storing lookup tables (LUTs) in reserved word lines of the dynamic random access memory (DRAM) bank cells, it is possible to support various activation functions (AFs), such as the Gaussian error linear unit (GELU), sigmoid, and tanh, as well as the rectified linear unit (ReLU) and leaky ReLU. Performance was evaluated by measuring the fabricated chip on automated test equipment (ATE) and in a self-manufactured field-programmable gate array (FPGA)-based system. In the ATE-level evaluation, the chip operates at 16 Gb/s down to a supply voltage of 1.10 V. When evaluated with GEMV and MNIST workloads in the FPGA-based system, performance gains of 7.5-10.5 times were confirmed compared to the HBM2-based or GDDR6-based systems.
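The idea of serving activation functions from a stored table can be illustrated with a small software sketch. This is a minimal model only: the input range, table depth, and nearest-entry quantization below are illustrative choices, whereas the chip stores its tables in reserved DRAM word lines with its own format:

```python
import math

# Minimal software sketch of LUT-based activation evaluation, assuming
# a table that covers a fixed input range at uniform resolution. The
# range, depth, and quantization here are illustrative, not the chip's.

LUT_DEPTH = 256
X_MIN, X_MAX = -8.0, 8.0

def gelu(x):
    # tanh approximation of GELU (reference function for the table)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# Precompute the table once, analogous to programming the word line.
STEP = (X_MAX - X_MIN) / (LUT_DEPTH - 1)
GELU_LUT = [gelu(X_MIN + i * STEP) for i in range(LUT_DEPTH)]

def gelu_lut(x):
    # Clamp to the covered range, then index the nearest table entry.
    x = max(X_MIN, min(X_MAX, x))
    return GELU_LUT[round((x - X_MIN) / STEP)]

print(gelu_lut(1.0))   # approximates gelu(1.0)
```

Swapping the reference function (sigmoid, tanh, leaky ReLU) changes only the precomputed table, which is exactly what makes the reserved-word-line approach flexible across AFs.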
The LPDDR product family originally sought to minimize power consumption. With the release of LPDDR5X and its 33% increase in maximum operating speed, low-power and high-speed operation have both become essential [1]. Moreover, expanding virtual-space applications (AR, XR) and the artificial-intelligence industry have accelerated demand for high-speed, low-power mobile DRAM products. Therefore, guaranteeing higher I/O speed has become a significant concern in DRAM design. This paper proposes a WCK correction strategy, a voltage-offset-calibrated receiver, and an I/O scheme for DRAM test to achieve high-speed operation. The WCK correction strategy improves the 3-sigma 4-phase skew distribution by 65%. The offset-calibrated receiver with a 1-tap decision-feedback equalizer (DFE) improves the voltage offset by 59%. The I/O for DRAM test reduces the parasitic capacitance of the DQ pad by up to 39%. Using these techniques for high-speed operation, the LPDDR5X achieves operating speeds up to 10.5 Gb/s at V_{DD2H} = 1.05 V and 10.0 Gb/s at V_{DD2H} = 0.95 V.
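The 1-tap DFE mentioned above follows a standard principle: the previously decided bit, scaled by a tap weight, is subtracted from the incoming sample before slicing, cancelling first post-cursor inter-symbol interference. A behavioral sketch with a toy channel; the tap weight and channel model are illustrative, not circuit parameters from this design:

```python
# Behavioral sketch of a 1-tap decision-feedback equalizer. TAP and the
# toy channel below are illustrative assumptions, not design values.

TAP = 0.3          # assumed post-cursor ISI coefficient

def dfe_slice(samples, tap=TAP):
    """Recover +/-1 bits from samples corrupted by one post-cursor tap."""
    bits = []
    prev = -1.0    # assumed known initial decision
    for s in samples:
        corrected = s - tap * prev          # cancel ISI from the last bit
        decision = 1.0 if corrected >= 0 else -1.0
        bits.append(decision)
        prev = decision
    return bits

# Toy channel: each transmitted bit leaks TAP into the next sample.
tx = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]
rx = [tx[i] + TAP * (tx[i - 1] if i else -1.0) for i in range(len(tx))]
print(dfe_slice(rx) == tx)   # prints True
```

Because the feedback uses hard decisions rather than the analog input, the cancellation is exact when decisions are correct, which is why a DFE improves voltage margin without amplifying noise the way a linear equalizer does.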
The increase in GPU-based AI applications, cloud-based gaming, and video-streaming services has driven the need for a new graphics memory that operates at higher bandwidth and power efficiency than the existing GDDR6 SDRAM, leading to the introduction of the GDDR7 standard [1]. Since performance degradation due to thermal throttling, power cost, and device reliability are major development considerations in high-power graphics applications, PAM3 signaling is applied on single-ended pins to improve bandwidth and power consumption while maintaining the clock frequency [2]. However, the new PAM3-related blocks supporting double the bandwidth inevitably increase absolute power and temperature. In this paper, we present additional power-reduction techniques that maintain SNR. The clocking architecture, with fast wake-up capabilities, can be partially disabled to bring active-standby current (IDD3N) as low as that of the power-down mode. The PAM3 TX and RX use a design approach that achieves high SNR and power efficiency.
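The bandwidth gain from PAM3 comes from information density: two three-level symbols provide 3^2 = 9 states, enough to carry 3 bits (8 states), i.e. 1.5 bits per symbol versus 1 for NRZ at the same symbol rate. A minimal sketch; the particular bit-to-symbol table below is arbitrary for illustration, as GDDR7 defines its own mapping:

```python
# Illustrative PAM3 mapping: 3 bits per pair of ternary symbols.
# The concrete table below is arbitrary; GDDR7 specifies its own.

LEVELS = (-1, 0, 1)
# Enumerate symbol pairs and assign the first 8 to the 8 bit patterns.
PAIRS = [(a, b) for a in LEVELS for b in LEVELS][:8]
ENCODE = {format(i, "03b"): PAIRS[i] for i in range(8)}
DECODE = {pair: bits for bits, pair in ENCODE.items()}

def pam3_encode(bits):
    """Encode a bit string (length a multiple of 3) into PAM3 symbols."""
    out = []
    for i in range(0, len(bits), 3):
        out.extend(ENCODE[bits[i:i + 3]])
    return out

symbols = pam3_encode("101001")
print(len(symbols))   # prints 4 -- four symbols carry six bits
```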
Ever since the introduction of high-bandwidth memory (HBM DRAM) and its succeeding line-ups, HBM DRAM has been heralded as a prominent solution to the memory-wall problem. However, despite continual memory advancements, the advent of high-end systems, including supercomputers, hyper-scale data centers, and machine-learning accelerators, is expediting requirements for higher-performance memory solutions. To accommodate the increasing system-level demands, we introduce HBM3 DRAM, which employs multiple new features and design schemes. Techniques such as an on-die ECC engine, internal NN-DFE I/O signaling, TSV auto-calibration, and layout optimization based on machine-learning algorithms are implemented to efficiently control timing-skew margins and SI-degradation trade-offs. Furthermore, reduced voltage swings allow for improved memory bandwidth, density, power efficiency, and reliability.
With the emergence of large-language models (LLMs) and generative AI, which require an enormous number of model parameters, the required memory bandwidth and capacity for high-end systems are increasing at an unprecedented rate. To meet this need, we present an extended version of the high-bandwidth memory-3 (HBM3 DRAM), HBM3E, which achieves a 1280-GB/s bandwidth with a cube density of 48 GB. New design schemes and features, such as an all-around power through-silicon via (TSV), a 6-phase read-data-strobe (RDQS) scheme, a byte-mapping swap scheme, and a voltage-drift compensator for the write data strobe (WDQS), are implemented to achieve extended bandwidth and capacity with enhanced reliability. The overall architecture and specifications, such as the bump-map footprint, the number of channels and I/Os, and the operating voltage, are identical to the latest HBM3 [1], [2]; therefore, backward compatibility is provided, avoiding system modification.
DRAM products have recently been adopted in a wide range of high-performance computing applications, such as cloud computing, big-data systems, and IoT devices. This demand creates larger memory-capacity requirements, thereby requiring aggressive DRAM technology-node scaling to reduce the cost per bit [1], [2]. However, DRAM manufacturers are facing technology-scaling challenges due to row hammer and refresh retention time beyond 1a-nm [2]. Row hammer is a failure mechanism in which repeatedly activating a DRAM row disturbs data in adjacent rows. Scaling down severely threatens reliability, since a reduction of DRAM cell size leads to a reduction in the intrinsic row-hammer tolerance [2], [3]. To improve row-hammer tolerance, there is a need to probabilistically activate adjacent rows with carefully sampled active addresses and to improve intrinsic row-hammer tolerance [2]. In this paper, row-hammer-protection and refresh-management schemes are presented to guarantee DRAM security and reliability despite the aggressive scaling from 1a-nm to sub-10-nm nodes. The probabilistic-aggressor-tracking scheme with a refresh-management function (RFM) and per-row hammer tracking (PRHT) improve DRAM resilience. A multi-step precharge reinforces intrinsic row-hammer tolerance, and a core-bias modulation improves retention time, even in the face of cell-transistor degradation due to technology scaling. This comprehensive scheme reduces the probability of failure due to row-hammer attacks by 93.1% and improves retention time by 17%.
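The probabilistic-tracking idea above can be modeled in a few lines: each activate command is sampled with a small probability, and when a row is sampled, its physical neighbors receive an extra refresh. The sampling probability and neighbor distance below are illustrative assumptions; the silicon implementation and its interaction with RFM are far more involved:

```python
import random

# Toy model of probabilistic aggressor tracking. SAMPLE_PROB and the
# +/-1 victim distance are illustrative assumptions, not design values.

SAMPLE_PROB = 1 / 512          # assumed per-ACT sampling probability

def on_activate(row, refresh_fn, rng=random):
    """Call on every ACT; occasionally refresh the two adjacent rows."""
    if rng.random() < SAMPLE_PROB:
        for victim in (row - 1, row + 1):
            refresh_fn(victim)

refreshed = []
rng = random.Random(0)          # seeded for reproducibility
for _ in range(100_000):        # hammer row 1000 repeatedly
    on_activate(1000, refreshed.append, rng)
print(sorted(set(refreshed)))   # the hammered row's neighbors
```

Because the attacker cannot predict which activates are sampled, sustained hammering of any single row is caught with near certainty long before the disturbance threshold, at a refresh cost proportional to the (small) sampling probability.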