Recently, the demand for high-bandwidth graphics DRAM for game consoles and graphics cards has increased dramatically due to the development of virtual reality, artificial intelligence, deep learning, ...autonomous driving cars, etc. These applications require greater data transfer speeds than previous devices, GDDR5 [1] and GDDR5X [2], which are limited to 12Gb/s/pin. This paper introduces an 8Gb GDDR6 operating at up to 16Gb/s/pin. To exceed the prior speed limit, various bandwidth extension techniques are proposed. WCK is driven with a dividing scheme to overcome speed limitations and to reduce power consumption. In addition, a dual-band architecture with different types of nibble drivers is proposed to ensure CML-to-CMOS stability across all frequency regions: a CML nibble is used for high speed, while a CMOS nibble is used for low speed. A DC-split scheme is implemented for duty-cycle correction and skew compensation. The bandwidth of the high-frequency divider is extended by using a proposed mode-changed flip-flop. The receiver uses a loop-unrolled one-tap decision-feedback equalizer (DFE) designed to eliminate channel inter-symbol interference (ISI). A two-stage pre-amplifier is also used for bandwidth extension. The transmitter uses a 4:1 multiplexer with a half-rate sampler, which makes a 1UI pulse unnecessary and thereby minimizes full-rate operation. To secure on-chip signal transmission characteristics, the limited bandwidth of transistors in a DRAM process is extended by adopting an on-chip feedback EQ filter.
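As a behavioral illustration of the loop-unrolled one-tap DFE mentioned in this abstract, the following minimal Python sketch may help. It is not the paper's circuit: the ±1 symbol mapping, the single post-cursor tap `h1`, and the helper names `channel` and `dfe_loop_unrolled` are all assumptions made for illustration. The idea shown is that two speculative decisions are made per sample, one for each possible previous bit, and the previous decision only selects between them, which removes the feedback subtraction from the critical timing path.

```python
def channel(bits, h1, initial_bit=0):
    """Hypothetical channel model: each +/-1 symbol plus h1 times the previous symbol."""
    out = []
    prev_symbol = 2 * initial_bit - 1
    for b in bits:
        symbol = 2 * b - 1
        out.append(symbol + h1 * prev_symbol)  # main cursor + one post-cursor ISI term
        prev_symbol = symbol
    return out


def dfe_loop_unrolled(samples, h1, initial_bit=0):
    """One-tap loop-unrolled DFE (behavioral sketch, not a circuit model)."""
    bits = []
    prev_bit = initial_bit
    for y in samples:
        # Two speculative slicers, one per possible previous bit.
        decision_if_prev_1 = 1 if y > +h1 else 0  # ISI would be +h1
        decision_if_prev_0 = 1 if y > -h1 else 0  # ISI would be -h1
        # The previous decision only drives a 2:1 selection.
        prev_bit = decision_if_prev_1 if prev_bit == 1 else decision_if_prev_0
        bits.append(prev_bit)
    return bits


tx_bits = [1, 0, 0, 1, 1, 0, 1, 0]
rx = channel(tx_bits, h1=1.2)               # ISI large enough to close the eye
plain = [1 if y > 0 else 0 for y in rx]     # simple slicer at 0 for comparison
recovered = dfe_loop_unrolled(rx, h1=1.2)
print(recovered == tx_bits, plain == tx_bits)
```

In this noiseless toy model the unrolled DFE recovers the transmitted bits even when a slicer with a fixed threshold at zero does not; a real receiver also has to contend with noise, jitter, and tap-estimation error.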
Advances in silicon technology bring high-performance mobile devices and networks that connect people all over the world. In the meantime, data centers with high computational capabilities boost the ...prosperity of the social world. Emerging data centers keep requiring higher-density memory with higher data rates for processing large amounts of data. However, the implementation of high-density DRAM is hindered by large chip area, which degrades the power distribution network (PDN) and increases yield losses due to the higher probability of die defects. This paper presents a 16Gb 3.2Gb/s/pin DDR4 SDRAM that features an improved PDN and a repair strategy. The PDN is reinforced by power pads with regulators in the middle of the bank area and a staggered power-up scheme for 3D stacked (3DS) DRAM. Yield is enhanced by introducing ECC for redundant cell operation and by developing an advanced built-in self-repair scheme that automatically corrects bit-errors at the application level.
A 1.3-4-GHz quadrature-phase digital delay-locked loop (DDLL) with sequential delay control and a reconfigurable delay line is designed using a 28 nm CMOS process. The time resolution of the DDLL is ...reduced by updating the delay code sequentially. A bidirectional shift register enables this operation with low power, resulting in bang-bang jitter that is three times smaller than that of a conventional DDLL. Conventional delay control is replaced with sequential delay control after a DDLL lock to reduce the locking time. A DDLL with a wide operation range is achieved with a reconfigurable delay line. Unlike the conventional DDLL, the minimum delay difference is adjustable in the proposed structure. To achieve a wide frequency range, the minimum delay difference of the quadrature clock is increased or decreased in three operation modes. To compensate for local variations in the CMOS process, a skew calibration circuit is implemented with the DDLL. The hardware cost of skew calibration is minimized with the proposed DDLL because it shares the subblocks for sequential delay control. The average phase difference from the quadrature clocks becomes the reference for the 90° phase for skew correction. A duty-cycle corrector (DCC) is implemented by collecting the positive edges of the quadrature-phase clocks. The DDLL consumes 6.5 mW at the maximum clock frequency of 4 GHz. The peak-to-peak jitter is improved from 15.6 to 12.5 ps with sequential delay control.
With advances in deep-neural-network applications, increasingly large data movement through memory channels is becoming inevitable: specifically, RNN and MLP applications are memory bound and the ...memory is the performance bottleneck [1]. DRAM featuring processing in memory (PIM) significantly reduces data movement [1]-[4], and the system performance is enhanced by the large internal parallel bank bandwidth. Among DRAM-based PIM proposals, [3] is near commercialization, but the required HBM technology may prevent it from being applied to other applications due to its high cost [5]. In this situation, an accelerator-in-memory (AiM) based on GDDR6 may be applicable: it has a relatively low cost, is compatible with the GDDR6 interface, and is designed to accelerate deep-learning (DL) applications. AiM offers a peak throughput of 1 TFLOPS with processing units (PUs) running at 1 GHz, utilizing the 16Gb/s speed of GDDR6. It can also support many applications as it has various activation functions. This paper first looks at the AiM architecture and the supported command set for DL operations. Next, the DL operations in the PU and supported activation functions are described. Finally, we present evaluation results of DL behavior of AiM at the package and the system level.
This paper presents a compact resistor-based CMOS temperature sensor intended for dense thermal monitoring. It is based on an <inline-formula> <tex-math notation="LaTeX">RC ...</tex-math></inline-formula> poly-phase filter (PPF), whose temperature-dependent phase shift is read out by a frequency-locked loop (FLL). The PPF's phase shift is determined by a zero-crossing (ZC) detector, allowing the rest of the FLL to be realized in an area-efficient manner. Implemented in a 65-nm CMOS technology, the sensor occupies only 7000 <inline-formula> <tex-math notation="LaTeX">\mu \text{m}~^{\mathrm{ 2}} </tex-math></inline-formula>. It can operate from supply voltages as low as 0.85 V and consumes 68 <inline-formula> <tex-math notation="LaTeX">\mu \text{W} </tex-math></inline-formula>. A sensor based on a PPF made from silicided p-poly resistors and metal-insulator-metal (MIM) capacitors achieves an inaccuracy of ±0.12 °C (3<inline-formula> <tex-math notation="LaTeX">\sigma </tex-math></inline-formula>) from −40 °C to 85 °C and a resolution of 2.5 mK (rms) in a 1-ms conversion time. This corresponds to a resolution figure-of-merit (FoM) of 0.43 pJ<inline-formula> <tex-math notation="LaTeX">\cdot \text{K}~^{\mathrm{ 2}} </tex-math></inline-formula>.
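The quoted figure-of-merit can be cross-checked from the other numbers in this abstract. The sketch below assumes the definition commonly used for temperature sensors, energy per conversion multiplied by the square of the resolution; the abstract itself does not spell out the formula, and the function name `resolution_fom` is only illustrative.

```python
def resolution_fom(power_w, conversion_time_s, resolution_k):
    """Resolution FoM in J*K^2, assuming FoM = (power * conversion time) * resolution^2."""
    energy_per_conversion_j = power_w * conversion_time_s
    return energy_per_conversion_j * resolution_k ** 2


# Values taken from the abstract: 68 uW, 1-ms conversion time, 2.5 mK (rms).
fom_j_k2 = resolution_fom(power_w=68e-6, conversion_time_s=1e-3, resolution_k=2.5e-3)
fom_pj_k2 = fom_j_k2 * 1e12  # convert J*K^2 to pJ*K^2
print(f"{fom_pj_k2:.3f} pJ*K^2")
```

Under that assumed definition the result is about 0.425 pJ·K², consistent with the 0.43 pJ·K² stated in the abstract.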
In this article, a 1.25-V 8-Gb, 16-Gb/s/pin GDDR6-based accelerator-in-memory (AiM) is presented. A dedicated command (CMD) set for deep learning (DL) is introduced to minimize latency when switching ...operation modes, and a bank-wide mantissa shift (BWMS) scheme is adopted to minimize calculation delay time, current consumption, and circuit area during multiply-accumulate (MAC) operation. Storing the lookup table (LUT) in the reserved word line of the dynamic random access memory (DRAM) bank cell makes it possible to support various activation functions (AFs), such as Gaussian error linear unit (GELU), sigmoid, and Tanh, as well as rectified linear unit (ReLU) and Leaky ReLU. Performance was evaluated by measuring the fabricated chip on ATE and in a self-manufactured field-programmable gate array (FPGA)-based system. In the ATE-level evaluation, the chip operates at 16 Gb/s at supply voltages as low as 1.10 V. When evaluated with GEMV and MNIST in the FPGA-based system, performance gains of 7.5-10.5 times were confirmed compared to HBM2-based or GDDR6-based systems.
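To illustrate why a stored lookup table lets one datapath serve several activation functions, here is a minimal software sketch. It is not the AiM implementation: the table size, the input range, the linear interpolation between entries, the clamping at the table edges, and the names `build_lut` and `lut_apply` are all assumptions made for this example.

```python
import math


def gelu(x):
    """Exact GELU, used here only to fill the table and as a reference."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(fn, x_min, x_max, entries):
    """Sample fn uniformly on [x_min, x_max]; returns (table, step)."""
    step = (x_max - x_min) / (entries - 1)
    return [fn(x_min + i * step) for i in range(entries)], step


def lut_apply(lut, step, x_min, x):
    """Evaluate the tabulated function at x by linear interpolation (assumed scheme)."""
    pos = (x - x_min) / step
    pos = min(max(pos, 0.0), float(len(lut) - 1))  # clamp inputs outside the table range
    i = min(int(pos), len(lut) - 2)                # index of the left neighbour
    frac = pos - i
    return lut[i] + frac * (lut[i + 1] - lut[i])


# Swapping the table contents changes the activation; the lookup code is unchanged.
gelu_lut, gelu_step = build_lut(gelu, -4.0, 4.0, 64)
tanh_lut, tanh_step = build_lut(math.tanh, -4.0, 4.0, 64)
print(lut_apply(gelu_lut, gelu_step, -4.0, 0.5), gelu(0.5))
print(lut_apply(tanh_lut, tanh_step, -4.0, 0.5), math.tanh(0.5))
```

Note that clamping returns the edge value of the table for out-of-range inputs, which is a poor approximation for unbounded functions such as GELU at large positive inputs; a practical design would need a separate rule for that region.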
A 1.1-V 6.4-Gb/s/pin 16-Gbit DDR5 is presented in 10-nm class CMOS technology. Various functions and circuit techniques are newly adopted to improve performance and power consumption compared with ...DDR4 SDRAM. First, to realize a speed two times higher than that of DDR4, an injection-locked oscillator (ILO) delay-locked loop (DLL) is adopted for low-jitter, high-speed performance. The proposed DLL with a phase rotator (PR) and ILO allows the clock tree of the DRAM to be minimized, lowering skew and jitter in the DRAM internal clock path. Second, for high-speed write operation, DQS gate opening control and write leveling are very important to minimize the turnaround time of the DRAM, and thus a new sequence and logic for the write-level training are introduced in this article. Third, to maximize the data valid window of read DQs, duty-cycle-adjustable serializer circuit methods are proposed. Finally, to improve the interface speed, decision feedback equalization (DFE) and feedforward equalization (FFE) are adopted in the Rx and Tx, respectively. By implementing all the items mentioned earlier, the 16-Gbit DDR5 achieved 6.4-Gb/s/pin performance at a VDD of 1.05 V, with a power bandwidth efficiency 30% higher than that of DDR4.
The demand for high-performance graphics systems used for artificial intelligence, cloud gaming, and virtual reality continues to grow; this trend requires graphics systems to achieve ever higher ...bandwidths. This article proposes a GDDR6 dynamic random access memory (DRAM) with a half-rate clocking architecture and an optimized receiver and transmitter to improve high-speed operation. Furthermore, this article adopts a staggered PAD using the redistribution layer (RDL) to reduce the distance to four PADs; this mitigates the bandwidth limitation of half-rate clocking, lowers the phase mismatch, and reduces the propagation delay. The proposed half-rate clocking-based GDDR6 DRAM achieves 24 Gb/s/pin on a 1.35-V DRAM process. Also, the power-supply-induced-jitter (PSIJ) value is improved from 9.97 to 3.22 ps, compared to a GDDR6 design using quarter-rate clocking. In addition, the phase mismatch of the proposed clock distribution network (CDN) is reduced compared to the conventional CDN, resulting in an improvement of the 3-<inline-formula> <tex-math notation="LaTeX">\sigma </tex-math></inline-formula> value of the phase skew from 4.16 to 2.25 ps.
This paper introduces a physical layout design methodology that produces DRC-clean, area-efficient, and programmable layouts of digital circuits in advanced DRAM processes. The proposed methodology ...automates the layout generation process to enhance design productivity, while still providing rich customization for efficient use of area and routing resources. Process-specific parameterized cells (PCells) are combined with process-independent place-and-route functions to automatically generate area-efficient and programmable layouts. Routing grids are optimized to enhance the area and routing efficiency. The proposed method reduced the design time of digital layouts by 80% compared to manual design while maintaining high layout quality, significantly enhancing design productivity.