This paper studies the problem of designing analog circuits to achieve target specifications, which can be formulated as a multi-objective combinatorial optimization (MOCO) problem under uncertainty. We address this challenging problem using the gm/ID methodology and a reinforcement learning (RL) framework. The proposed fast RL-based analog circuit designer (fRL-AD) maintains circuits' DC bias conditions while determining the sizing parameters associated with their AC characteristics. This ensures robust convergence to optimal sizing parameters across target specifications and proficiently captures layout effects. Specifically, by decomposing the problem into a sequence of feasible problems, our pre-trained RL agent can efficiently seek a solution to each feasible problem by generating states (i.e., candidate solutions) following a learned policy. Since the sequence of feasible regions is designed to approach an optimal solution to our main problem, the RL agent can find a near-optimal solution by sequentially tackling the feasible problems. Remarkably, thanks to better initial points (or states), our approach is more efficient than directly solving the last feasible problem. Furthermore, we introduce an adaptive action space in our RL framework, which can dynamically modulate the size of the action-space elements. The proposed method provides an effective and stable design flow for various analog circuits, overcoming their traditionally low design productivity caused by reliance on human expertise and time-consuming simulations to handle uncertainties. We verify the effectiveness of our algorithm via experiments with various analog circuit topologies.
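To make the sequential-feasibility idea concrete, the Python sketch below shows how an agent might warm-start each feasible subproblem from the previous solution while an adaptive action space shrinks the perturbation magnitude. This is a minimal illustration under stated assumptions, not the fRL-AD implementation: `evaluate_specs`, the spec schedule, and the 0.7 decay factor are hypothetical, and a simple hill-climbing step stands in for the learned policy.

```python
import numpy as np

def solve_sequential_feasibility(x0, spec_schedule, evaluate_specs,
                                 max_steps=200, step0=0.1):
    """Warm-started sequence of feasibility problems (illustrative sketch,
    not the authors' fRL-AD agent)."""
    x = np.asarray(x0, dtype=float)           # sizing parameters (e.g., gm/ID, W/L)
    step = step0                              # adaptive action magnitude
    for target in spec_schedule:              # progressively tighter spec targets
        for _ in range(max_steps):
            if evaluate_specs(x) >= target:   # current subproblem is feasible
                break
            # a random perturbation stands in for the learned policy
            candidate = x + step * np.random.randn(*x.shape)
            if evaluate_specs(candidate) > evaluate_specs(x):
                x = candidate                 # accept an improving candidate state
        step *= 0.7                           # adaptive action space: finer moves
    return x
```

Because each subproblem starts from the previous solution, the final (tightest) problem is attacked from a nearby initial point, which is the efficiency argument the abstract makes.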
Advances in machine learning (ML) have ignited hardware innovations for efficient execution of ML models, many of which are memory-bound (e.g., long short-term memories, multi-layer perceptrons, and recurrent neural networks). Specifically, inference using these ML models with small batches, as would be the case at the Cloud edge, has little reuse of the large filters and is deeply memory-bound. Simultaneously, processing-in-memory or processing-near-memory (PIM or PNM) promises an unprecedentedly high-bandwidth connection between compute and memory. Fortunately, the memory-bound ML models are a good fit for PIM. We focus on digital PIM, which provides higher bandwidth than PNM and does not incur the reliability issues of analog PIM. Previous PIM and PNM approaches advocate full processor cores, which do not conform to PIM's severe area and power constraints. We describe Newton, a major DRAM maker's upcoming accelerator-in-memory (AiM) product for machine learning, which makes the following contributions: (1) To satisfy PIM's area constraints, Newton (a) places a minimal compute of only multiply-accumulate units and buffers in the DRAM, which avoids the full-core area and power overheads of previous work and thus makes PIM feasible for the first time, and (b) employs a DRAM-like interface for the host to issue commands to the PIM compute. The PIM compute is rate-matched to the internal DRAM bandwidth and employs a non-intuitive, global input vector buffer shared by the entire channel to capture input reuse while amortizing buffer area cost. To the host, Newton's interface is indistinguishable from regular DRAM, without any offloading overheads or PIM/non-PIM mode switching, and with the same deterministic latencies even for floating-point commands. (2) To prevent the PIM-host interface from becoming a bottleneck, we include three optimizations: commands which gang multiple compute operations both within a bank and across banks; complex, multi-step compute commands (both of which save critical command bandwidth); and targeted reduction of tFAW overhead. (3) To capture output vector reuse with reasonable buffering, Newton employs an unusually wide interleaved layout for the matrix. Our simulations running state-of-the-art neural networks show that, building on a realistic HBM2E-like DRAM, Newton achieves 10x and 54x average speedups over a non-PIM system with infinite compute that perfectly uses the external DRAM bandwidth and a realistic GPU, respectively.
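As a rough illustration of the dataflow described above (a global input vector buffer shared by the whole channel, per-bank MAC units, and matrix rows interleaved widely across banks), here is a hypothetical Python functional model of a PIM-style GEMV. The bank count and the row-interleaving rule are assumptions for illustration, not Newton's actual parameters.

```python
import numpy as np

def pim_gemv(matrix, vec, num_banks=16):
    """Functional model of a PIM-style GEMV (illustrative, not Newton).

    The input vector lives in one global buffer shared by every bank in
    the channel (capturing input reuse); matrix rows are interleaved
    across banks so each bank's MAC units accumulate their own output
    elements (output reuse stays within the bank).
    """
    rows, _ = matrix.shape
    global_in_buf = vec                          # one shared buffer per channel
    out = np.zeros(rows)
    for bank in range(num_banks):                # banks operate concurrently in hardware
        for r in range(bank, rows, num_banks):   # rows interleaved across banks
            out[r] = matrix[r] @ global_in_buf   # per-bank multiply-accumulate
    return out

# Quick check against a plain GEMV
A, x = np.random.randn(64, 128), np.random.randn(128)
assert np.allclose(pim_gemv(A, x), A @ x)
```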
With advances in deep-neural-network applications, the increasingly large data movement through memory channels is becoming inevitable: specifically, RNN and MLP applications are memory-bound, and the memory is the performance bottleneck [1]. DRAM featuring processing in memory (PIM) significantly reduces data movement [1]-[4], and system performance is enhanced by the large internal parallel bank bandwidth. Among DRAM-based PIM proposals, [3] is near commercialization, but the required HBM technology may prevent it from being applied to other applications due to its high cost [5]. In this situation, an accelerator-in-memory (AiM) based on GDDR6 may be applicable: it has a relatively low cost, is compatible with the GDDR6 interface, and is designed to accelerate deep-learning (DL) applications. AiM offers a peak throughput of 1 TFLOPS using processing units (PUs) running at 1 GHz, exploiting the 16 Gb/s per-pin speed of GDDR6. It can also support many applications, as it provides various activation functions. This paper first looks at the AiM architecture and the supported command set for DL operations. Next, the DL operations in the PU and the supported activation functions are described. Finally, we present evaluation results of the DL behavior of AiM at the package and system level.
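A back-of-the-envelope check of the quoted 1-TFLOPS peak follows; the split into PUs and MAC lanes is an assumption chosen to be consistent with the 1-GHz PU clock stated in the abstract, not a disclosed specification.

```python
# Sanity check of the ~1 TFLOPS peak throughput claim. The decomposition
# into PUs and MAC lanes below is assumed for illustration only.
pus           = 32     # assumed: one PU per bank, 2 channels x 16 banks
macs_per_pu   = 16     # assumed multiply-accumulate lanes per PU
flops_per_mac = 2      # one multiply plus one add per MAC
freq_hz       = 1e9    # 1 GHz PU clock (from the abstract)

peak_tflops = pus * macs_per_pu * flops_per_mac * freq_hz / 1e12
print(peak_tflops)     # 1.024 -> consistent with the ~1 TFLOPS claim
```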
In this article, a 1.25-V 8-Gb 16-Gb/s/pin GDDR6-based accelerator-in-memory (AiM) is presented. A dedicated command (CMD) set for deep learning (DL) is introduced to minimize latency when switching operation modes, and a bank-wide mantissa shift (BWMS) scheme is adopted to minimize calculation delay, current consumption, and circuit area during multiply-accumulate (MAC) operation. By storing a lookup table (LUT) in reserved word lines of the dynamic random access memory (DRAM) bank cell array, it is possible to support various activation functions (AFs), such as Gaussian error linear unit (GELU), sigmoid, and tanh, as well as rectified linear unit (ReLU) and leaky ReLU. Performance evaluation was conducted by measuring the fabricated chip on ATE and in a self-manufactured field-programmable gate array (FPGA)-based system. In the ATE-level evaluation, it operates at 16 Gb/s down to a voltage as low as 1.10 V. When evaluated with GEMV and MNIST workloads in the FPGA-based system, performance gains of 7.5-10.5 times were confirmed compared to HBM2-based or GDDR6-based systems.
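The LUT-based activation-function idea can be sketched functionally as follows; the table size, input range, and nearest-entry indexing are assumptions (real hardware may use a different resolution or interpolation), with the NumPy array standing in for the values stored in reserved DRAM word lines.

```python
import numpy as np

def build_af_lut(fn, lo=-8.0, hi=8.0, entries=1024):
    """Precompute an activation-function lookup table, analogous to storing
    AF values in reserved DRAM word lines (sizes are assumptions)."""
    xs = np.linspace(lo, hi, entries)
    return xs, fn(xs)

def lut_activation(x, xs, ys):
    """Approximate fn(x) by nearest-entry lookup; inputs outside the table
    range clamp to the end entries."""
    idx = np.round((x - xs[0]) / (xs[1] - xs[0])).astype(int)
    return ys[np.clip(idx, 0, len(xs) - 1)]

# Example: sigmoid served from the LUT
xs, ys = build_af_lut(lambda v: 1.0 / (1.0 + np.exp(-v)))
print(lut_activation(np.array([-1.0, 0.0, 2.5]), xs, ys))
```

Storing the table in the bank cell array lets one mechanism serve GELU, sigmoid, tanh, and other AFs simply by loading different table contents.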
As DNNs improve state-of-the-art accuracy in many artificial intelligence (AI) applications, such as computer vision processing for autonomous driving, the data bandwidth and power consumption between the neural network accelerator and the off-chip memory pose a big challenge to enhancing the compute performance metric TOPS/W. To overcome the limited compute and energy resources in the automobile environment, inference with PIM (processing in memory) or AiM (accelerator in memory), which deploys MAC (multiply-and-accumulate) units and activation functions inside the DRAM, is one of the key solutions, exploiting multi-bank parallelism and the memory cell architecture. When memory technology with analog logic inside matures in the near future, ultra-low-power analog-accelerator-based neuromorphic computing architectures will lead the future autonomous driving solution.
With the recent increasing interest in big data and artificial intelligence, there is an emerging demand for high-performance memory systems with large density and high data bandwidth. However, conventional DIMM-type memory has difficulty achieving more than 50 GB/s due to its limited pin count and signal integrity issues. High-bandwidth memory (HBM) DRAM, with TSV technology and wide I/Os, is a prominent solution to this problem, but it still has many limitations, including power consumption and reliability. This paper presents a power-efficient and reliable TSV structure and a cost-effective HBM DRAM core architecture.
In this paper, HBM DRAM with TSV technology is introduced. The paper covers general TSV features and techniques, such as TSV architecture, TSV reliability, TSV open/short test, and TSV repair. HBM DRAM, a representative DRAM product using TSVs, is then presented in detail, with emphasis on its use and features.
With the emergence of large language models (LLMs) and generative AI, which require an enormous number of model parameters, the required memory bandwidth and capacity for high-end systems are increasing at an unprecedented rate. To meet this need, we present an extended version of high-bandwidth memory 3 (HBM3) DRAM, HBM3E, which achieves a 1280-GB/s bandwidth with a cube density of 48 GB. New design schemes and features, such as all-around power through-silicon vias (TSVs), a 6-phase read-data-strobe (RDQS) scheme, a byte-mapping swap scheme, and a voltage-drift compensator for the write data strobe (WDQS), are implemented to achieve the extended bandwidth and capacity with enhanced reliability. The overall architecture and specifications, such as the bump map footprint, the number of channels and I/Os, and the operation voltage, are identical to the latest HBM3 [1], [2]; therefore, backward compatibility is provided, avoiding system modification.
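A quick consistency check of the bandwidth figure: with the standard 1024-bit HBM data interface (the abstract notes the I/O count is unchanged from HBM3), 1280 GB/s implies 10 Gb/s per pin.

```python
# Per-pin data rate implied by the quoted cube bandwidth. The 1024-bit
# data interface is the standard HBM I/O width, assumed unchanged here
# since the abstract states the I/O count matches HBM3.
io_pins   = 1024          # data I/Os per cube
bw_gbytes = 1280          # quoted cube bandwidth, GB/s

per_pin_gbps = bw_gbytes * 8 / io_pins
print(per_pin_gbps)       # 10.0 Gb/s per pin
```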
While the fast-growing big-data and cloud-computing markets are driving demand for server-oriented high-capacity memory, the high cost and high power consumption that come with such capacities can be major problems when building a server. In this situation, a high-capacity 512-GB managed DRAM solution (MDS), bigger than any DIMM currently available, can be a good alternative. Despite such a large capacity, MDS is price-competitive, yielding 26% more dies per wafer than conventional DRAM, and uses 12 W of power, similar to existing LRDIMM solutions of smaller density.
There is enormous demand for high-bandwidth DRAM in applications such as HPC, graphics, high-end servers, and artificial intelligence. HBM DRAM was developed [1] using advances in package technology: TSVs, microbumps, and silicon interposers. Owing to these advances, HBM has a much higher bandwidth, at a lower per-pin data rate, than conventional DRAM. However, the 3D-stack structure causes TSV-interface and PDN problems, namely TSV connection failures and 3D accumulation of IR drop, which increase the total cost of HBM. Moreover, as memory bandwidth increases, DRAM architectural challenges arise, and power consumption and the associated thermal problems increase as well.