Top-K sorting is a widely used technique for selecting the K largest or smallest numbers from input elements. In this paper, we present an efficient, low-power, and flexible top-K sorting architecture with cell gating on field-programmable gate arrays (FPGAs). Our architecture consists of a data filter unit, a cell counter, and L cascaded sorting cells: the filter unit allows users to select the user-defined data values to sort, the cell counter enables cell gating by counting the number of working cells, and the sorting cells perform sorting with low power and a flexible K length. The proposed sorting cells update incoming data continuously. In addition, cell gating is introduced to increase the flexibility of the top-K length and the energy efficiency by turning cells on and off. Our implementation consumes only 0.3 W with L = 128 at 200 MHz on a Xilinx XCKU115 FPGA. Overall, our work advances the state of the art in efficient and flexible top-K sorting on FPGAs.
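A behavioral sketch may help visualize the cascaded-cell idea described in the abstract: each active cell keeps the larger of its stored value and the incoming one and forwards the loser downstream, while the cells beyond K stay gated off. The Python model below is an illustrative reading, not the authors' RTL; names such as SortingCell and top_k_sort are invented here.

```python
# Behavioral sketch of an L-cell cascaded top-K sorter with cell gating:
# only the first K of the L cells are enabled, each cell keeps the larger
# of (stored, incoming) and passes the smaller value downstream.

class SortingCell:
    def __init__(self):
        self.value = None   # empty until the first datum arrives

    def step(self, incoming):
        """Keep the larger value, forward the smaller one."""
        if self.value is None or incoming > self.value:
            self.value, incoming = incoming, self.value
        return incoming

def top_k_sort(stream, k, num_cells=128):
    # num_cells plays the role of the paper's L
    assert k <= num_cells, "cell gating can enable at most L cells"
    cells = [SortingCell() for _ in range(k)]  # remaining cells stay gated off
    for x in stream:
        for cell in cells:                     # one cascade pass per datum
            x = cell.step(x)
            if x is None:
                break
    return [c.value for c in cells if c.value is not None]

print(top_k_sort([7, 3, 9, 1, 12, 5, 8], k=4))  # -> [12, 9, 8, 7]
```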
This paper introduces a novel solution utilizing a 1D convolutional neural network (CNN) with optimizations such as adaptive max-pooling, point-wise convolution, and a multi-objective gradient reversal layer (GRL) to address real-time ventricular arrhythmia (VA) detection challenges in implantable cardioverter defibrillators (ICDs). The proposed model achieves exceptional accuracy in discerning VAs and non-VAs from single-channel intracardiac electrogram (IEGM) signals, with an F-beta score of 0.99265, a generalization score of 0.9375, a memory footprint of 24.332 KiB, and an inference latency of 2.593 ms. Compared to the top models from the 2022 TinyML Design Contest, the proposed method demonstrates superior detection accuracy and generalization performance while maintaining competitive inference latency and memory usage.
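The named building blocks compose straightforwardly in PyTorch. The sketch below wires a point-wise (kernel-size-1) convolution, adaptive max-pooling, and a gradient reversal layer into a small 1D CNN; all layer widths, the input window length, and the two-head layout are assumptions for illustration, not the paper's published configuration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed, scaled gradient in backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class VADetector(nn.Module):
    def __init__(self, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=1),          # point-wise convolution
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(8),                  # fixed-length output
            nn.Flatten(),
        )
        self.classifier = nn.Linear(16 * 8, 2)        # VA vs. non-VA
        self.domain_head = nn.Linear(16 * 8, 2)       # adversarial head fed via GRL

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h), self.domain_head(GradReverse.apply(h, self.lambd))

model = VADetector()
logits, dom = model(torch.randn(4, 1, 1250))  # batch of single-channel IEGM windows
```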
In this article, a 1.25-V 8-Gb 16-Gb/s/pin GDDR6-based accelerator-in-memory (AiM) is presented. A dedicated command (CMD) set for deep learning (DL) is introduced to minimize latency when switching operation modes, and a bank-wide mantissa shift (BWMS) scheme is adopted to minimize calculation delay, current consumption, and circuit area during multiply-accumulate (MAC) operations. By storing a lookup table (LUT) in reserved word lines of the dynamic random access memory (DRAM) bank cells, various activation functions (AFs) can be supported, such as the Gaussian error linear unit (GELU), sigmoid, and tanh, as well as the rectified linear unit (ReLU) and leaky ReLU. Performance was evaluated by measuring the fabricated chip on automated test equipment (ATE) and in a self-manufactured field-programmable gate array (FPGA)-based system. In the ATE-level evaluation, the chip operates at 16 Gb/s down to a supply voltage as low as 1.10 V. When evaluated with GEMV and MNIST workloads in the FPGA-based system, performance gains of 7.5-10.5 times were confirmed compared to HBM2-based or GDDR6-based systems.
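The LUT-based activation idea can be modeled in a few lines: the activation function is precomputed over a quantized input range and stored as a table (in AiM, in reserved DRAM word lines), so applying it reduces to an index lookup instead of a transcendental evaluation. The table size and input range below are assumptions; the GELU formula is the standard tanh approximation.

```python
import numpy as np

# Precompute the activation over a quantized input range (illustrative sizes).
IN_MIN, IN_MAX, ENTRIES = -8.0, 8.0, 1024
grid = np.linspace(IN_MIN, IN_MAX, ENTRIES)
gelu_lut = 0.5 * grid * (1.0 + np.tanh(np.sqrt(2 / np.pi)
                                       * (grid + 0.044715 * grid**3)))

def lut_activation(x, lut):
    """Map x to the nearest table index and read back the stored value."""
    idx = np.clip(np.round((x - IN_MIN) / (IN_MAX - IN_MIN) * (ENTRIES - 1)),
                  0, ENTRIES - 1).astype(int)
    return lut[idx]

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(lut_activation(x, gelu_lut))  # matches exact GELU within table resolution
```

Swapping in sigmoid, tanh, or leaky ReLU only changes the precomputed table, which is the point of keeping the AF in memory rather than in logic.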
This poster presents the system architecture, software stack, and performance analysis of SK hynix's first GDDR6-based processing-in-memory (PIM) product sample, called Accelerator-in-Memory (AiM). AiM is designed for in-memory acceleration of matrix-vector product operations, which are common in machine learning applications. The strength of AiM comes primarily from two design factors: 1) all-bank operation support and 2) an extended DRAM command set. All-bank operations allow AiM to fully utilize the abundant internal DRAM bandwidth, making it an attractive solution for memory-bound applications. The extended command set allows the host to issue these new operations efficiently and provides a clean separation of concerns between the AiM architecture and its software stack design. We present a dedicated FPGA-based reference platform with a software stack, which is used to validate the AiM design and evaluate its system-level performance. We also demonstrate FMC-based AiM extension cards that are compatible with off-the-shelf FPGA boards and serve as an open research platform, allowing potential collaborators and academic institutes to access our hardware and software systems.
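To make the extended-command-set factor concrete, the sketch below shows how a host-side stack might encode such commands and emit a GEMV sequence built on all-bank MACs; the opcodes, fields, and flow are invented for illustration and are not SK hynix's actual encoding.

```python
from enum import Enum, auto
from dataclasses import dataclass

class AimCmd(Enum):
    WR_GB   = auto()   # hypothetical: write the input vector to a global buffer
    MAC_ABK = auto()   # hypothetical: multiply-accumulate across all banks at once
    RD_MAC  = auto()   # hypothetical: read back the accumulated partial sums

@dataclass
class Command:
    op: AimCmd
    row: int = 0       # which matrix-row tile the all-bank MAC targets

def gemv_command_stream(n_row_tiles):
    """One all-bank MAC per row tile, bracketed by a vector write and a readback."""
    cmds = [Command(AimCmd.WR_GB)]
    cmds += [Command(AimCmd.MAC_ABK, row=r) for r in range(n_row_tiles)]
    cmds.append(Command(AimCmd.RD_MAC))
    return cmds

print([c.op.name for c in gemv_command_stream(3)])
# -> ['WR_GB', 'MAC_ABK', 'MAC_ABK', 'MAC_ABK', 'RD_MAC']
```

The separation of concerns the poster mentions falls out naturally: the software stack only composes command streams like this one, while the architecture decides how each command drives the banks.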
IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System. Seo, Minseok; Nguyen, Xuan Truong; Hwang, Seok Joong; et al.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Volume 3
04/2024
Conference Proceeding
Open Access
Accelerating end-to-end inference of transformer-based large language models (LLMs) is a critical component of AI services in datacenters. However, the diverse compute characteristics of LLMs' end-to-end inference present challenges, as previously proposed accelerators only address certain operations or stages (e.g., self-attention, the generation stage, etc.). To address the unique challenges of accelerating end-to-end inference, we propose IANUS - Integrated Accelerator based on NPU-PIM Unified Memory System. IANUS is a domain-specific system architecture that combines a Neural Processing Unit (NPU) with Processing-in-Memory (PIM) to leverage both the NPU's high computation throughput and the PIM's high effective memory bandwidth. In particular, IANUS employs a unified main memory system where the PIM memory is used both for PIM operations and as the NPU's main memory. The unified main memory system ensures that memory capacity is efficiently utilized and that the movement of shared data between the NPU and the PIM is minimized. However, it introduces new challenges, since normal memory accesses and PIM computations cannot be performed simultaneously. Thus, we propose a novel PIM Access Scheduling scheme that manages not only the scheduling of normal memory accesses and PIM computations but also workload mapping across the PIM and the NPU. Our detailed simulation evaluations show that IANUS improves the performance of GPT-2 by 6.2x and 3.2x, on average, compared to the NVIDIA A100 GPU and the state-of-the-art accelerator, respectively. As a proof of concept, we develop a prototype of IANUS with a commercial PIM, an NPU, and an FPGA-based PIM controller to demonstrate the feasibility of IANUS.
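The core constraint behind PIM Access Scheduling can be captured in a toy model: the unified memory serves either a normal NPU access or a PIM computation at any instant, never both, and each mode switch costs time, so batching same-kind requests shortens the schedule. The greedy policy below is an illustrative stand-in, not the scheduler proposed in the paper.

```python
def pim_access_schedule(requests, switch_penalty=1):
    """requests: (kind, duration) pairs with kind in {'normal', 'pim'}.
    The memory is held exclusively per request; a penalty is paid on
    every switch between normal-access mode and PIM-compute mode."""
    t, prev, timeline = 0, None, []
    # Greedily batch same-kind requests to minimize mode switches
    # (assumes the requests have no ordering dependencies).
    for kind, duration in sorted(requests, key=lambda r: r[0]):
        if prev is not None and kind != prev:
            t += switch_penalty
        timeline.append((t, kind))
        t += duration
        prev = kind
    return timeline, t

timeline, makespan = pim_access_schedule(
    [("pim", 4), ("normal", 2), ("pim", 4), ("normal", 2)])
print(timeline, makespan)  # one mode switch instead of three
```

In the real system the scheduler must also respect data dependencies and decide which layers map to the NPU versus the PIM, which is why the paper treats scheduling and workload mapping together.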