In limited-resource edge computing circumstances such as on mobile devices, IoT devices, and electric vehicles, the energy-efficient optimized convolutional neural network (CNN) accelerator ...implemented on mobile Field Programmable Gate Array (FPGA) is becoming more attractive due to its high accuracy and scalability. In recent days, mobile FPGAs such as the Xilinx PYNQ-Z1/Z2 and Ultra96, definitely have the advantage of scalability and flexibility for the implementation of deep learning algorithm-based object detection applications. It is also suitable for battery-powered systems, especially for drones and electric vehicles, to achieve energy efficiency in terms of power consumption and size aspect. However, it has the low and limited performance to achieve real-time processing. In this article, optimizing the accelerator design flow in the register-transfer level (RTL) will be introduced to achieve fast programming speed by applying low-power techniques on FPGA accelerator implementation. In general, most accelerator optimization techniques are conducted on the system level on the FPGA. In this article, we propose the reconfigurable accelerator design for a CNN-based object detection system on the register-transfer level on mobile FPGA. Furthermore, we present RTL optimization design techniques that will be applied such as various types of clock gating techniques to eliminate residual signals and to deactivate the unnecessarily active block. Based on the analysis of the CNN-based object detection architecture, we analyze and classify the common computing operation components from the Convolutional Neuron Network, such as multipliers and adders. We implement a multiplier/adder unit to a universal computing unit and modularize it to be suitable for a hierarchical structure of RTL code. The proposed system design was tested with Resnet-20 which has 23 layers and it was trained with the dataset, CIFAR-10 which provides a test set of 10,000 images in several formats, and the weight data we used for this experiment was provided from Tensil. Experimental results show that the proposed design process improves the power efficient consumption, hardware utilization, and throughput by 16%, up to 58%, and 15%, respectively.
Automation of vehicles requires a secure, reliable, and real-time on-chip system. These systems also require very high-speed and compact on-chip analog to digital converters (ADC). The conventional ...ADC cannot fulfill this requirement. In this paper, we proposed a Darlington pair- and source biasing-based high speed, secure, and reliable voltage to time converter (VTC). It is a compact, high-speed design and gives high conversion gain. The source biasing also helps to increase the input voltage range. The conversion gain of the proposed circuit is 101.43ns/v, which is 52 times greater than the gain achieved by state-of-the-art design. It also shows less effect of process variation and bias temperature instability.
Convolutional neural networks (CNNs) based deep learning algorithms require high data flow and computational intensity. For real-time industrial applications, they need to overcome challenges such as ...high data bandwidth requirement and power consumption on hardware platforms. In this work, we have analyzed in detail the data dependency in the CNN accelerator and propose specific pipelined operations and data organized manner to design a high throughput CNN accelerator on FPGA. Besides, we have optimized the kernel operations to obtain a high power efficiency. The proposed CNN accelerator supports image classification and real-time object detection with high accuracy. The evaluation results show that our CNN-based FPGA accelerator can achieve 740 Giga operations per second (GOPS) at 200 MHz with kernel power of 12.2 watts on Intel Arria 10 FPGA. For object detection tasks, our system can achieve 105 fps with 56.5 mAP or 25 fps with 73.6 mAP on VOC dataset. Since we use the mixed fixed-point data representation, the detection accuracy is comparable with the GPU-based YOLO V2 framework. The power efficiency of our system is <inline-formula> <tex-math notation="LaTeX">\sim 3.3 \times </tex-math></inline-formula> better than Titan X GPU and <inline-formula> <tex-math notation="LaTeX">\sim 418 \times </tex-math></inline-formula> better than Intel E5-2620 V4 CPU.
Deep neural networks (DNNs) and Convolutional neural networks (CNNs) have improved accuracy in many Artificial Intelligence (AI) applications. Some of these applications are recognition and detection ...tasks, such as speech recognition, facial recognition and object detection. On the other hand, CNN computation requires complex arithmetic and a lot of memory access time; thus, designing new hardware that would increase the efficiency and throughput without increasing the hardware cost is much more critical. This area in hardware design is very active and will continue to be in the near future. In this paper, we propose a novel 8T XNOR-SRAM design for Binary/Ternary DNNs (TBNs) directly supporting the XNOR-Network and the TBN DNNs. The proposed SRAM Computing-in-Memory (CIM) can operate in two modes, the first of which is the conventional 6T SRAM, and the second is the XNOR mode. By adding two extra transistors to the conventional 6T structure, our proposed CIM showed an improvement up to 98% for power consumption and 90% for delay compared to the existing state-of-the-art XNOR-CIM.
As AI models grow in complexity to enhance accuracy, supporting hardware encounters challenges such as heightened power consumption and diminished processing speed due to high throughput demands. ...Compute-in-memory (CIM) technology emerges as a promising solution. Furthermore, carbon nanotube field-effect transistors (CNFETs) show significant potential in bolstering CIM technology. Despite advancements in silicon semiconductor technology, CNFETs pose as formidable competitors, offering advantages in reliability, performance, and power efficiency. This is particularly pertinent given the ongoing challenges posed by the reduction in silicon feature size. We proposed an ultra-low-power architecture leveraging CNFETs for Binary Neural Networks (BNNs), featuring an advanced state-of-the-art 8T SRAM bit cell and CNFET model to optimize performance in intricate AI computations. Through meticulous optimization, we fine-tune the CNFET model by adjusting tube counts and chiral vectors, as well as optimizing transistor ratios for SRAM transistors and nanotube diameters. SPICE simulation in 32 nm CNFET technology facilitates the determination of optimal transistor ratios and chiral vectors across various nanotube diameters under a 0.9 V supply voltage. Comparative analysis with conventional FinFET-based CIM structures underscores the superior performance of our CNFET SRAM-based CIM design, boasting a 99% reduction in power consumption and a 91.2% decrease in delay compared to state-of-the-art designs.
In the implementation process of a convolution neural network (CNN)-based object detection system, the primary issues are power dissipation and limited throughput. Even though we utilize ultra-low ...power dissipation devices, the dynamic power dissipation issue will be difficult to resolve. During the operation of the CNN algorithm, there are several factors such as the heating problem generated from the massive computational complexity, the bottleneck generated in data transformation and by the limited bandwidth, and the power dissipation generated from redundant data access. This article proposes the low-power techniques, applies them to the CNN accelerator on the FPGA and ASIC design flow, and evaluates them on the Xilinx ZCU-102 FPGA SoC hardware platform and 45 nm technology for ASIC, respectively. Our proposed low-power techniques are applied at the register-transfer-level (RT-level), targeting FPGA and ASIC. In this article, we achieve up to a 53.21% power reduction in the ASIC implementation and saved 32.72% of the dynamic power dissipation in the FPGA implementation. This shows that our RTL low-power schemes have a powerful possibility of dynamic power reduction when applied to the FPGA design flow and ASIC design flow for the implementation of the CNN-based object detection system.
We propose a novel ultra-low-power, voltage-based compute-in-memory (CIM) design with a new single-ended 8T SRAM bit cell structure. Since the proposed SRAM bit cell uses a single bitline for CIM ...calculation with decoupled read and write operations, it supports a much higher energy efficiency. In addition, to separate read and write operations, the stack structure of the read unit minimizes leakage power consumption. Moreover, the proposed bit cell structure provides better read and write stability due to the isolated read path, write path and greater pull-up ratio. Compared to the state-of-the-art SRAM-CIM, our proposed SRAM-CIM does not require extra transistors for CIM vector-matrix multiplication. We implemented a 16 k (128 × 128) bit cell array for the computation of 128× neurons, and used 64× binary inputs (0 or 1) and 64 × 128 binary weights (−1 or +1) values for the binary neural networks (BNNs). Each row of the bit cell array corresponding to a single neuron consists of a total of 128 cells, 64× cells for dot-product and 64× replicas cells for ADC reference. Additionally, 64× replica cells consist of 32× cells for ADC reference and 32× cells for offset calibration. We used a row-by-row ADC for the quantized outputs of each neuron, which supports 1–7 bits of output for each neuron. The ADC uses the sweeping method using 32× duplicate bit cells, and the sweep cycle is set to 2N−1+1, where N is the number of output bits. The simulation is performed at room temperature (27 °C) using 45 nm technology via Synopsys Hspice, and all transistors in bitcells use the minimum size considering the area, power, and speed. The proposed SRAM-CIM has reduced power consumption for vector-matrix multiplication by 99.96% compared to the existing state-of-the-art SRAM-CIM. Furthermore, because of the decoupled reading unit from an internal node of latch, there is no feedback from the reading unit, with read static noise, and margin-free results.
The machine learning and convolutional neural network (CNN)-based intelligent artificial accelerator needs significant parallel data processing from the cache memory. The separate read port is mostly ...used to design built-in computational memory (CRAM) to reduce the data processing bottleneck. This memory uses multi-port reading and writing operations, which reduces stability and reliability. In this paper, we proposed a self-adaptive 12T SRAM cell to increase the read stability for multi-port operation. The self-adaptive technique increases stability and reliability. We increased the read stability by refreshing the storing node in the read mode of operation. The proposed technique also prevents the bit-interleaving problem. Further, we offered a butterfly-inspired SRAM bank to increase the performance and reduce the power dissipation. The proposed SRAM saves 12% more total power than the state-of-the-art 12T SRAM cell-based SRAM. We improve the write performance by 28.15% compared with the state-of-the-art 12T SRAM design. The total area overhead of the proposed architecture compared to the conventional 6T SRAM cell-based SRAM is only 1.9 times larger than the 6T SRAM cell.
Voltage-to-time and current-to-time converters have been used in many recent works as a voltage-to-digital converter for artificial intelligence applications. In general, most of the previous designs ...use the current-starved technique or a capacitor-based delay unit, which is non-linear, expensive, and requires a large area. In this paper, we propose a highly linear current-to-digital converter. An optimization method is also proposed to generate the optimal converter design containing the smallest number of PMOS and sensitive circuits such as a differential amplifier. This enabled our design to be more stable and robust toward negative bias temperature instability (NBTI) and process variation. The proposed converter circuit implements the point-wise conversion from current-to-time, and it can be used directly for a variety of applications, such as analog-to-digital converters (ADC), used in built-in computational random access (C-RAM) memory. The conversion gain of the proposed circuit is 3.86 ms/A, which is 52 times greater than the conversion gains of state-of-the-art designs. Further, various time-to-digital converter (TDC) circuits are reviewed for the proposed current-to-time converter, and we recommend one circuit for a complete ADC design.
This article addresses the growing need in resource-constrained edge computing scenarios for energy-efficient convolutional neural network (CNN) accelerators on mobile Field-Programmable Gate Array ...(FPGA) systems. In particular, we concentrate on register transfer level (RTL) design flow optimization to improve programming speed and power efficiency. We present a re-configurable accelerator design optimized for CNN-based object-detection applications, especially suitable for mobile FPGA platforms like the Xilinx PYNQ-Z2. By not only optimizing the MAC module using Enhanced clock gating (ECG), the accelerator can also use low-power techniques such as Local explicit clock gating (LECG) and Local explicit clock enable (LECE) in memory modules to efficiently minimize data access and memory utilization. The evaluation using ResNet-20 trained on the CIFAR-10 dataset demonstrated significant improvements in power efficiency consumption (up to 22%) and performance. The findings highlight the importance of using different optimization techniques across multiple hardware modules to achieve better results in real-world applications.