Bit-width allocation has a crucial impact on the hardware efficiency and accuracy of fixed-point arithmetic circuits. This paper introduces a new accuracy-guaranteed word-length optimization approach for feed-forward fixed-point designs. The method uses affine arithmetic, a well-known analytical technique, for both range and precision analyses. The paper introduces an acceleration technique and two new semi-analytical algorithms for precision analysis. While the first algorithm follows a progressive search strategy, the second uses a tree-shaped search method for fractional width optimization. The algorithms offer two different time-complexity/cost-efficiency tradeoffs. The first algorithm has polynomial complexity and achieves results comparable to those of existing heuristic approaches. The second algorithm has exponential complexity, but it achieves near-optimal results compared with exhaustive search. A commonly used set of case studies is used to evaluate the efficiency of the proposed techniques and algorithms in terms of optimization time and hardware cost. The first and second algorithms achieve 10.9% and 13.1% improvements in area, respectively, over uniform fractional width allocation. The proposed acceleration technique reduces the complexity of the fractional width selection problem by an average of 20.3%.
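As a hedged illustration of the range-analysis side of affine arithmetic (not the paper's precision-analysis algorithms, which are not reproduced here), the following minimal Python sketch propagates affine forms through addition and multiplication and extracts guaranteed bounds:

```python
class Affine:
    """Minimal affine-arithmetic form x = x0 + sum(xi * eps_i), eps_i in [-1, 1].
    Illustrative sketch only; real WLO tools track many more operations."""
    _next = 0  # fresh noise-symbol counter

    def __init__(self, center, terms=None):
        self.center = center            # x0
        self.terms = dict(terms or {})  # noise-symbol id -> coefficient

    @classmethod
    def from_interval(cls, lo, hi):
        cls._next += 1
        return cls((lo + hi) / 2, {cls._next: (hi - lo) / 2})

    def __add__(self, other):
        t = dict(self.terms)
        for k, v in other.terms.items():
            t[k] = t.get(k, 0.0) + v
        return Affine(self.center + other.center, t)

    def __mul__(self, other):
        # The linear part is exact; the nonlinear residue is over-approximated
        # by a fresh noise symbol (standard conservative AA multiplication).
        t = {k: other.center * v for k, v in self.terms.items()}
        for k, v in other.terms.items():
            t[k] = t.get(k, 0.0) + self.center * v
        Affine._next += 1
        rad_s = sum(abs(v) for v in self.terms.values())
        rad_o = sum(abs(v) for v in other.terms.values())
        t[Affine._next] = rad_s * rad_o
        return Affine(self.center * other.center, t)

    def bounds(self):
        r = sum(abs(v) for v in self.terms.values())
        return self.center - r, self.center + r

x = Affine.from_interval(-1.0, 1.0)
y = Affine.from_interval(0.0, 2.0)
lo, hi = (x * y + x).bounds()   # guaranteed enclosure of x*y + x
```

Because the two occurrences of `x` share the same noise symbol, affine forms track correlations between quantities, which is what lets affine arithmetic give tighter enclosures than plain interval arithmetic on correlated expressions.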
This letter introduces an original and highly efficient method to implement high-capacity content-addressable memories on field programmable gate arrays (FPGAs). The method includes a new hardware architecture and an optimization technique to determine crucial design parameters. The memory contents are partially synthesized and implemented on FPGA logic fabrics. The proposed architecture offers high-throughput and fixed-latency searches. Experimental results show that the proposed method enables the implementation of an IPv4 forwarding table with over 520 K prefixes on a cost-effective AMD-Xilinx UltraScale+ FPGA, providing a lookup latency of less than 28 ns and a minimum throughput of 215 million lookups per second. The source code of this work is available on GitHub.
Multipliers are widely used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy of the proposed multiplier in INT8 precision. The paper also generalizes the proposed approximate multiplier idea to other datatypes, providing analyses and estimates of hardware cost and accuracy as functions of the multiplier parameters. Implementation results on an AMD-Xilinx Kintex UltraScale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulate configurations, respectively, when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying below 0.33%. The source code of this work is available on GitHub.
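The paper's multiplier design is specific to AMD-Xilinx LUT primitives. As a generic, hedged software illustration of the approximate-multiplication idea, the sketch below drops low-weight bits of each partial product; this is a common approximation strategy, not necessarily the one used in the paper:

```python
def approx_mul_u8(a, b, trunc=4):
    """Approximate 8x8-bit unsigned multiply: bits of each partial product
    whose weight falls below 2**trunc are discarded, trading a small,
    bounded error for cheaper accumulation hardware. Illustrative only;
    the paper's LUT-based design differs in detail."""
    acc = 0
    for i in range(8):
        if (b >> i) & 1:
            pp = a << i                       # partial product for bit i of b
            acc += pp & ~((1 << trunc) - 1)   # drop the low-order bits
    return acc

# With trunc=0 no bits are dropped, so the result is exact;
# increasing trunc increases the (always non-positive) error.
approx = approx_mul_u8(213, 119)       # compare against 213 * 119
```

In hardware, discarding low-weight partial-product bits shrinks the adder tree; the error is one-sided (the approximate product never exceeds the exact one), which is often easy to compensate for in post-training deployment.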
Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated on the convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3×3 and 1×1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing the convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned.
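A simplified, illustrative model of PE utilization (the paper's dataflows are layer-shape-aware and considerably more sophisticated): when a layer's work is tiled onto a fixed PE array, under-filled edge tiles are what drag utilization below 100%.

```python
import math

def pe_utilization(n_rows_work, n_cols_work, pe_rows, pe_cols):
    """Fraction of PE slots doing useful work when n_rows_work x n_cols_work
    independent operations (e.g., output pixels x output channels) are tiled
    onto a pe_rows x pe_cols array. A toy model, not the paper's analysis."""
    tiles = math.ceil(n_rows_work / pe_rows) * math.ceil(n_cols_work / pe_cols)
    return (n_rows_work * n_cols_work) / (tiles * pe_rows * pe_cols)

# Hypothetical layer shape: 56 output rows of work, 64 output channels,
# mapped onto a 16x16 PE array.
u = pe_utilization(56, 64, 16, 16)
```

The model makes the accelerator-design tension concrete: a single fixed mapping that fills the array for one layer shape leaves it under-filled for another, which is why the abstract's multiple optimized dataflows matter for layer diversity.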
This paper presents a new approach for implementing high-capacity content-addressable memories on field programmable gate arrays (FPGAs). The approach introduces a novel configurable hardware architecture complemented by a multi-objective heuristic optimization algorithm. This algorithm explores the design space and identifies near-optimal configuration parameter values for a given search table content. In this approach, the matching operation is carried out partly by synthesized circuits within the FPGA logic fabric and partly by an SRAM-based bitmapping technique. The balance between logic and memory resource utilization can be adjusted to accommodate constraints and design priorities. The approach supports large search tables, offers high-throughput and short-latency searches, and can be reconfigured to adapt to new matching rules. This adaptability makes it particularly well-suited for IP address lookup in SDN-enabled data planes. Experimental results demonstrate the effectiveness of this method. It enables the implementation of an IPv4 forwarding table with more than 520,000 prefixes on a cost-effective AMD-Xilinx UltraScale+ FPGA. This implementation delivers a lookup latency of under 26 ns and a throughput of over 235 million lookups per second. The source code for this work is accessible on GitHub.
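The architecture itself is hardware-specific, but the function it implements for IP lookup is longest-prefix match. A small software reference of the lookup semantics (purely functional; it says nothing about the paper's logic/SRAM partitioning):

```python
def lpm_lookup(table, addr):
    """Reference longest-prefix match over an IPv4 forwarding table given as
    {(prefix_int, prefix_len): next_hop}. Returns the next hop of the longest
    matching prefix, or None. Linear scan for clarity only."""
    best_len, best_hop = -1, None
    for (prefix, plen), hop in table.items():
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF if plen else 0
        if (addr & mask) == prefix and plen > best_len:
            best_len, best_hop = plen, hop
    return best_hop

def ip(s):
    """Dotted-quad string to 32-bit integer."""
    a, b, c, d = map(int, s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

# Tiny hypothetical table: a /16 route nested inside a /8 route.
table = {
    (ip('10.0.0.0'), 8):  'hop-A',
    (ip('10.1.0.0'), 16): 'hop-B',
}
assert lpm_lookup(table, ip('10.1.2.3')) == 'hop-B'   # the /16 beats the /8
```

A hardware CAM evaluates all entries in parallel with priority encoding, which is what makes the fixed-latency, high-throughput figures in the abstract achievable where this linear scan would not scale.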
The evolvable multiprocessor (EvoMP), a novel multiprocessor system-on-chip (MPSoC) with evolvable task decomposition and scheduling, claims low-cost and efficient fault tolerance as a major feature. Non-centralized control and adaptive distribution of the program among the available processors are two major capabilities of this platform, which remarkably help to achieve an efficient fault-tolerance scheme. This letter presents the operational as well as the architectural details of this fault-tolerance scheme. In this method, when a processor becomes faulty, it is excluded from program execution for the remaining run-time. The method also exploits the dynamic rescheduling capability of the system to achieve the maximum possible efficiency after a processor is removed. The results confirm the efficiency and remarkable advantages of the proposed approach over common redundancy-based techniques in similar systems.
Bloom filters (BFs) are widely utilised to speed up string matching in crucial network applications such as real-time intrusion detection and spam filtering. This study introduces a new approach to improve the efficiency of BFs for string matching. The approach splits each target string into two substrings and uses the second substring to program the BF. The objective is to minimise the false positive rate by maximising the common hash signatures from the second substring. Results show that, compared with the traditional use of BFs, the proposed approach reduces the false positive rate by averages of 76% and 88% for 32 and 64 Kb BFs, respectively. Moreover, a complete string-matching architecture has been developed in hardware based on the proposed approach. Results demonstrate the advantages of this new architecture compared with similar previous works.
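The split-and-program idea can be sketched in software. In this hedged Python illustration, each target string is split at its midpoint (an arbitrary choice here; the study optimizes the split to maximise common hash signatures) and only the second substring programs the filter:

```python
import hashlib

class Bloom:
    """Textbook Bloom filter over a single integer bit vector.
    Sketch only; hardware BFs use simpler hash functions than SHA-256."""
    def __init__(self, m_bits, k):
        self.m, self.k, self.bits = m_bits, k, 0

    def _hashes(self, s):
        for i in range(self.k):
            h = hashlib.sha256(f'{i}:{s}'.encode()).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, s):
        for h in self._hashes(s):
            self.bits |= 1 << h

    def query(self, s):
        return all(self.bits >> h & 1 for h in self._hashes(s))

# Program the filter with the *second* substring of each hypothetical target,
# splitting at the midpoint (the study chooses the split more carefully).
bf = Bloom(m_bits=1024, k=4)
targets = ['malicious-pattern', 'evil-payload']
for t in targets:
    bf.add(t[len(t) // 2:])

# At match time, a candidate is screened by querying its second substring;
# a positive hit would then trigger exact verification in the full design.
```

Programming with fewer, shared signatures keeps the bit vector sparser for a given target set, which is the mechanism behind the false-positive reductions reported above.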
When assessing individuals with alcohol use disorders, measurement of drinking can be a resource-intensive activity, particularly because many research studies report data for intervals ranging from 6 to 12 months prior to the interview. This study examined whether data from shorter assessment intervals are sufficiently representative of longer intervals to warrant the use of shorter intervals for clinical and research purposes. Participants were 825 problem drinkers (33.1% female) who were recruited through media advertisements to participate in a community-based mail intervention in Toronto, Canada. Participants' Timeline Followback (TLFB) reports of drinking were used to investigate the representativeness of different time windows for estimating annual drinking behavior. The findings suggest that for aggregated reports of drinking and with large samples (e.g., surveys), a 1-month window can be used to estimate annual consumption. For individual cases (e.g., clinical use) and smaller samples, a 3-month window is recommended. These results suggest that shorter time windows, which are more time- and resource-efficient, can be used with little to no loss in the accuracy of the data.
Application-specific customisation of microprocessor architectures has been widely accepted as an effective way to improve the efficiency of processor-based designs. In this work, the authors propose a new processor customisation method based on fixed-point word-length optimisation. Accuracy-aware word-length optimisation (WLO) of fixed-point circuits is an active research area with a large body of literature. For the first time, this work introduces a method to combine WLO with processor customisation. The data-type word-lengths, the size of the register files, and the architecture of the functional units are the main objectives to be optimised. The accuracy requirement, defined as a worst-case error bound, is the key constraint that must be met by any solution. A custom processor design environment, called PolyCuSP, is used to realise the processor architecture based on the solution found by the proposed optimisation algorithm. Results on five benchmarks show that this method can reduce the number of necessary LUTs and flip-flops by averages of 11.9% and 5.1%, respectively. The latency is also improved by an average of 33.4%. Moreover, the method was further examined through a case study on a JPEG decoder, which shows 16.2% and 56.2% reductions in area consumption and latency, respectively.
This paper presents a novel hardware framework for particle swarm optimization (PSO) of various kinds of discrete optimization problems based on the system-on-a-programmable-chip (SOPC) concept. PSO is a relatively new optimization algorithm with a growing field of applications. Nevertheless, like other evolutionary algorithms, PSO is generally a computationally intensive method that suffers from long execution times. Hence, it is difficult to use PSO in real-time applications in which reaching a proper solution in a limited time is essential. SOPC offers a platform to effectively design flexible systems with a high degree of complexity. A pipelined PSO (PPSO) core performs the required computational operations of the algorithm. Embedded processors are also employed to evaluate the fitness values by running programmed software code. The subparticle method brings the benefit of full scalability to the framework and makes it independent of the particle length; therefore, larger and more complex problems can be addressed without modifying the architecture of the framework. To speed up the computations, the optimization architecture is implemented on a single-chip master–slave multiprocessor structure. Moreover, the asynchronous model of PSO gains parallel efficacy and provides an approach to update particles continuously. Five benchmarks are used to evaluate the effectiveness and robustness of the system. The results indicate a speed-up of up to 98 times over the software implementation in elapsed computation time. In addition, the PPSO core has been employed for neural network training in an SOPC-based embedded system, which confirms the system's applicability to real-world applications.
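For reference, the velocity/position update that a hardware PSO core pipelines is the standard PSO recurrence. A plain software version for minimization (parameter values here are common illustrative defaults, not the paper's):

```python
import random

def pso(fitness, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, seed=0):
    """Synchronous software PSO (minimization). The velocity update blends
    inertia (w), attraction to the particle's own best (c1), and attraction
    to the swarm's global best (c2)."""
    rnd = random.Random(seed)
    x = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]
    pbest_f = [fitness(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            f = fitness(x[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = x[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = x[i][:], f
    return gbest, gbest_f

# Minimize the sphere function in 3 dimensions.
best, best_f = pso(lambda p: sum(t * t for t in p), dim=3)
```

The inner per-dimension update is what a pipelined core streams through its datapath; the asynchronous variant mentioned above updates the global best as soon as any particle improves, rather than once per full sweep.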