Bit-width allocation has a crucial impact on the hardware efficiency and accuracy of fixed-point arithmetic circuits. This paper introduces a new accuracy-guaranteed word-length optimization approach for feed-forward fixed-point designs. The method uses affine arithmetic, a well-known analytical technique, for both range and precision analyses. The paper also introduces an acceleration technique and two new semianalytical algorithms for precision analysis. The first algorithm follows a progressive search strategy, while the second uses a tree-shaped search method for fractional width optimization. The two algorithms offer different tradeoffs between time complexity and cost efficiency. The first algorithm has polynomial complexity and achieves results comparable to those of existing heuristic approaches. The second algorithm has exponential complexity but achieves near-optimal results relative to exhaustive search. A commonly used set of case studies is used to evaluate the proposed techniques and algorithms in terms of optimization time and hardware cost. The first and second algorithms achieve 10.9% and 13.1% area improvements, respectively, over uniform fractional width allocation. The proposed acceleration technique reduces the complexity of the fractional width selection problem by an average of 20.3%.
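To illustrate why affine arithmetic suits range analysis, here is a minimal sketch of affine forms (x = x0 + Σ xi·εi, each εi in [-1, 1]) and of deriving an integer bit width from the resulting range. Only range analysis is shown; the paper's precision analysis and its two search algorithms are not reproduced. The class `Affine` and the helper `integer_bits` are hypothetical names, and the multiplication uses the standard conservative bound for the nonlinear term.

```python
import itertools
import math
from dataclasses import dataclass, field

_fresh = itertools.count()  # source of new noise-symbol ids (hypothetical helper)


@dataclass
class Affine:
    """Affine form x = center + sum(coef * eps_k), each eps_k in [-1, 1]."""
    center: float
    terms: dict = field(default_factory=dict)  # noise-symbol id -> coefficient

    @staticmethod
    def from_range(lo, hi):
        # An input interval becomes a center plus one fresh noise symbol.
        return Affine((lo + hi) / 2, {next(_fresh): (hi - lo) / 2})

    def radius(self):
        return sum(abs(c) for c in self.terms.values())

    def bounds(self):
        r = self.radius()
        return self.center - r, self.center + r

    def __add__(self, other):
        terms = dict(self.terms)
        for k, c in other.terms.items():
            terms[k] = terms.get(k, 0.0) + c  # shared symbols combine (and may cancel)
        return Affine(self.center + other.center, terms)

    def __sub__(self, other):
        return self + Affine(-other.center, {k: -c for k, c in other.terms.items()})

    def __mul__(self, other):
        # Linear part x0*y_k + y0*x_k; the nonlinear remainder is bounded by
        # rad(x)*rad(y) and attached to a fresh noise symbol.
        terms = {k: other.center * c for k, c in self.terms.items()}
        for k, c in other.terms.items():
            terms[k] = terms.get(k, 0.0) + self.center * c
        terms[next(_fresh)] = self.radius() * other.radius()
        return Affine(self.center * other.center, terms)


def integer_bits(aff):
    """Two's-complement integer bits (sign included) so that [lo, hi] fits."""
    lo, hi = aff.bounds()
    m = max(abs(lo), abs(hi))
    if m == 0:
        return 1
    return max(1, math.floor(math.log2(m)) + 2)  # guarantees 2**(I-1) > m


x = Affine.from_range(-1, 1)
y = Affine.from_range(2, 4)
print((x - x).bounds())          # correlation is tracked, so x - x collapses to 0
print((x * y).bounds(), integer_bits(x * y))
```

Unlike interval arithmetic, which would report x - x as [-2, 2], the affine form keeps the shared noise symbol and cancels it exactly. This correlation tracking is what keeps range, and hence integer-width, estimates tight.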
Spacecraft pose estimation is an essential computer vision application that can improve the autonomy of in-orbit operations. An ESA/Stanford competition produced solutions that appear hardly compatible with the constraints of spacecraft onboard computers. URSONet is among the best entries in the competition thanks to its generalization capabilities, but at the cost of a tremendous number of parameters and high computational complexity. In this paper, we propose Mobile-URSONet, a spacecraft pose estimation convolutional neural network with 178 times fewer parameters than URSONet, whose accuracy degrades by no more than a factor of four compared to URSONet.
In the strongly regulated field of avionics engineering, conventional desktop graphics hardware and software application programming interfaces (APIs) cannot be used because they do not conform to avionics certification standards. We observe a need for better avionics graphics hardware, but system engineers lack system design tools for such hardware. Selecting an optimal hardware architecture by estimating the performance of graphics software before a stable rendering engine exists is a major challenge. As previous hardware emulation tools have shown, there is also potential to reduce development cost by giving developers an early estimate of the performance of their graphics engine in the development cycle. In this paper, we propose replacing expensive development platforms with predictive software running on a desktop computer. More precisely, we present a system design tool that helps predict the rendering performance of graphics hardware based on the OpenGL Safety Critical API. First, we use machine learning to create nonparametric models of the underlying hardware by analyzing the instantaneous frames per second (FPS) while rendering a synthetic 3D scene, drawn multiple times with various characteristics typically found in synthetic vision applications. The characteristic combinations used during this supervised training phase are a subset of all possible combinations, but performance predictions can be extrapolated to arbitrary combinations. To validate our models, we render an industrial scene with characteristic combinations not used during training and compare the predictions with the measured values. We find a median prediction error of less than 4 FPS.
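The abstract does not name the nonparametric model, so the sketch below assumes k-nearest-neighbour regression as one plausible choice. It predicts FPS for unseen characteristic combinations and reports the median absolute prediction error, mirroring the validation step described above. The two features (normalized triangle count and texture size), the function `synthetic_fps`, and all data values are hypothetical stand-ins for real FPS measurements.

```python
import math
import statistics


def knn_predict(train, query, k=3):
    """Predict FPS as the mean FPS of the k training samples nearest to `query`.

    Assumes features are already normalized to comparable ranges.
    """
    ranked = sorted(train, key=lambda sample: math.dist(sample[0], query))
    return statistics.fmean(fps for _, fps in ranked[:k])


def median_abs_error(train, test, k=3):
    """Median |predicted - measured| FPS over held-out characteristic combinations."""
    return statistics.median(abs(knn_predict(train, x, k) - fps) for x, fps in test)


# Hypothetical ground truth: (normalized triangle count, normalized texture size) -> FPS.
def synthetic_fps(tri, tex):
    return 120.0 - 60.0 * tri - 30.0 * tex


# Train on a grid of combinations; test on combinations absent from the grid.
train = [((t / 4, s / 4), synthetic_fps(t / 4, s / 4)) for t in range(5) for s in range(5)]
test = [((0.3, 0.6), synthetic_fps(0.3, 0.6)), ((0.7, 0.2), synthetic_fps(0.7, 0.2))]
print(median_abs_error(train, test))
```

A nonparametric model like this makes no assumption about how FPS depends on scene characteristics, which fits the goal of modeling hardware whose internals are unknown. The trade-off is that extrapolation far outside the training grid falls back on the nearest measured points.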
Optical Network-on-Chip (ONoC) architectures are emerging as promising candidates for solving congestion and latency issues in future embedded systems. In this work, we examine how a scalable, fully connected ONoC topology can be reduced to fit the specific connectivity requirements of heterogeneous 3D architectures. Such reduction makes it possible to decrease the number of required wavelengths, laser sources, photodetectors, and optical switches, as well as the length of the longest optical path. This relaxes the constraints on source wavelength accuracy and passive filter selectivity and also alleviates power and area issues by reducing the number of active devices. The proposed reduction method was successfully applied to multiple heterogeneous 3D architectures.
We investigate tradeoffs between traffic adaptation and VLSI architecture adaptation in C-RAN platforms. We propose a dynamic architectural scaling technique for interactive applications that require FFT computations. Our implementation results suggest that Datapath scaling benefits applications with up to 4.89x improvement in GOPS/mW, while Fabric scaling can benefit cloud computing applications with up to 181.97x improvement in GOPS/mW, compared to published methods. The improvements in network performance and energy efficiency were achieved at the cost of built-in flexibility in the proposed VLSI architectures.
This paper investigates VLSI architectures for digital signal processing (DSP) functions that are amenable to low-energy operation with scalable performance for H.265 High Efficiency Video Coding (HEVC) applications. First, we describe and experimentally evaluate a novel adaptive computing fabric. Second, we propose an energy-efficient method to scale the performance of the fabric for large images or to meet stringent real-time computation requirements. We propose a series of tradeoffs for efficiently exploiting the application space of general-purpose DSP acceleration. We show experimentally that the proposed computing fabric is reusable for filter, FFT, and DCT acceleration with scalable throughput. We report on the design and implementation of the fabric on a Xilinx FPGA device and show how regulated parallelism augmented with in-memory processing techniques affects performance and power efficiency. The FPGA prototype demonstrates a sustained throughput exceeding 10 Gbps irrespective of kernel and image size for H.265 HEVC applications.
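One plausible reason a single fabric can serve filters, FFT, and DCT is that all three reduce to multiply-accumulate (MAC) operations over different coefficient sets. The sketch below is a software analogy under that assumption, not a model of the paper's fabric, whose internal structure the abstract does not describe. One generic `mac_kernel` is reused with FIR, DCT-II, and DFT coefficient rows. The DFT is shown in direct matrix form; an FFT computes the same result with fewer operations. All function names are hypothetical.

```python
import cmath
import math


def mac_kernel(coeff_rows, samples):
    """Generic MAC engine: out[k] = sum_n coeff_rows[k][n] * samples[n]."""
    return [sum(c * x for c, x in zip(row, samples)) for row in coeff_rows]


def fir_rows(taps, n):
    """Rows making mac_kernel compute a 'valid' FIR convolution of an n-sample input."""
    L = len(taps)
    rows = []
    for i in range(n - L + 1):
        row = [0.0] * n
        for j, h in enumerate(taps):
            row[i + L - 1 - j] = h  # y[i] = sum_j h[j] * x[i + L - 1 - j]
        rows.append(row)
    return rows


def dct2_rows(n):
    """Unnormalized DCT-II basis rows: cos(pi / n * (m + 0.5) * k)."""
    return [[math.cos(math.pi / n * (m + 0.5) * k) for m in range(n)] for k in range(n)]


def dft_rows(n):
    """DFT basis rows exp(-2j*pi*k*m/n); an FFT factorizes this product."""
    return [[cmath.exp(-2j * math.pi * k * m / n) for m in range(n)] for k in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
print(mac_kernel(fir_rows([1.0, -1.0], 4), x))  # first-difference filter
print(mac_kernel(dct2_rows(4), x))
print(mac_kernel(dft_rows(4), x))
```

Only the coefficient rows change between kernels, while the accumulate datapath stays the same. That separation is the kind of reuse a reconfigurable DSP fabric can exploit.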
Application-specific customisation of microprocessor architectures has been widely accepted as an effective way to improve the efficiency of processor-based designs. In this work, the authors propose a new processor customisation method based on fixed-point word-length optimisation. Accuracy-aware word-length optimisation (WLO) of fixed-point circuits is an active research area with a large body of literature. This work introduces, for the first time, a method that combines WLO with processor customisation. The data type word-lengths, the size of the register files, and the architecture of the functional units are the main targets of the optimisation. The accuracy requirement, defined as a worst-case error bound, is the key constraint that any solution must meet. A custom processor design environment, called PolyCuSP, is used to realise the processor architecture based on the solution found by the proposed optimisation algorithm. Results on five benchmarks show that the method reduces the number of required LUTs and flip-flops by an average of 11.9% and 5.1%, respectively. Latency is also improved by an average of 33.4%. The method was further examined through a case study on a JPEG decoder, where the results suggest 16.2% and 56.2% reductions in area and latency, respectively.
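To make the worst-case error constraint concrete, here is a minimal sketch of a generic greedy word-length search. It is not the authors' algorithm. Each signal i truncated to f_i fractional bits is assumed to contribute at most g_i · 2^(-f_i) to the output error, where the gains g_i are hypothetical values that a range or error analysis would normally supply. Starting from generous widths, the search removes one fractional bit at a time while the summed bound stays within the requirement. Total fractional bits serve here as a stand-in for real hardware cost (LUTs, flip-flops).

```python
def error_bound(fracs, gains):
    """Worst-case output error: each truncated signal adds gain * 2**-f."""
    return sum(g * 2.0 ** -f for f, g in zip(fracs, gains))


def greedy_wlo(gains, bound, f_max=24):
    """Greedy max-1 descent on fractional widths under a worst-case error bound."""
    fracs = [f_max] * len(gains)
    if error_bound(fracs, gains) > bound:
        raise ValueError("error bound unreachable even at f_max fractional bits")
    while True:
        best = None  # (resulting error, index) of the cheapest feasible decrement
        for i in range(len(fracs)):
            if fracs[i] == 0:
                continue
            trial = fracs.copy()
            trial[i] -= 1
            err = error_bound(trial, gains)
            if err <= bound and (best is None or err < best[0]):
                best = (err, i)
        if best is None:
            return fracs  # no single bit can be removed without violating the bound
        fracs[best[1]] -= 1


# Hypothetical gains: signal 0 is amplified four times on its way to the output.
print(greedy_wlo([4.0, 1.0], bound=2.0 ** -4))
```

The greedy rule spends bits where the error gain is large and saves them elsewhere. This non-uniform allocation is what lets word-length optimisation beat a single uniform width.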
This article presents a pioneering approach to real-time spacecraft pose estimation using a mixed-precision quantized neural network implemented on the FPGA components of a commercially available Xilinx MPSoC, a device renowned for its suitability for space applications. Our co-design methodology includes a novel technique for evaluating the layer-wise sensitivity of the neural network to quantization, which facilitates an optimal balance between accuracy, latency, and FPGA resource utilization. Using the FINN library, we developed a bespoke FPGA dataflow accelerator that integrates on-chip weights and activation functions to minimize latency and energy consumption. Our implementation is 7.7 times faster and 19.5 times more energy-efficient than the best reported values in the existing spacecraft pose estimation literature. Furthermore, our contribution includes the first real-time, open-source implementation of such algorithms, a significant step toward making efficient spacecraft pose estimation algorithms widely accessible. The source code is available at https://github.com/possoj/FPGA-SpacePose.
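The abstract does not detail its novel sensitivity-evaluation technique, so the sketch below shows only a generic baseline probe. Each layer's weights are quantized in isolation, one at a time, and the resulting mean squared change in network output is measured. Symmetric per-tensor uniform quantization, the small ReLU network, and the random calibration inputs are all assumptions for illustration. FINN is not used.

```python
import random


def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels, returned dequantized."""
    assert bits >= 2, "need at least one positive level"
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]


def quantize_matrix(W, bits):
    flat = quantize([w for row in W for w in row], bits)  # per-tensor scale
    cols = len(W[0])
    return [flat[r * cols:(r + 1) * cols] for r in range(len(W))]


def forward(layers, x):
    for idx, W in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in W]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x


def layer_sensitivity(layers, inputs, bits):
    """For each layer i: mean squared output error when only layer i is quantized."""
    ref = [forward(layers, x) for x in inputs]
    scores = []
    for i in range(len(layers)):
        trial = list(layers)
        trial[i] = quantize_matrix(layers[i], bits)
        err, count = 0.0, 0
        for x, y_ref in zip(inputs, ref):
            for a, b in zip(forward(trial, x), y_ref):
                err += (a - b) ** 2
                count += 1
        scores.append(err / count)
    return scores


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical 3-layer network and calibration inputs.
layers = [rand_matrix(8, 4), rand_matrix(8, 8), rand_matrix(3, 8)]
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(16)]
for bits in (8, 4, 2):
    print(bits, [round(s, 6) for s in layer_sensitivity(layers, inputs, bits)])
```

A ranking from such a probe could guide a mixed-precision assignment: layers whose output error grows fastest as bits shrink keep wider weights, while the rest are quantized more aggressively to save FPGA resources.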
One of the key issues in ensuring high-quality designs is the verification methodology. The typical verification methodology for RTL design is based on the V diagram. In this article, we work at higher levels of abstraction (electronic system level, ESL), focusing on the performance verification process. A subsystem and its interconnected components are modeled with AADL, which provides constructs for modeling both software and hardware components. With the ESL virtual platform SpaceStudio™, we can rapidly estimate performance on different architectures. This performance verification flow was tested on a Motion-JPEG video decoder application for video thumbnails targeting a Xilinx Zynq-7000 platform.
To make software applications simpler to write and easier to maintain, a software digital signal-processing library that performs essential signal- and image-processing functions is an important part of every digital signal processor (DSP) developer's toolset. Such a library generally provides a high-level interface and mechanisms, so developers only need to know how to use the algorithms, not how they work. Complex signal transformations then become function calls, e.g., C-callable functions. Taking the two-dimensional (2-D) convolver function, which is of great significance for DSPs, as an example, this paper proposes replacing this software function with an emulation on a field-programmable gate array (FPGA) initially configured by software programming. Exploring the 2-D convolver's design space thus provides guidelines for developing a library of DSP-oriented hardware configurations intended to significantly speed up general DSP processors. Based on this convolver, and treating the operators supported in the library as hardware accelerators, the paper proposes a series of tradeoffs for efficiently exploiting the bandwidth between the general-purpose DSP and the accelerators. On the implementation side, the paper explores the performance and architectural tradeoffs involved in designing an FPGA-based 2-D convolution coprocessor for the TMS320C40 DSP microprocessor from Texas Instruments Incorporated. However, the proposed concept is not limited to a particular processor.
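For reference, the sketch below shows the kind of software 2-D convolution routine such a library would expose as a single function call, and which the paper proposes to offload to an FPGA coprocessor. It computes a "valid" convolution (kernel flipped, no padding). The name `convolve2d` and the pure-Python form are illustrative assumptions, not the TMS320C40 library's actual interface.

```python
def convolve2d(image, kernel):
    """'Valid' 2-D convolution: out[r][c] = sum_ij kernel[kh-1-i][kw-1-j] * image[r+i][c+j]."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # Flipping the kernel makes this a convolution, not a correlation.
                    acc += kernel[kh - 1 - i][kw - 1 - j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(convolve2d(image, [[1, 0], [0, -1]]))  # diagonal difference
```

The four nested loops perform (ih - kh + 1)(iw - kw + 1)·kh·kw multiply-accumulates. On a sequential DSP this cost grows quickly with image and kernel size, while an FPGA can evaluate many of those products in parallel. The bandwidth needed to stream pixels to such an accelerator is what the tradeoffs above address.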