In this paper, we study asymptotic properties of the maximum likelihood estimator (MLE) for the speed of a stochastic wave equation. We follow a well-known spectral approach to write the solution as ...a Fourier series, then we project the solution to a N-finite dimensional space and find the estimator as a function of the time and N. We then show consistency of the MLE using classical stochastic analysis. Afterward, we prove the asymptotic normality using the Malliavin-Stein method. We also study asymptotic properties of a discretized version of the MLE for the parameter. We provide this asymptotic analysis of the proposed estimator as the number of Fourier modes, N, used in the estimation and the observation time go to infinity. Finally, we illustrate the theoretical results with some numerical experiments.
The maturity level of RISC-V and the availability of domain-specific instruction set extensions, like vector processing, make RISC-V a good candidate for supporting the integration of specialized ...hardware in processor cores for the High Performance Computing (HPC) application domain. In this article,1 we present Vitruvius+, the vector processing acceleration engine that represents the core of vector instruction execution in the HPC challenge that comes within the EuroHPC initiative. It implements the RISC-V vector extension (RVV) 0.7.1 and can be easily connected to a scalar core using the Open Vector Interface standard. Vitruvius+ natively supports long vectors: 256 double precision floating-point elements in a single vector register. It is composed of a set of identical vector pipelines (lanes), each containing a slice of the Vector Register File and functional units (one integer, one floating point). The vector instruction execution scheme is hybrid in-order/out-of-order and is supported by register renaming and arithmetic/memory instruction decoupling. On a stand-alone synthesis, Vitruvius+ reaches a maximum frequency of 1.4 GHz in typical conditions (TT/0.80V/25°C) using GlobalFoundries 22FDX FD-SOI. The silicon implementation has a total area of 1.3 mm2 and maximum estimated power of ∼920 mW for one instance of Vitruvius+ equipped with eight vector lanes.
Database Management Systems (DBMS) have be-come an essential tool for industry and research and are often a significant component of data centers. There have been many efforts to accelerate DBMS ...application performance. One of the most explored techniques is the use of vector processing. Unfortunately, conventional vector architectures have not been able to exploit the full potential of DBMS acceleration.In this paper, we present VAQUERO, our Scratchpad-based Vector Accelerator for QUEry pROcessing. VAQUERO improves the efficiency of vector architectures for DBMS operations such as data aggregation and hash joins featuring lookup tables. Lookup tables are significant contributors to the performance bottlenecks in DBMS processing suffering from insufficient ISA support in the form of scatter-gather instructions. VAQUERO introduces a novel Advanced Scratchpad Memory specifically designed with two mapping modes - direct- and associative-mode. These map-ping modes enable VAQUERO to accelerate real-world databases with workload sizes that significantly exceed the scratchpad memory capacity. Additionally, the associative-mode allows to use VAQUERO with DBMS operators that use hashed keys, e.g. hash-join and hash-aggregate. VAQUERO has been designed considering general DBMS algorithm requirements instead of being based on a particular database organization. For this reason, VAQUERO is capable to accelerate DBMS operators for both row- and column-oriented databases.In this paper, we evaluate the efficiency of VAQUERO using two highly optimized popular open-source DBMS, namely the row-based PostgreSQL and column-based MonetDB. We imple-mented VAQUERO at the RTL level and prototype it, by performing Place&Route, at the 7nm technology node. VAQUERO incurs a modest 0.15% area overhead compared with an Intel Ice Lake processor. Our evaluation shows that VAQUERO significantly outperforms PostgreSQL and MonetDB by 2.09× and 3.32× respectively, when processing operators and queries from the TPC-H benchmark.
Sparse matrix operations are critical kernels in multiple application domains such as High Performance Computing, artificial intelligence and big data. Vector processing is widely used to improve ...performance on mathematical kernels with dense matrices. Unfortunately, existing vector architectures do not cope well with sparse matrix computations, achieving much lower performance in comparison with their dense counterparts.To overcome this limitation, we present the Vector Indexed Architecture (VIA), a novel hardware vector architecture that accelerates applications with irregular memory access patterns such as sparse matrix computations. There are two main bottlenecks when computing with sparse matrices: irregular memory accesses and index matching. VIA addresses these two bottlenecks with a smart scratchpad that is tightly coupled to the Vector Functional Units within the core.Thanks to this structure, VIA improves locality for sparse-dense computations and improves the index matching search process for sparse computations. As a result, VIA achieves significant performance speedup over highly optimized state-of-the-art C++ algebra libraries. On average, VIA outperforms sparse matrix vector multiplication, sparse matrix addition and sparse matrix matrix multiplication kernels by 4.22 ×, 6.14 × and 6.00 ×, respectively, when evaluated over a thousand sparse matrices that arise in real applications. In addition, we prove the generality of VIA by showing that it can accelerate histogram and stencil applications by 4.5 × and 3.5 ×, respectively.
Genome sequence analysis is fundamental to medical breakthroughs such as developing vaccines, enabling genome editing, and facilitating personalized medicine. The exponentially expanding sequencing ...datasets and complexity of sequencing algorithms necessitate performance enhancements. While the performance of software solutions is constrained by their underlying hardware platforms, the utility of fixed-function accelerators is restricted to only certain sequencing algorithms.This paper presents QUETZAL, the first general-purpose vector acceleration framework designed for high efficiency and broad applicability across a diverse set of genomics algorithms. While a commercial CPU's vector datapath is a promising candidate to exploit the data-level parallelism in genomics algorithms, our analysis finds that its performance is often limited due to long-latency scatter/gather memory instructions. QUETZAL introduces a hardware-software co-design comprising an accelerator microarchitecture closely integrated with the CPU's vector datapath, alongside novel vector instructions to fully capitalize on the proposed hardware. QUETZAL integrates a set of scratchpad-style buffers meticulously designed to minimize latency associated with scatter/gather instructions during the retrieval of input genome sequences data. QUETZAL supports both short and long reads, and different types of sequencing data formats. A combination of hardware and software techniques enables QUETZAL to reduce the latency of memory instructions, perform complex computation using a single instruction, and transform data representations at runtime, resulting in overall efficiency gain. QUETZAL significantly accelerates a vectorized CPU baseline on modern genome sequence analysis algorithms by 5.7×, while incurring a small area overhead of 1.4% post place-and-route at the 7nm technology node compared to an HPC ARM CPU.
This paper presents a deeply pipelined and massively parallel Binary Search Tree (BST) accelerator for Field Programmable Gate Arrays (FPGAs). Our design relies on the extremely parallel on-chip ...memory, or Block RAMs (BRAMs) architecture of FPGAs. To achieve significant throughput for the search operation on BST, we present several novel mechanisms including tree duplication as well as horizontal, duplicated, and hybrid (horizontal-vertical) tree partitioning. Also, we present efficient techniques to decrease the stalling rates that can occur during the parallel tree search. By combining these techniques and implementations on Xilinx Virtex-7 VC709 platform, we achieve up to 8X throughput improvement gain in comparison to the baseline implementation, i.e., a fully-pipelined FPGA-based accelerator.
This paper presents a deeply pipelined and massively parallel Binary Search Tree (BST) accelerator for Field Programmable Gate Arrays (FPGAs). Our design relies on the extremely parallel on-chip ...memory, or Block RAMs (BRAMs) architecture of FPGAs. To achieve significant throughput for the search operation on BST, we present several novel mechanisms including tree duplication as well as horizontal, duplicated, and hybrid (horizontal-vertical) tree partitioning. Also, we present efficient techniques to decrease the stalling rates that can occur during the parallel tree search. By combining these techniques and implementations on Xilinx Virtex-7 VC709 platform, we achieve up to 8X throughput improvement gain in comparison to the baseline implementation, i.e., a fully-pipelined FPGA-based accelerator.
This paper describes the Sargantana System on chip (SoC), a 64-bit RISC-V single core processor designed by a number of academic institutions and manufactured in 22 nm FDSOI technology: BSC, UPC, UB, ...UAB, CIC-IPN and IMB-CNM (CSIC). The SoC includes the processor as well as, among other components, a Phase Locked Loop (PLL) operating up to 2 GHz, interfaces to HyperRAM and a Serdes up to 8 Gbps. The processor has demonstrated experimental correct operation at 800 MHz.
This paper describes the design, verification, implementation and fabrication of the Drac Vector IN-Order (DVINO) processor, a RISC-V vector processor capable of booting Linux jointly developed by ...BSC, CIC-IPN, IMB-CNM (CSIC), and UPC. The DVINO processor includes an internally developed two-lane vector processor unit as well as a Phase Locked Loop (PLL) and an Analog-to-Digital Converter (ADC). The paper summarizes the design from architectural as well as logic synthesis and physical design in CMOS 65nm technology.
The design presented in this paper, called preDRAC, is a RISC-V general purpose processor capable of booting Linux jointly developed by BSC, CIC-IPN, IMB-CNM (CSIC), and UPC. The preDRAC processor is ...the first RISC-V processor designed and fabricated by a Spanish or Mexican academic institution, and will be the basis of future RISC-V designs jointly developed by these institutions. This paper summarizes the design tasks, for FPGA first and for SoC later, from high architectural level descriptions down to RTL and then going through logic synthesis and physical design to get the layout ready for its final tapeout in CMOS 65nm technology.