Arm usage has grown substantially in the High-Performance Computing (HPC) community. The Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022 and currently sits in fourth position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like the European Mont-Blanc and the U.S. DOE/NNSA Astra are further examples of Arm's irruption into HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have made computation a significant bottleneck. While most genomics applications have been thoroughly tested and optimized for x86 systems, only a few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extension (SVE).
This paper presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. Overall, the GenArchBench suite comprises 13 multi-core kernels from critical stages of widely used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Our benchmark suite includes two input data sets per kernel (small and large), each with a corresponding regression test to automatically verify the correctness of each execution. Moreover, the porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. We present the optimizations implemented in each kernel, along with a detailed performance evaluation and comparison across four different HPC machines (i.e., A64FX, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome). Overall, the experimental evaluation shows that Graviton3 outperforms the other machines on average. Moreover, we observed that the performance of the A64FX is significantly constrained by its small memory hierarchy and high memory latencies. Additionally, as a proof of concept, we study the performance of a production-ready tool that exploits two of the ported and optimized genomic kernels.
• GenArchBench is the first genomics benchmark suite targeting the Arm architecture.
• GenArchBench introduces SVE implementations of kernels that cannot be auto-vectorized.
• Arm SVE enhances code maintainability and binary portability across Arm processors.
• AWS Graviton3 performs better than Intel Xeon Skylake Platinum and AMD EPYC Rome.
• A64FX's limited memory hierarchy causes inefficiencies due to high memory latencies.
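To illustrate the vector-length-agnostic style of loop that SVE enables, here is a minimal sketch, not taken from GenArchBench, that emulates SVE-style per-lane predication in portable C. On real SVE hardware the predicated inner loop would map to a svwhilelt/svld1/svmla sequence whose width is set by the hardware; `VL` here is a hypothetical lane count chosen only for illustration.

```c
/* Sketch (not GenArchBench code): emulating an SVE-style predicated,
 * vector-length-agnostic AXPY loop in portable scalar C. */
#include <stddef.h>

#define VL 8  /* hypothetical lane count; on real SVE this is a hardware property */

void axpy(size_t n, const double *x, double a, double *y) {
    for (size_t i = 0; i < n; i += VL) {
        /* predicate: lanes with i+j < n are active, like svwhilelt_b64 */
        for (size_t j = 0; j < VL; j++) {
            if (i + j < n)            /* inactive tail lanes do nothing */
                y[i + j] += a * x[i + j];
        }
    }
}
```

Because the loop never assumes a fixed vector width, the same binary runs correctly on the 512-bit SVE units of the A64FX and the 256-bit units of Graviton3, which is the portability property the highlights above refer to.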
This paper presents a vector relation-centric algorithm for solving arithmetic word problems (AWPs), which uses vector relation acquisition and scene knowledge to ensure the performance of problem understanding and symbolic solving, respectively. The vector relation acquisition procedure builds on the synergy of the vector syntax-semantics method and a deep neural miner. Compared with the syntax-semantics method, the vector syntax-semantics method decreases not only the number of models but also semantic ambiguities and computational costs. For the scene knowledge, this paper proposes a scene-aware symbolic solver that infers relations obeying scene rules to decrease the occurrence of unwanted operations. Experimental results show that the proposed algorithm is superior to a high-performance baseline algorithm in both accuracy and computational cost. In accuracy, the proposed algorithm improves accuracy by 3.9% across the three scene subsets thanks to the use of scene knowledge and vector computing; as a result, it improves accuracy by 0.5% across six authoritative datasets. In computational cost, the proposed algorithm decreases the computing cost by more than 50%. Thus, this paper makes a significant contribution to developing instruments for solving AWPs.
Database Management Systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centers. There have been many efforts to accelerate DBMS application performance. One of the most explored techniques is the use of vector processing. Unfortunately, conventional vector architectures have not been able to exploit the full potential of DBMS acceleration.

In this paper, we present VAQUERO, our Scratchpad-based Vector Accelerator for QUEry pROcessing. VAQUERO improves the efficiency of vector architectures for DBMS operations such as data aggregation and hash joins featuring lookup tables. Lookup tables are significant contributors to the performance bottlenecks in DBMS processing, suffering from insufficient ISA support in the form of scatter-gather instructions. VAQUERO introduces a novel Advanced Scratchpad Memory specifically designed with two mapping modes: direct- and associative-mode. These mapping modes enable VAQUERO to accelerate real-world databases with workload sizes that significantly exceed the scratchpad memory capacity. Additionally, the associative mode allows VAQUERO to be used with DBMS operators that use hashed keys, e.g., hash-join and hash-aggregate. VAQUERO has been designed considering general DBMS algorithm requirements instead of being based on a particular database organization. For this reason, VAQUERO is capable of accelerating DBMS operators for both row- and column-oriented databases.

In this paper, we evaluate the efficiency of VAQUERO using two highly optimized, popular open-source DBMS, namely the row-based PostgreSQL and the column-based MonetDB. We implemented VAQUERO at the RTL level and prototyped it, by performing Place & Route, at the 7 nm technology node. VAQUERO incurs a modest 0.15% area overhead compared with an Intel Ice Lake processor.
Our evaluation shows that VAQUERO significantly outperforms PostgreSQL and MonetDB, by 2.09× and 3.32× respectively, when processing operators and queries from the TPC-H benchmark.
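The lookup-table bottleneck this abstract describes is visible in a scalar sketch of hash-aggregate (hypothetical code, not VAQUERO's implementation; the hash function and table size are illustrative): every input row performs a data-dependent read-modify-write on the table, which is exactly the gather/scatter pattern that conventional vector ISAs support poorly.

```c
/* Sketch (not VAQUERO's design): the gather/scatter pattern at the heart
 * of a hash-aggregate operator. Each row updates table[hash(key)], an
 * irregular read-modify-write that needs scatter-gather ISA support to
 * vectorize, and that a scratchpad can keep on-chip. */
#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 1024  /* illustrative, scratchpad-sized table */

static inline uint32_t hash_key(uint32_t k) {
    return (k * 2654435761u) % TABLE_SIZE;  /* Knuth multiplicative hash */
}

/* sum vals[] grouped by keys[] into table[] (collision handling omitted
 * for brevity; a real operator would chain or probe) */
void hash_aggregate(size_t n, const uint32_t *keys,
                    const int64_t *vals, int64_t *table) {
    for (size_t i = 0; i < n; i++) {
        /* gather table[slot], add, scatter back: data-dependent addressing */
        table[hash_key(keys[i])] += vals[i];
    }
}
```

Note how consecutive iterations may touch arbitrary, possibly conflicting table slots; this is why a plain SIMD gather/scatter cannot simply vectorize the loop and why an associative scratchpad mode helps.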
Sparse matrix operations are critical kernels in multiple application domains such as High-Performance Computing, artificial intelligence, and big data. Vector processing is widely used to improve performance on mathematical kernels with dense matrices. Unfortunately, existing vector architectures do not cope well with sparse matrix computations, achieving much lower performance in comparison with their dense counterparts.

To overcome this limitation, we present the Vector Indexed Architecture (VIA), a novel hardware vector architecture that accelerates applications with irregular memory access patterns such as sparse matrix computations. There are two main bottlenecks when computing with sparse matrices: irregular memory accesses and index matching. VIA addresses these two bottlenecks with a smart scratchpad that is tightly coupled to the Vector Functional Units within the core. Thanks to this structure, VIA improves locality for sparse-dense computations and improves the index-matching search process for sparse-sparse computations. As a result, VIA achieves significant performance speedups over highly optimized state-of-the-art C++ algebra libraries. On average, VIA accelerates sparse matrix-vector multiplication, sparse matrix addition, and sparse matrix-matrix multiplication kernels by 4.22×, 6.14×, and 6.00×, respectively, when evaluated over a thousand sparse matrices that arise in real applications. In addition, we prove the generality of VIA by showing that it can accelerate histogram and stencil applications by 4.5× and 3.5×, respectively.
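In software, the index-matching bottleneck mentioned above is a serial merge over sorted index lists. The following sketch (illustrative, not VIA's hardware mechanism) shows sparse vector addition as such a merge; this branchy, data-dependent search is what a hardware scratchpad lookup replaces.

```c
/* Sketch (not VIA hardware): the index-matching merge that dominates
 * sparse-sparse kernels. Inputs are two sparse vectors given as sorted
 * index arrays (ia, ib) with values (va, vb); output is their sum. */
#include <stddef.h>

size_t sparse_add(size_t na, const int *ia, const double *va,
                  size_t nb, const int *ib, const double *vb,
                  int *ic, double *vc) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (ia[i] == ib[j]) {            /* indices match: combine values */
            ic[k] = ia[i]; vc[k++] = va[i++] + vb[j++];
        } else if (ia[i] < ib[j]) {      /* copy the smaller index */
            ic[k] = ia[i]; vc[k++] = va[i++];
        } else {
            ic[k] = ib[j]; vc[k++] = vb[j++];
        }
    }
    while (i < na) { ic[k] = ia[i]; vc[k++] = va[i++]; }   /* drain a */
    while (j < nb) { ic[k] = ib[j]; vc[k++] = vb[j++]; }   /* drain b */
    return k;  /* number of nonzeros in the result */
}
```

Every iteration's control flow depends on a comparison of loaded indices, so the loop resists classic SIMD vectorization, which is the motivation for moving the search into hardware.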
The main idea is to create logic-free vector computing, using only read-write transactions on address memory. The strategic goal is to create deterministic vector-quantum computing that uses photons for read-write transactions on stable subatomic memory elements. The main task is to implement new vector computing models and methods based on primitive read-write transactions in vector flexible interpretive fault modeling and simulation technology, where data is used as addresses for processing the data itself. The essence of vector computing is read-write transactions on vector data structures in address memory. Vector computing is a computational process based on elementary read-write transactions over cells of binary vectors that are stored in address memory and form a functionality where the input data to be processed is the addresses of these cells. A metric-axiom of convolutional closure of cyclic distances between n objects into the 0-space is introduced. A model of xor-relations between the logical functions of digital circuits (\mathbf{and}\oplus \mathbf{or}\oplus \mathbf{xor}=0) is proposed; it is convoluted into zero-space, which makes it possible to solve problems of technical diagnosis, generative machine learning, and similarity-difference search between processes and phenomena. A failure-driven management metric \mathbf{T}\oplus \mathbf{F}\oplus \mathbf{L}=0 is introduced, which formalizes all known processes for creating computing, including design and test, cyber-physical and cyber-social computing, and federated and generative ML-computing. The equation of logical analysis \mathbf{S}\oplus \mathbf{D}=\mathbf{a}\cup \mathbf{b}=\mathbf{U} is introduced, which makes it possible to calculate the similarity-difference between objects by means of parallel logical procedures on binary vectors that form the U-universe.
The advantages of a vector universal model for a compact description of ordered processes, phenomena, functions, and structures are defined for the purpose of their parallel analysis. Analytical expressions of logic, which require algorithmically complex calculators, are replaced by output state vectors of elements and digital circuits, oriented toward the parallelism of register logical procedures on regular data structures. A vector-deductive method for synthesizing formulas that propagate input fault lists (data) is proposed, which has a quadratic computational complexity in register operations. A new matrix of deductive vectors has been synthesized, characterized by the following properties: compactness, parallel data processing based on a single read-write transaction in memory, elimination of traditional logic from fault simulation procedures, full automation of its synthesis process, and a focus on technologically solving all problems of technical diagnosis. A new structure of the vector deductive fault simulation sequencer is proposed, which is characterized by easy implementation on a single memory block, eliminates any traditional logic, uses data read-write transactions in memory to generate an output fault vector, and uses data as addresses to process the data itself.
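The similarity-difference idea behind \mathbf{S}\oplus \mathbf{D}=\mathbf{U} can be illustrated in conventional software: for binary vectors, xor marks the differing positions, so a population count of the xor measures difference and its complement measures similarity. This is my illustration under that reading of the abstract, not the authors' implementation.

```c
/* Sketch: similarity-difference between two binary vectors using only
 * register logic (xor + popcount), no arithmetic comparison. */
#include <stdint.h>

/* number of positions where two 64-bit binary vectors differ */
int difference(uint64_t a, uint64_t b) {
    uint64_t d = a ^ b;                   /* 1-bits mark differing cells */
    int count = 0;
    while (d) { d &= d - 1; count++; }    /* clear lowest set bit */
    return count;
}

/* number of positions where the vectors agree, within a 64-cell universe */
int similarity(uint64_t a, uint64_t b) {
    return 64 - difference(a, b);
}
```

The two measures partition the 64-cell universe, mirroring the abstract's claim that similarity and difference together exhaust \mathbf{U}.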
Short Reasons for Long Vectors in HPC CPUs: A Study Based on RISC-V Vizcaino, Pablo; Ieronymakis, Georgios; Dimou, Nikolaos ...
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis,
11/2023
Conference Proceeding
Open Access
For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially available SIMD units process up to 8 double-precision elements with one instruction. The optimal vector width and its impact on CPU throughput due to memory latency and bandwidth remain challenging research areas. This study examines the behavior of four computational kernels on a RISC-V core connected to a customizable vector unit capable of operating on up to 256 double-precision elements per instruction. The four codes have been purposefully selected to represent non-dense workloads: SpMV, BFS, PageRank, and FFT. The experimental setup allows us to measure their performance while varying the vector length, memory latency, and bandwidth. Our results not only show that larger vector lengths allow for better tolerance of limitations in the memory subsystem but also offer hope to code developers beyond dense linear algebra.
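The latency-tolerance argument is easiest to see in SpMV itself: each x[col[j]] access is an independent, data-dependent (gather) load, and a longer vector keeps more of them in flight at once. A scalar CSR sketch (not the paper's code) of the loop a vector unit strip-mines:

```c
/* Sketch: CSR sparse matrix-vector product. The inner loop is what a
 * vector unit strip-mines; x[col[j]] is an indexed (gather) load, and
 * with a 256-element vector, up to 256 such loads overlap, hiding
 * memory latency. */
#include <stddef.h>

void spmv_csr(size_t nrows, const size_t *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t j = rowptr[r]; j < rowptr[r + 1]; j++)
            sum += val[j] * x[col[j]];   /* data-dependent gather */
        y[r] = sum;
    }
}
```

The same structure explains why BFS and PageRank, which are also dominated by indexed loads over irregular neighbor lists, benefit from longer vectors in the study's results.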
A fiber-reinforced composite simulation based on a spring-element model can capture the stress field of composites. Since the simulation can calculate the maximum stress in composites, it is essential for discovering more reliable composite structures, and acceleration of the simulation is required. However, the linear matrix solver in the simulation requires a long execution time and vast memory bandwidth to solve for the element displacements of the structure, especially for large-scale composites. To accelerate the simulation, this paper proposes a vectorization method for the linear matrix solver on a vector computer. By applying loop optimizations such as splitting, unrolling, distribution, and interchange, efficient vector computing is realized. The performance evaluation shows that the optimizations achieve more than 15.5× speedup on the vector computer. Furthermore, the optimized simulation on a vector computer achieves a 2.4× to 3.7× speedup compared with the original version on an Intel Xeon.
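As an example of the kind of loop optimization the abstract names (illustrative only; the actual solver loops are not given in the abstract), loop interchange can turn a strided inner loop into a contiguous, vectorizable one. For a column-major matrix-vector product, making the row index the inner loop yields stride-1 access:

```c
/* Sketch (hypothetical, not the paper's solver): loop interchange to
 * expose a long, stride-1 inner loop to a vector unit. With j outer
 * and i inner, accesses to the column-major matrix a are contiguous. */
#include <stddef.h>

#define N 64  /* illustrative problem size */

/* y = A * x for a column-major N x N matrix a */
void matvec_colmajor(const double a[N * N], const double *x, double *y) {
    for (size_t i = 0; i < N; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < N; j++)         /* outer loop: columns */
        for (size_t i = 0; i < N; i++)     /* inner loop: contiguous in a */
            y[i] += a[j * N + i] * x[j];
}
```

With i inner, both a[j*N + i] and y[i] are unit-stride, so the inner loop maps directly onto long vector loads and stores; the j-inner order would instead stride through a by N elements per iteration.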