OpenCL programs typically employ complex storage models and diverse data types, and they manifest various memory access patterns, which makes it challenging to detect performance problems effectively. However, few research efforts have been dedicated to this challenge so far. In this paper, we introduce CVFuzz, a domain-independent tool that can effectively detect and locate algorithmic complexity vulnerabilities in OpenCL kernels. The key enabling idea is to leverage automatically generated pathological inputs to trigger worst-case behavior during the execution of OpenCL kernels. Our approach uses metrics such as code coverage and run time to guide the generation of inputs that slow down the execution of a given OpenCL kernel. We evaluate CVFuzz on more than 250 real-world OpenCL kernels. The evaluation results demonstrate that the inputs generated by CVFuzz are effective in detecting worst-case algorithmic complexity and optimization vulnerabilities.
•We present a tool that can detect algorithmic complexity vulnerabilities in OpenCL kernels.
•We present two methods to detect optimization vulnerabilities in OpenCL kernels.
•We evaluate the effectiveness of CVFuzz on a collection of open-source OpenCL applications deployed on GPUs.
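The feedback loop described above (mutate inputs, keep those that increase run time) can be reduced to a few lines. The following is a minimal illustration, not CVFuzz itself: the toy insertion-sort "kernel", the mutation operators, and all function names are our own stand-ins.

```python
import random

def measure(data):
    """Stand-in for launching and timing an OpenCL kernel: counts the element
    shifts of an insertion sort, whose cost depends heavily on input order."""
    steps = 0
    a = list(data)
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
            steps += 1
    return steps

def mutate(data, rng):
    """Randomly overwrite one element or swap two positions."""
    a = list(data)
    if rng.random() < 0.5:
        a[rng.randrange(len(a))] = rng.randrange(100)
    else:
        i, j = rng.randrange(len(a)), rng.randrange(len(a))
        a[i], a[j] = a[j], a[i]
    return a

def slowdown_fuzz(seed_input, rounds=2000, seed=0):
    """Greedy feedback loop: keep any mutant that increases the cost metric."""
    rng = random.Random(seed)
    best = list(seed_input)
    best_cost = measure(best)
    for _ in range(rounds):
        cand = mutate(best, rng)
        cost = measure(cand)
        if cost > best_cost:
            best, best_cost = cand, cost
    return best, best_cost

# a sorted list is the best case for the toy kernel; the fuzzer drifts away from it
worst, cost = slowdown_fuzz(list(range(32)))
```

A real slowdown fuzzer would replace `measure` with actual kernel launches and fold code-coverage feedback into the acceptance decision, as the metric-guided generation in the abstract suggests.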
The increased computing power of modern hybrid supercomputers is expanding the applicability of scale-resolving simulations in scientific and industrial applications. This paper is devoted to such a supercomputer simulation technology for turbomachinery. A heterogeneous parallel algorithm is outlined and its parallel performance is demonstrated on various hybrid supercomputers by executing a single simulation on dozens of central and graphics processing units. The scale-resolving numerical simulation of the flow in a linear cascade of T106C high-lift low-pressure turbine blades is considered. The presence of a laminar–turbulent transition on the suction side of the blade, caused by the separation of the boundary layer, leads to a significant increase in the loss of kinetic energy. The effect of increasing losses with a decrease in the Reynolds number is captured. A strong dependence of the numerical results on the incoming turbulent flow conditions is observed. The effect of imposing synthetic turbulence at the inflow is studied, as well as the mesh resolution required to capture the laminar–turbulent transition well.
•Scale-resolving simulations in turbomachinery applications are considered.
•Multilevel MPI+OpenMP+OpenCL parallelization is used for heterogeneous computing.
•Simulation results for the flow in a cascade of T106C turbine blades are presented.
•The effect of increasing losses with decreasing Reynolds number is captured.
•A strong influence of the inlet turbulent flow parameters is observed.
Given large-scale multi-dimensional data (e.g., (user, movie, time; rating) for movie recommendations), how can we extract latent concepts/relations of such data? Tensor factorization has been widely used to solve such problems with multi-dimensional data, which are modeled as tensors. However, most tensor factorization algorithms exhibit limited scalability and speed since they require huge memory and heavy computational costs while updating factor matrices. In this paper, we propose GTA, a general framework for Tucker factorization on heterogeneous platforms. GTA performs alternating least squares with a row-wise update rule in a fully parallel way, which significantly reduces memory requirements for updating factor matrices. Furthermore, GTA provides two algorithms: GTA-PART for partially observable tensors and GTA-FULL for fully observable tensors, both of which accelerate the update process using GPUs and CPUs. Experimental results show that GTA exhibits a 5.6–44.6× speed-up for large-scale tensors compared to the state-of-the-art. In addition, GTA scales near-linearly with the number of GPUs and computing nodes used in the experiments.
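The row-wise update rule that enables this parallelism can be illustrated in the 2-way (matrix) special case of Tucker/ALS: with one factor fixed, every row of the other factor is obtained from an independent small ridge-regression solve, so all rows can be updated in parallel with only a small per-row working set. A pure-Python sketch for the fully observed case (all names and the tiny example are ours, not GTA code):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def update_rows(R, U, V, lam=0.1):
    """Update every row of U with V fixed: row u solves
    (V^T V + lam*I) u = V^T r_u.  Rows are independent, hence parallel."""
    k = len(V[0])
    G = [[sum(V[i][a] * V[i][b] for i in range(len(V))) + (lam if a == b else 0.0)
          for b in range(k)] for a in range(k)]
    for u in range(len(U)):
        rhs = [sum(V[i][a] * R[u][i] for i in range(len(V))) for a in range(k)]
        U[u] = solve(G, rhs)

# tiny example: factorize a rank-1 3x3 matrix with k=2 factors
R = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
U = [[1.0, 0.5] for _ in range(3)]
V = [[1.0, 0.5] for _ in range(3)]
for _ in range(20):
    update_rows(R, U, V)                                # fix V, update U
    update_rows([list(c) for c in zip(*R)], V, U)       # fix U, update V
approx = [[sum(U[i][a] * V[j][a] for a in range(2)) for j in range(3)]
          for i in range(3)]
```

In GTA the same row independence is what lets the update be distributed across GPU threads and computing nodes; the N-way tensor case replaces `V` with a Khatri–Rao-style combination of the other factors.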
In recent years, the heterogeneity of both commodity and supercomputer hardware has increased sharply. Accelerators such as GPUs or Intel Xeon Phi co-processors are often key to improving the speed and energy efficiency of highly parallel codes. However, due to the complexity of heterogeneous architectures, optimizing codes for a certain type of architecture, as well as porting codes across different architectures while maintaining a comparable level of performance, can be extremely challenging. To address the challenges associated with performance optimization and performance portability, autotuning has gained a lot of interest. Autotuning of performance-relevant source-code parameters makes it possible to tune applications automatically, without hard-coding optimizations, and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data change. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead.
Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain.
•Introduces dynamic autotuning of OpenCL or CUDA kernels with KTT framework.
•Introduces a set of ten highly-efficient tunable benchmarks.
•Evaluates benchmarks’ performance portability using various GPUs, CPU, and Xeon Phi.
•Demonstrates dynamic autotuning with a real-world application.
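Reduced to its core, the offline-tuning workflow is a search over a parameter space that keeps the fastest configuration; dynamic tuning re-runs the same search at application runtime when inputs change. A schematic sketch, assuming a toy analytical cost model in place of real kernel timings (KTT's actual API and search strategies differ):

```python
import itertools

def tune(cost_of, space):
    """Try every configuration in the tuning space, keep the cheapest.
    `cost_of` stands in for launching and timing a kernel with those
    parameters; a runtime tuner would call this between real invocations."""
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*space.values()):
        cfg = dict(zip(space, values))
        c = cost_of(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost

# toy analytical cost model: fastest at block_size 128 and vector width 4
def toy_cost(cfg):
    return abs(cfg["block_size"] - 128) / 128 + abs(cfg["vector_width"] - 4)

# a rationally designed (hence small) space: 16 configurations in total
space = {"block_size": [32, 64, 128, 256], "vector_width": [1, 2, 4, 8]}
best, cost = tune(toy_cost, space)
# best == {"block_size": 128, "vector_width": 4}
```

A 16-point space like this one can be exhaustively re-searched at runtime with negligible overhead, which is the paper's point about moderately sized, rationally designed tuning spaces.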
Heterogeneous systems have become one of the most common architectures today, thanks to their excellent performance and energy consumption. However, due to their heterogeneity, they are very complex to program, and it is even harder to achieve performance portability across different devices. This paper presents EngineCL, a new OpenCL-based runtime system that greatly simplifies the co-execution of a single massive data-parallel kernel on all the devices of a heterogeneous system. It performs a set of low-level tasks regarding the management of devices, their disjoint memory spaces, and the scheduling of the workload between the system devices, while providing a layered API. EngineCL has been validated on two compute nodes (an HPC system and a commodity system) that combine six devices with different architectures. Experimental results show that it has excellent usability compared with OpenCL; a maximum overhead of 2.8% compared to the native version under loads of less than a second of execution, with a tendency towards zero for longer execution times; and it can reach an average efficiency of 0.89 when balancing the load.
•Performance portability is hard to maintain between heterogeneous devices.
•EngineCL is an OpenCL-based runtime system to manage heterogeneous systems.
•EngineCL simplifies and load balances a massive data-parallel kernel execution.
•OpenCL drivers and architectures have complexities to be abstracted and optimized.
•EngineCL is validated with high usability and low performance overhead.
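The load-balancing idea behind such co-execution can be illustrated with a static proportional split: give each device a contiguous chunk of the kernel's index space sized by its measured throughput, so all devices finish at roughly the same time. A sketch under that simplifying assumption (EngineCL's own schedulers are more elaborate, including dynamic strategies):

```python
def split_work(total_items, throughputs):
    """Split a 1-D data-parallel range among devices proportionally to their
    measured throughputs (items/s); returns (offset, count) per device."""
    total_tp = sum(throughputs)
    offsets, start = [], 0
    for i, tp in enumerate(throughputs):
        if i == len(throughputs) - 1:
            count = total_items - start        # last device takes the remainder
        else:
            count = round(total_items * tp / total_tp)
        offsets.append((start, count))
        start += count
    return offsets

# e.g. a discrete GPU 8x faster than the CPU and an iGPU 2x faster
chunks = split_work(1_000_000, [8.0, 2.0, 1.0])
# -> [(0, 727273), (727273, 181818), (909091, 90909)]
```

Each (offset, count) pair would then be enqueued as an NDRange sub-kernel on the corresponding device, with the runtime handling the disjoint memory spaces.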
Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As usual, the efficiency of such calculations is based on an interplay of software and hardware that is nowadays moving to hybrid GPU-based technologies. Several well-developed open-source MD codes focused on GPUs differ both in their data management capabilities and in performance. In this work, we analyze the performance of the LAMMPS, GROMACS, and OpenMM MD packages with different GPU backends on Nvidia Volta and AMD Vega20 GPUs. We consider the efficiency of solving two identical MD models (generic for materials science and biomolecular studies) using different software and hardware combinations. We describe our experience in porting the CUDA backend of LAMMPS to ROCm HIP, which shows considerable benefits for AMD GPUs compared to the OpenCL backend.
•FPGA performance characterization using parametrizable OpenCL benchmarks.
•High resource utilization and performance on Intel and Xilinx FPGAs.
•Ability to also measure the performance of devices and tools.
Emerging high-level tools lead to a reduced development time for applications on FPGA accelerators while still producing high-quality results. This is one reason for the increased adoption of FPGAs in data center applications, which emphasizes the need for a benchmark suite to enable the comparison of FPGA architectures, programming tools, runtimes, and libraries.
Because such a benchmark suite has been lacking, we have developed an OpenCL-based open-source implementation of the HPCC benchmark suite for Xilinx and Intel FPGAs. In an in-depth evaluation, we show that the benchmarks allow us to quantify the impact of HBM2 memory in comparison to FPGAs with DDR and to analyze differences in the arithmetic units of current FPGA architectures. Power measurements indicate that not all benchmark implementations can utilize the full potential of the FPGAs in terms of power efficiency. We are continuing to optimize and port the benchmarks for new generations of FPGAs and design tools, and we encourage active participation to create a valuable tool for the community.
This paper presents a technique for repairing data-race and barrier-divergence errors in GPU kernels written in CUDA or OpenCL. Our novel extension to prior work can also remove barriers that are deemed unnecessary for correctness. We implement these ideas in our tool called GPURepair, which uses GPUVerify as the verification oracle for GPU kernels. We also extend GPUVerify to support CUDA Cooperative Groups, allowing GPURepair to suggest inter-block synchronization for repairing a CUDA kernel if deemed necessary. To the best of our knowledge, GPURepair is the only tool that can propose a fix for intra-block data races and barrier divergence errors for both CUDA and OpenCL kernels. It is also the only tool that can propose fixes for inter-block data races in CUDA kernels. We perform extensive experiments on about 750 kernels and provide a comparison with prior work. We demonstrate the superiority of GPURepair through its capability to fix more kernels and its unique ability to remove redundant barriers and handle inter-block data races. We have also enhanced the initial version of GPURepair to support incremental solving during the repair process. This enhancement improves the performance of GPURepair by about 25% on the test suite that we have used.
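The class of bug such a tool repairs can be mimicked on the CPU: below, Python threads stand in for the work-items of one work-group, and `threading.Barrier` plays the role of OpenCL's `barrier(CLK_LOCAL_MEM_FENCE)`. Removing the `barrier.wait()` line reintroduces exactly the kind of intra-block data race that a repair tool would fix by inserting a barrier (a conceptual sketch, not GPURepair output):

```python
import threading

def kernel(tid, n, shared, barrier, out):
    """Each 'work-item' writes its own slot of shared memory, synchronizes,
    then reads its neighbor's slot. Without the barrier, the read races
    with the neighbor's write and may observe a stale value."""
    shared[tid] = tid * tid
    barrier.wait()                     # the fix a repair tool would insert
    out[tid] = shared[(tid + 1) % n]

n = 8
shared, out = [0] * n, [0] * n
barrier = threading.Barrier(n)
threads = [threading.Thread(target=kernel, args=(t, n, shared, barrier, out))
           for t in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# with the barrier, out[t] == ((t + 1) % n) ** 2 deterministically
```

Deciding where such a barrier is required (and where an existing one is redundant) is what the verification oracle automates.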
In this paper, AQUAgpusph, a new free Smoothed Particle Hydrodynamics (SPH) software package accelerated with OpenCL, is described. The main differences and advances with respect to other existing alternatives are considered: the use of the Open Computing Language (OpenCL) framework instead of the Compute Unified Device Architecture (CUDA), the implementation of the most popular boundary conditions, the easy customization of the code to different problems, the extensibility through Python scripts, and the runtime output, which allows the tracking of simulations in real time, or a higher frequency in saving some results, without a significant performance loss. These modifications are shown to improve the solver speed and the quality of the results, and to allow for a wider range of applications. AQUAgpusph has been designed to provide researchers and engineers with a valuable tool to test and apply the SPH method. Three practical applications are discussed in detail. The evolution of a dam break is used to quantify and compare the computational performance and modeling accuracy with the most popular GPU-accelerated SPH alternatives. The dynamics of a coupled system, a Tuned Liquid Damper (TLD), is discussed in order to show the integration capabilities of the solver with external dynamics. Finally, the sloshing flow inside a nuclear reactor is simulated in order to show the capabilities of the solver to treat 3-D problems with complex geometries of industrial interest.
Program title: AQUAgpusph 1.5
Catalogue identifier: AEVG_v1_0
Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEVG_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: GNU General Public License, version 3
No. of lines in distributed program, including test data, etc.: 1702666
No. of bytes in distributed program, including test data, etc.: 75117178
Distribution format: tar.gz
Programming language: C++, OpenCL, Python.
Computer: Linux based computers with OpenCL support.
Operating system: Linux.
Has the code been vectorized or parallelized?: Code is parallelized with OpenCL.
Classification: 1.5.
Nature of problem: Fluid dynamics problems with complex geometry or a heavily fragmented free surface, where mesh-based methods cannot be successfully applied.
Solution method: SPH is a meshless method in which the fluid domain is discretized as a set of fluid particles. The fields in the fluid domain are smoothed using a kernel function, which makes it possible to evaluate differential operators from the flow field values at scattered sets of particles.
Running time: Using an AMD HD-7970 graphics device, 2×10^5 time steps of a 2-D simulation with 10^5 particles and 8×10^2 neighbors per particle require around 9 h of computation. A more detailed performance analysis is carried out in the practical application section herein.
The improved k-nearest neighbor (KNN) algorithm based on class contribution and feature weighting (DCT-KNN) is a highly accurate approach. However, it requires complex computational steps that consume much time during the classification process. A field-programmable gate array (FPGA) can be used to overcome this drawback; however, using a traditional hardware description language (HDL) to implement FPGA-based accelerators requires a long design time. Fortunately, the Open Computing Language (OpenCL) high-level parallel programming tool allows rapid and effective design of FPGA-based hardware accelerators. In this study, OpenCL has been used to speed up the DCT-KNN algorithm on the FPGA parallel computing platform by applying numerous parallelization and optimization techniques. The optimized version of the improved KNN could be used in various engineering problems that require a high-speed classification process. Classification of the COVID-19 disease is the case study used to examine this work. The experimental results show that implementing the DCT-KNN algorithm on the FPGA platform (an Intel De5a-net Arria-10 device was used) gives extremely high performance compared to a traditional single-core CPU-based implementation. The execution time of our optimized design on the FPGA accelerator is 44 times faster than the conventional design implemented on a regular CPU-based computational platform.
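For reference, the baseline that DCT-KNN extends is the plain distance-vote KNN below; the class-contribution and feature-weighting refinements, as well as the FPGA pipelining, build on the fact that each training-point distance is computed independently. A hedged sketch (names and data are illustrative only, not the paper's implementation):

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Plain k-nearest-neighbor vote by Euclidean distance. Each distance is
    independent of the others, which maps directly onto parallel OpenCL
    work-items or a deep FPGA pipeline."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# two well-separated toy clusters
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
         (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
pred = knn_predict(train, labels, (4.8, 5.1))   # query near the "pos" cluster
```

On an accelerator, the expensive steps are the distance computation (fully parallel) and the top-k selection (a reduction), which is where most of the reported speed-up comes from.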