The improved k-nearest neighbor (KNN) algorithm based on class contribution and feature weighting (DCT-KNN) is a highly accurate approach. However, it requires complex computational steps that make the classification process time-consuming. A field-programmable gate array (FPGA) can be used to overcome this drawback, but implementing FPGA-based accelerators in a traditional hardware description language (HDL) requires long design times. Fortunately, the Open Computing Language (OpenCL) high-level parallel programming framework enables rapid and effective design of FPGA-based hardware accelerators. In this study, OpenCL is used to speed up the DCT-KNN algorithm on an FPGA parallel computing platform by applying numerous parallelization and optimization techniques. The optimized version of the improved KNN can be used in various engineering problems that require fast classification. Classification of COVID-19 disease is the case study used to evaluate this work. The experimental results show that implementing the DCT-KNN algorithm on the FPGA platform (a DE5a-Net Arria 10 device was used) gives extremely high performance compared to a traditional single-core CPU-based implementation: our optimized design on the FPGA accelerator runs 44 times faster than the conventional design on a regular CPU-based computational platform.
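The abstract does not spell out the DCT-KNN weighting scheme, but the general idea of combining per-feature weighting with distance-weighted neighbor voting can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the feature-weight vector and the 1/distance voting rule are assumptions.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, labels, query, k=3, weights=None):
    """Classify `query` by distance-weighted voting among its k nearest
    training samples; `weights` optionally scales each feature
    (illustrative stand-in for DCT-KNN's feature weighting)."""
    w = weights if weights is not None else [1.0] * len(query)
    # Feature-weighted Euclidean distance to every training sample.
    dists = []
    for x, y in zip(train, labels):
        d = math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, query)))
        dists.append((d, y))
    dists.sort(key=lambda t: t[0])
    # Each neighbor votes with weight 1/(d + eps): closer samples count more.
    votes = defaultdict(float)
    for d, y in dists[:k]:
        votes[y] += 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)
```

On an FPGA, the distance loop is the hot spot: it is fully data-parallel over training samples, which is what makes an OpenCL pipeline attractive.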
Modern processors are, in addition to general-purpose cores, equipped with specialized hardware units, such as processor-integrated GPUs (iGPUs), which are used in the investigations of this article. An iGPU is directly connected to the cores and provides several benefits, including low cost and energy efficiency. For the execution of scientific applications on iGPUs, the OpenCL framework is a suitable choice. In this article, we consider the modified Gram–Schmidt process for vector orthogonalization, which computes a QR decomposition, and develop several OpenCL program variants to be executed on an iGPU. The performance and energy consumption of the Gram–Schmidt OpenCL program variants are investigated on two different processor architectures with a Gen7.5 and a Gen9 iGPU architecture. The program variants result from various modifications, such as the use of local memory, SIMD data types, and the avoidance of copy operations. Additionally, we show how the use of OpenCL SIMD data types and the avoidance of copy operations influences the energy consumption of the cores and the iGPU.
•Several OpenCL program variants of a modified Gram–Schmidt process are presented.
•Energy consumption of cores and integrated GPU are analyzed independently.
•Integrated GPUs prove to be well suited for scientific heterogeneous computing.
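As a reference point for the kernels discussed above, a minimal scalar sketch of the modified Gram–Schmidt process (the sequential baseline that the OpenCL variants parallelize; all names are illustrative) might look like this:

```python
def mgs_qr(A):
    """Modified Gram–Schmidt QR of a matrix given as a list of columns.
    Returns (Q, R) with Q's columns orthonormal and A = Q * R."""
    n = len(A)
    Q = [list(col) for col in A]          # work on a copy of the columns
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Normalize column j.
        R[j][j] = sum(x * x for x in Q[j]) ** 0.5
        Q[j] = [x / R[j][j] for x in Q[j]]
        # Orthogonalize the remaining columns against q_j immediately;
        # this immediate update is what distinguishes the *modified*
        # from the classical Gram-Schmidt process (better stability).
        for k in range(j + 1, n):
            R[j][k] = sum(qi * xi for qi, xi in zip(Q[j], Q[k]))
            Q[k] = [xi - R[j][k] * qi for xi, qi in zip(Q[k], Q[j])]
    return Q, R
```

The inner orthogonalization loop over the remaining columns is independent per column, which is the parallelism the OpenCL variants exploit with work-groups, local memory, and SIMD data types.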
We have proposed, for the first time, an OpenCL implementation for the all-electron density-functional perturbation theory (DFPT) calculations in FHI-aims, which can effectively compute all of its time-consuming simulation stages, i.e., the real-space integration of the response density, the Poisson solver for the calculation of the electrostatic potential, and the response Hamiltonian matrix, by utilizing various heterogeneous accelerators. Furthermore, to fully exploit the massively parallel computing capabilities, we have performed a series of general-purpose graphics processing unit (GPGPU)-targeted optimizations that significantly improve execution efficiency by reducing register requirements, branch divergence, and memory transactions. Evaluations on the Sugon supercomputer have shown that notable speedups can be achieved across various materials.
Exchanging halo data is a common task in modern scientific computing applications, and efficient handling of this operation is critical for the performance of the overall simulation. Tausch is a novel header-only library that provides a simple API for efficiently handling these types of data movements. Tausch supports not only simple CPU-only systems but also more complex heterogeneous systems with both CPUs and GPUs. It currently supports both OpenCL and CUDA for communicating with GPGPU devices, and allows for communication between GPGPUs and CPUs. The API allows for drop-in replacement in existing codes and can be used for the communication layer in new codes. This paper provides an overview of the approach taken in Tausch, and a performance analysis that demonstrates expected and achieved performance. We highlight the ease of use and performance with three applications: first, Tausch is compared to the halo exchange framework from two Mantevo applications, HPCCG and miniFE, and then it is used to replace a legacy halo exchange library in the flexible multigrid solver framework Cedar.
•Efficient and flexible halo exchange library for large heterogeneous supercomputers.
•API for handling data movements across GPGPUs using CUDA and OpenCL.
•Use of MPI for communication across compute nodes.
•Simple and flexible C/C++ API and Fortran module.
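Tausch's actual API is not reproduced here, but the pack/exchange/unpack pattern that any halo-exchange layer implements can be sketched for a 1-D decomposition. In this toy version the neighbors live in-process instead of on other MPI ranks, and all names are illustrative:

```python
def exchange_halos(subdomains, halo=1):
    """Pack/exchange/unpack halo cells between 1-D subdomains stored as
    [ghost | interior | ghost] lists. In a real code the exchange step
    would be MPI point-to-point messages (or device buffer copies for
    GPUs); here neighboring subdomains are exchanged in-process."""
    # Pack: copy the outermost interior cells into send buffers.
    sends = []
    for d in subdomains:
        left = d[halo:2 * halo]            # interior cells next to left ghost
        right = d[-2 * halo:-halo]         # interior cells next to right ghost
        sends.append((left, right))
    # Exchange + unpack: each ghost region receives the neighbor's edge cells.
    for i, d in enumerate(subdomains):
        if i > 0:                          # receive left neighbor's right edge
            d[:halo] = sends[i - 1][1]
        if i < len(subdomains) - 1:        # receive right neighbor's left edge
            d[-halo:] = sends[i + 1][0]
    return subdomains
```

The separation into pack, exchange, and unpack phases is what lets a library like Tausch overlap communication with computation and target CPU and GPU buffers through one interface.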
Irregular applications can be found in different scientific fields. In computer-aided drug design, molecular docking simulations play an important role in finding promising drug candidates. AutoDock is a software application widely used for predicting molecular interactions at close distances. It is characterized by irregular computations and long execution runtimes. In recent years, a hardware-accelerated version of AutoDock, called AutoDock-GPU, has been under active development. This work benchmarks the recent code and algorithmic enhancements incorporated into AutoDock-GPU. In particular, we analyze the impact on execution runtime of techniques based on early termination. These enable AutoDock-GPU to explore the molecular space as necessary, while safely avoiding redundant computations. Our results indicate that it is possible to achieve average runtime reductions of 50% by using these techniques. Furthermore, a comprehensive literature review is also provided, in which our work is compared to relevant approaches leveraging hardware acceleration for molecular docking.
•Irregular docking computations in AutoDock can be effectively accelerated using GPUs.
•Early termination of molecular searches can reduce AutoDock-GPU runtimes.
•Parallelism in molecular docking can be leveraged on various hardware accelerators.
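AutoDock-GPU's exact termination criteria are not detailed in the abstract. A generic sketch of score-stagnation-based early termination for an iterative stochastic search is shown below; the function names and the patience/tolerance criterion are assumptions, not the tool's actual logic:

```python
import random

def search_with_early_termination(score_fn, propose, init,
                                  max_gens=1000, patience=50, tol=1e-6):
    """Iterative stochastic search that stops early once the best score
    has not improved by more than `tol` for `patience` generations,
    avoiding redundant generations after convergence."""
    best = init
    best_score = score_fn(init)
    stale = 0
    for gen in range(max_gens):
        cand = propose(best)
        s = score_fn(cand)
        if s < best_score - tol:          # lower score = better solution
            best, best_score, stale = cand, s, 0
        else:
            stale += 1
            if stale >= patience:         # search has stagnated; stop early
                return best, best_score, gen + 1
    return best, best_score, max_gens
```

The reported ~50% average runtime reductions come precisely from cutting off generations that no longer improve the best docking score.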
Note: As shown in the graphical abstract, the growth morphologies of the eutectic with and without flow are compared. Forced convection has a pronounced effect on the growth of the eutectic structure. The parallel GPU implementation of the phase-field method achieved a performance gain of more than 100 times.
•The phase field and LBM are coupled to establish a PF-LBM model.
•The growth of binary alloy eutectic under forced convection was simulated.
•The feasibility of the PF-LBM simulation method based on GPU computing is proved.
•The PF-LBM model implemented in GPU using OpenCL programming is relatively novel.
•The research has achieved performance gains of two orders of magnitude.
During the solidification of alloys, numerical simulation of eutectic growth under forced convection is essential for accurate prediction and control of the solidified microstructure. A phase-field lattice Boltzmann model (PF-LBM) is established by coupling the KKSM phase-field model with an LBM model for the fluid flow. Because the PF-LBM model is computationally expensive, the simulation region would otherwise be small and could not be solved in a reasonable time; a parallel computing implementation based on OpenCL exploits the computing power of NVIDIA GPUs to accelerate the algorithm and improve the efficiency of the numerical simulation. The effect of forced convection on the eutectic growth of the CBr4-C2Cl6 eutectic alloy is systematically studied. The results show that liquid flow alters the eutectic evolution by affecting the solute distribution ahead of the eutectic solid/liquid interface. Compared with serial CPU code, a single GPU achieves a maximum speedup of 136×. This resolves the computational limitations that previously prevented 3-D numerical simulation of eutectic growth under forced convection.
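To give a flavor of the lattice Boltzmann component of such a model, a minimal collide-and-stream step for a D1Q3 diffusion lattice is sketched below. This is far simpler than the coupled PF-LBM model above and purely illustrative; the coupling to the phase field and the forcing terms are omitted:

```python
def lbm_diffusion_step(f, tau=1.0):
    """One collide-and-stream step of a minimal D1Q3 lattice Boltzmann
    model for pure diffusion. f is a list of 3 equal-length lists holding
    the rest (c=0), right-moving (c=+1) and left-moving (c=-1)
    populations on a periodic 1-D lattice."""
    w = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)   # D1Q3 lattice weights
    n = len(f[0])
    # Collision: BGK relaxation toward the local equilibrium w_i * rho.
    rho = [f[0][x] + f[1][x] + f[2][x] for x in range(n)]
    for i in range(3):
        for x in range(n):
            f[i][x] += (w[i] * rho[x] - f[i][x]) / tau
    # Streaming: moving populations hop to the neighboring node.
    f[1] = [f[1][(x - 1) % n] for x in range(n)]   # right-movers
    f[2] = [f[2][(x + 1) % n] for x in range(n)]   # left-movers
    return f
```

Both phases are local or nearest-neighbor operations over the whole lattice, which is why LBM maps so well onto one-GPU-thread-per-node OpenCL kernels.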
Disparity estimation is an essential task in many light-field applications. Due to the complexity of the algorithms and the high-dimensional nature of light-field data, performing this task involves significant computational effort and results in very long processing times on CPUs. Graphics processing units (GPUs), which are capable of massively parallel processing, are a promising solution to meet the computational requirements and speed up the task. In this paper, we develop a GPU-accelerated approach for light-field disparity estimation using a variational computation framework (GVLD). Our algorithm combines the intrinsic sub-pixel precision of the variational formulation with the effectiveness of weighted median filtering to produce a highly accurate solution. The proposed algorithm is fully parallelized and optimized for implementation using the OpenCL framework. An intensive evaluation, including a quantitative comparison to related works and a detailed analysis of the proposed approach's performance, is presented. Experimental results demonstrate superior performance compared to state-of-the-art approaches: the proposed approach is more than 10 times faster than other approaches running on a similar GPU platform and provides the most accurate solution among optimization-based approaches. Compared to the implementation running on a CPU, our GPU-accelerated method achieves up to a 365× speedup.
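The weighted median filtering mentioned above can be illustrated in isolation. A scalar sketch follows (the aggregation into a sliding filter over disparity-map windows, and the choice of weights, are omitted):

```python
def weighted_median(values, weights):
    """Weighted median: the smallest value whose cumulative weight
    reaches half the total weight. Unlike a weighted mean, it rejects
    outliers, which is why it is used to regularize disparity maps
    while preserving depth edges."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]
```

In a disparity filter, `values` would be the disparities in a window around a pixel and `weights` a similarity measure (e.g., color affinity) to the center pixel; each pixel's filter is independent, so the operation parallelizes trivially in OpenCL.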
Research on in-memory big data management and processing has been prompted by the increase in main memory capacity and the explosion in big data. By offering an efficient in-memory distributed execution model, existing in-memory cluster computing platforms such as Flink and Spark have proven to be outstanding for processing big data. This paper proposes FlinkCL, an in-memory computing architecture on heterogeneous CPU-GPU clusters based on OpenCL that enables Flink to utilize the GPU's massive parallel processing ability. Our proposed architecture utilizes four techniques: a heterogeneous distributed abstract model (HDST), a Just-In-Time (JIT) compiling schema, a hierarchical partial reduction (HPR), and a heterogeneous task management strategy. Using FlinkCL, programmers only need to write Java code with simple interfaces. The Java code can be compiled to OpenCL kernels and executed on CPUs and GPUs automatically. In the HDST, a novel memory mapping scheme is proposed to avoid serialization and deserialization between Java Virtual Machine (JVM) objects and OpenCL structs. We have comprehensively evaluated FlinkCL with a set of representative workloads to show its effectiveness. Our results show that FlinkCL improves performance by up to 11× for some computationally heavy algorithms and maintains minor performance improvements for an I/O-bound algorithm.
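The serialization-avoiding memory mapping idea can be illustrated with a toy fixed-layout record buffer: records are laid out contiguously in the exact byte layout a device kernel expects, so the device reads them by offset arithmetic with no per-object (de)serialization. The struct layout and names here are hypothetical, not FlinkCL's actual scheme:

```python
import struct

# Hypothetical layout matching an OpenCL struct { int id; float x, y; },
# little-endian: int32 followed by two float32 fields.
RECORD = struct.Struct("<iff")

def pack_records(records):
    """Flatten (id, x, y) tuples into one contiguous byte buffer with the
    same array-of-structs layout the device kernel expects."""
    buf = bytearray(RECORD.size * len(records))
    for i, rec in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, *rec)
    return bytes(buf)

def read_record(buf, i):
    """Random access by offset arithmetic, no per-object deserialization."""
    return RECORD.unpack_from(buf, i * RECORD.size)
```

A buffer in this form can be handed to the device as-is, which is the property the HDST mapping exploits between JVM objects and OpenCL structs.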
The Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that the CRBM is a suitable mechanism for modeling multidimensional time series such as human motion, workload characterization, and city traffic analysis. The learning and inference processes of these systems rely on linear algebra functions like matrix–matrix multiplication, which for larger data sets are very compute-intensive.
In this paper, we present a configurable framework for CRBM-based workloads with arbitrarily large models. We show how to accelerate the learning process of the CRBM with FPGAs and OpenCL, and we conduct an extensive scalability study for different model sizes and system configurations. We show a significant improvement in performance/Watt for large models and batch sizes (from 1.51× up to 5.71×, depending on the host configuration) when FPGA and OpenCL are used for acceleration, and limited benefits for small models compared to the state-of-the-art CPU solution.
•A parametrizable framework for CRBM applications based on OpenCL for FPGA.
•Implementation of GEMM on FPGA.
•Optimization of the CPU (host) code to support usage of FPGA GEMM designs.
•CRBM-based application scalability study.
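The GEMM kernel at the heart of CRBM training is typically tiled so that sub-blocks of the operands fit in fast on-chip memory. A plain blocked matrix multiply illustrating the idea (not the authors' FPGA design, where the tiles would map to BRAM and a systolic compute array) is:

```python
def gemm_blocked(A, B, tile=2):
    """Blocked matrix multiply C = A * B over `tile`-sized sub-blocks.
    Tiling is the key transformation behind FPGA/GPU GEMM kernels:
    each tile of C is accumulated from small tiles of A and B that fit
    in fast local memory, maximizing data reuse per off-chip access."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for ii in range(0, m, tile):
        for jj in range(0, n, tile):
            for kk in range(0, k, tile):
                # Multiply one tile pair and accumulate into the C tile.
                for i in range(ii, min(ii + tile, m)):
                    for j in range(jj, min(jj + tile, n)):
                        s = 0.0
                        for p in range(kk, min(kk + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] += s
    return C
```

On an FPGA, the tile size becomes a compile-time parameter of the OpenCL kernel, traded off against the available on-chip memory and DSP resources, which is what makes such a framework parametrizable across model sizes.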