Using OpenCL. Kowalik, J.; Puźniakowski, T.
2012, Volume: 21
eBook
Peer reviewed
In 2011 many computer users were exploring the opportunities and the benefits of the massive parallelism offered by heterogeneous computing. In 2000 the Khronos Group, a not-for-profit industry consortium, was founded to create standard open APIs for parallel computing, graphics, and dynamic media. Among them is OpenCL, an open system for programming heterogeneous computers with components made by multiple manufacturers. This publication explains how heterogeneous computers work and how to program them using OpenCL. It also describes how to combine OpenCL with OpenGL for displaying graphical effects in real time. Chapter 1 briefly describes two older, highly successful de facto standard parallel programming systems: MPI and OpenMP. Collectively, the MPI, OpenMP, and OpenCL systems cover programming of all major parallel architectures: clusters, shared-memory computers, and the newest heterogeneous computers. Chapter 2, the technical core of the book, deals with OpenCL fundamentals: programming, hardware, and the interaction between them. Chapter 3 adds important information about such advanced issues as double- versus single-precision arithmetic, efficiency, memory use, and debugging. Chapters 2 and 3 contain several code examples and one case study on genetic algorithms. These examples are related to linear algebra operations, which are very common in scientific, industrial, and business applications. Most of the book's examples can be found on the enclosed CD, which also contains basic projects for Visual Studio, MinGW, and GCC. This supplementary material will assist the reader in getting a quick start on OpenCL projects.
•A real-time and accurate 3D measurement method using a monocular 3D sensor based on infrared speckle projection.
•Based on the 3D imaging principle of monocular 3D sensors, a reference plane calibration method is proposed to obtain a high-quality reference speckle image for improving the monocular matching accuracy.
•An optimized semi-global matching (SGM) algorithm using the GPU is presented to achieve efficient and accurate depth reconstruction dynamically.
•Within the measurement range of 0.8 m (length) × 0.5 m (width) × 1 m (depth), the proposed method achieves real-time, single-shot 3D imaging with an accuracy of 1.277 mm at 75 FPS on a GTX 1060 and 15 FPS on an ARM Mali-G52 (mobile platform).
Speckle projection profilometry (SPP), as a promising structured light projection technique, can achieve globally unambiguous 3D measurement by projecting a single random speckle pattern. In addition, the projected speckle pattern is usually etched into the microstructure of a highly integrated Vertical-Cavity Surface-Emitting Laser (VCSEL), which makes the hardware system compact enough to be mounted on mobile devices such as robots. However, since the stereo matching algorithm used in SPP involves high computational overhead, it usually runs in real time only on specially customized hardware platforms such as ASICs/FPGAs, rather than on general-purpose mobile platforms. In this paper, we propose a real-time and accurate 3D measurement method using a monocular 3D sensor based on infrared speckle projection. Similar to Kinect v1, our sensor mainly consists of an IR dot projector and one IR camera for projecting and capturing speckle images synchronously. Low-cost and high-quality speckle projection is achieved by customizing the projection pattern of the VCSEL and using the beam copy function of Diffractive Optical Elements (DOE). Based on the 3D imaging principle of monocular 3D sensors, a reference plane calibration method is proposed to obtain a high-quality reference speckle image for improving the monocular matching accuracy. Then, benefiting from the local memory mechanism and multiple operating synchronizations in the OpenCL environment, an optimized semi-global matching (SGM) algorithm using the GPU is presented to achieve efficient and accurate depth reconstruction dynamically. Within the measurement range of 0.8 m (length) × 0.5 m (width) × 1 m (depth), the proposed method achieves real-time, single-shot 3D imaging with an accuracy of 1.277 mm at 75 FPS on a GTX 1060 and 15 FPS on an ARM Mali-G52 (mobile platform).
•Solution for scale-resolving simulations of subsonic and supersonic turbulent flows.
•Heterogeneous computing on many CPUs and GPUs of hybrid supercomputers.
•Maximum portability with hierarchical MPI + OpenMP + OpenCL parallelization.
•High-accuracy edge-based schemes on unstructured mixed-element meshes.
•Detached eddy simulation with implicit time integration.
A heterogeneous parallel algorithm for the simulation of compressible turbulent flows and its portable software implementation are presented. The underlying numerical method is based on a family of higher-accuracy edge-based reconstruction schemes on unstructured mixed-element meshes. The proposed parallel solution can engage a large number of computing devices spanning most of the computing architectures used in modern supercomputers, including manycore CPUs and GPUs, and is capable of co-execution on CPUs and accelerators simultaneously. The multilevel parallel algorithm combines: MPI for distributing the workload among hybrid cluster nodes and between devices inside nodes; OpenMP for manycore CPUs and other supporting devices, such as the Intel Xeon Phi; and OpenCL for massively parallel accelerators, such as GPUs from various vendors, including NVIDIA, AMD, and Intel. The main focus is on the adaptation of the numerical method and its computational algorithm to the stream-processing parallel paradigm. The very limited device memory inherent in GPU computing is also taken into account. A detailed description of the parallel algorithm is presented, as well as the techniques used for its efficient parallel implementation. Special attention is paid to implicit time integration with its linear solver and to the calculation of convective fluxes and viscous terms. The use of mixed floating-point precision and of overlapping communications with computations is also discussed. Parallel performance is demonstrated in practical applications on different kinds of supercomputers using up to 10 thousand cores and multiple GPUs of comparable overall performance.
Data Parallel C++. Reinders, James; Ashbaugh, Ben; Brodman, James ...
2020
eBook
Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices, including GPUs, CPUs, FPGAs, and AI ASICs, that are suitable to the problems at hand. This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems.
What You'll Learn:
•Accelerate C++ programs using data-parallel programming
•Target multiple device types (e.g., CPU, GPU, FPGA)
•Use SYCL and SYCL compilers
•Connect with computing's heterogeneous future via Intel's oneAPI initiative
Who This Book Is For: Those new to data-parallel programming, and computer programmers interested in data-parallel programming using C++.
AutoDock Vina is one of the most popular molecular docking tools. In the latest benchmark CASF-2016 for comparative assessment of scoring functions, AutoDock Vina achieved the best docking power among all the docking tools. Modern drug discovery commonly faces large virtual screens for drug hits from huge compound databases. Due to the serial nature of the AutoDock Vina algorithm, there has been no successful report of its parallel acceleration with GPUs. Current acceleration of AutoDock Vina typically relies on stacking computing power and on the allocation of resources and tasks, as in the VirtualFlow platform. The vast resource expenditure and the high access threshold for users greatly limit the popularity of AutoDock Vina and the flexibility of its usage in modern drug discovery. In this work, we propose a new method, Vina-GPU, for accelerating AutoDock Vina with GPUs, which is greatly needed for reducing the investment in large virtual screens and for wider application of large-scale virtual screening on personal computers, station servers, cloud computing, etc. Our method is based on a modified Monte Carlo search using a simulated annealing algorithm. It greatly raises the number of initial random conformations and reduces the search depth of each thread. Moreover, the classic BFGS optimizer is adopted to optimize ligand conformations during the docking progress, and a heterogeneous OpenCL implementation was developed to realize parallel acceleration leveraging thousands of GPU cores. Large benchmark tests show that Vina-GPU reaches an average 21-fold and a maximum 50-fold docking acceleration over the original AutoDock Vina while ensuring comparable docking accuracy, indicating its potential for pushing the popularization of AutoDock Vina in large virtual screens.
Data Parallel C++. Reinders, James; Ashbaugh, Ben; Brodman, James ...
2023
eBook
Open access
"This book, now in its second edition, is the premier resource to learn SYCL 2020 and is the ONLY book you need to become part of this community." (Erik Lindahl, GROMACS and Stockholm University) Learn how to accelerate C++ programs using data parallelism and SYCL. This open access book enables C++ programmers to be at the forefront of this exciting and important development that is helping to push computing to new levels. This updated second edition is full of practical advice, detailed explanations, and code examples to illustrate key topics. SYCL enables access to parallel resources in modern accelerated heterogeneous systems. Now, a single C++ application can use any combination of devices, including GPUs, CPUs, FPGAs, and ASICs, that are suitable to the problems at hand. This book teaches data-parallel programming using C++ with SYCL and walks through everything needed to program accelerated systems. The book begins by introducing data parallelism and foundational topics for effective use of SYCL. Later chapters cover advanced topics, including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. All source code for the examples used in this book is freely available on GitHub. The examples are written in modern SYCL and are regularly updated to ensure compatibility with multiple compilers.
What You Will Learn:
•Accelerate C++ programs using data-parallel programming
•Use SYCL and C++ compilers that support SYCL
•Write portable code for accelerators that is vendor- and device-agnostic
•Optimize code to improve performance for specific accelerators
•Be poised to benefit as new accelerators appear from many vendors
Who This Book Is For: Those new to data-parallel programming, and computer programmers interested in data-parallel programming using C++. This is an open access book.
The real-valued fast Fourier transform (RFFT) is an ideal candidate for implementing a high-speed and low-power FFT processor because it requires only approximately half the number of arithmetic operations of the traditional complex-valued FFT (CFFT). Although an RFFT can be calculated using CFFT hardware, a dedicated RFFT implementation can reduce hardware complexity and power consumption and increase throughput. However, unlike the CFFT, the RFFT has irregular signal flow graphs, which hinders the design of efficient pipelined architectures. In this paper, utilizing the Open Computing Language (OpenCL), we propose a high-level programming method for implementing pipelined RFFT architectures on FPGAs. By identifying the regular computational pattern in the flow graph of the RFFT, the proposed method essentially uses a for loop to implement the RFFT algorithm; with the help of high-level synthesis tools, the loop is then fully unrolled to automatically build pipelined architectures. Experiments show that for a 4096-point RFFT, the proposed method achieves a 2.49x speedup and 3.09x better energy efficiency over cuFFT on a GPU, and a 21.12x speedup and 16.09x better energy efficiency over FFTW on a CPU. Compared to Intel's CFFT design on the same FPGA, the proposed one uses 12% fewer logic resources and 16% fewer DSP blocks, while achieving a 1.48x speedup.
Nonrigid image registration is an important but time-consuming task in medical image analysis. In typical neuroimaging studies, multiple image registrations are performed, e.g., for atlas-based segmentation or template construction. Faster image registration routines would therefore be beneficial. In this paper we explore acceleration of the image registration package elastix by a combination of several techniques: (i) parallelization on the CPU, to speed up the cost function derivative calculation; (ii) parallelization on the GPU, building on and extending the OpenCL framework from ITKv4, to speed up the Gaussian pyramid computation and the image resampling step; (iii) exploitation of certain properties of the B-spline transformation model; and (iv) further software optimizations. The accelerated registration tool is employed in a study on diagnostic classification of Alzheimer's disease versus cognitively normal controls based on T1-weighted MRI. We selected 299 participants from the publicly available Alzheimer's Disease Neuroimaging Initiative database. Classification is performed with a support vector machine based on gray matter volumes as a marker for atrophy. We evaluated two types of strategies (voxel-wise and region-wise) that heavily rely on nonrigid image registration. Parallelization and optimization resulted in an acceleration factor of 4-5x on an 8-core machine. Using OpenCL, a speedup factor of 2 was realized for computation of the Gaussian pyramids, and of 15-60 for the resampling step, for larger images. The voxel-wise and region-wise classification methods had an area under the receiver operating characteristic curve of 88% and 90%, respectively, both for standard and accelerated registration. We conclude that the image registration package elastix was substantially accelerated, with nearly identical results to the non-optimized version. The new functionality will become available in the next release of elastix as open source under the BSD license.