This paper presents a lightweight, open-source and high-performance Python package for solving peridynamics problems in solid mechanics. The development of this solver is motivated by the need for fast analysis tools to achieve the large number of simulations required for ‘outer-loop’ applications, including sensitivity analysis, uncertainty quantification and optimisation. Our Python software toolbox utilises the heterogeneous nature of OpenCL so that it can be executed on any platform with CPU or GPU cores. We illustrate the package use through a range of industrially motivated examples, which should enable other researchers to build on and extend the solver for use in their own applications. Step improvements in execution speed and functionality over existing techniques are presented. A comparison between this solver and an existing OpenCL implementation in the literature is presented, tested on benchmarks with hundreds of thousands to tens of millions of nodes. We demonstrate the scalability of the solver on the GeForce RTX 2080 Ti GPU from NVIDIA, and the memory-bound limitations are analysed. In all test cases, the implementation is between 1.4 and 10.0 times faster than a similar existing GPU implementation in the literature. In particular, this improvement has been achieved by utilising local memory on the GPU.
•Tested, lightweight, open-source Python peridynamics solver with a simple interface.
•OpenCL GPU-accelerated peridynamics solver 1.4–10 times faster than existing implementations.
•Outer-loop optimisation of model parameters for modelling concrete behaviour.
•Sensitivity analysis, uncertainty quantification and optimisation made possible.
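A minimal PyOpenCL sketch of the two points emphasised above, namely that the same OpenCL kernel runs on whatever CPU or GPU device is available, and that work-group local memory can be used to reduce global-memory traffic. This is an illustrative work-group reduction, not the API of the peridynamics package itself.

# Illustrative only: device-portable kernel using __local (work-group) memory.
import numpy as np
import pyopencl as cl

src = """
__kernel void group_sum(__global const float *x,
                        __global float *partial,
                        __local  float *scratch) {
    int lid = get_local_id(0);
    scratch[lid] = x[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    // tree reduction within the work-group, entirely in local memory
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) partial[get_group_id(0)] = scratch[0];
}
"""

ctx = cl.create_some_context()              # picks any available CPU or GPU device
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, src).build()

local_size = 64
x = np.random.rand(1024).astype(np.float32)
partial = np.zeros(x.size // local_size, dtype=np.float32)

mf = cl.mem_flags
x_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
p_g = cl.Buffer(ctx, mf.WRITE_ONLY, partial.nbytes)

prog.group_sum(queue, x.shape, (local_size,),
               x_g, p_g, cl.LocalMemory(4 * local_size))
cl.enqueue_copy(queue, partial, p_g)
assert np.isclose(partial.sum(), x.sum(), rtol=1e-4)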
The real-valued fast Fourier transform (RFFT) is an ideal candidate for implementing a high-speed and low-power FFT processor because it requires approximately half the number of arithmetic operations of the traditional complex-valued FFT (CFFT). Although an RFFT can be calculated using CFFT hardware, a dedicated RFFT implementation can reduce hardware complexity and power consumption and increase throughput. However, unlike the CFFT, the RFFT has irregular signal flow graphs, which hinders the design of efficient pipelined architectures. In this paper, utilizing the Open Computing Language (OpenCL), we propose a high-level programming method for implementing pipelined RFFT architectures on FPGAs. By identifying the regular computational pattern in the flow graph of the RFFT, the proposed method essentially uses a for loop to implement the RFFT algorithm; with the help of high-level synthesis tools, the loop is then fully unrolled to automatically build pipelined architectures. Experiments show that for a 4096-point RFFT, the proposed method achieves a 2.49x speedup and 3.09x better energy efficiency over CUFFT on a GPU, and a 21.12x speedup and 16.09x better energy efficiency over FFTW on a CPU. Compared to Intel's CFFT design on the same FPGA, the proposed design reduces logic resources by 12% and DSP blocks by 16%, while achieving a 1.48x speedup.
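A quick NumPy illustration of why the RFFT needs roughly half the work of the CFFT: for real input of length N the spectrum is conjugate-symmetric, so only N//2 + 1 unique complex outputs have to be computed. This is a generic sketch of the underlying symmetry, not the paper's FPGA/OpenCL pipeline.

import numpy as np

N = 4096
x = np.random.rand(N)            # real-valued input signal

X_full = np.fft.fft(x)           # complex FFT: N complex outputs
X_real = np.fft.rfft(x)          # real FFT: only N//2 + 1 complex outputs

print(len(X_full), len(X_real))  # 4096 2049
# The discarded half is redundant: X_full[N - k] == conj(X_full[k])
assert np.allclose(X_full[: N // 2 + 1], X_real)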
•The Sigmoid algorithm applies the well-known sigmoid function to load balancing.
•The algorithm uses all the resources of a system to their fullest, requiring no effort.
•It adapts to different hardware configurations and application behaviours.
•Sigmoid successfully scales to adapt to the systems of the future.
A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of heterogeneous systems. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided and effortless, requiring no parameters from the user and adapting to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90% together with energy savings of up to 20%, while always reducing programming effort compared to OpenCL and facilitating portability to other heterogeneous machines.
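A toy sketch of the proportional work splitting described above: each device receives a share of the kernel's iteration space proportional to its measured throughput. The function names and the logistic helper are illustrative assumptions, not Sigmoid's actual implementation.

import math

def split_work(total_items, throughputs):
    """Return per-device work-item counts proportional to measured throughput."""
    weights = [t / sum(throughputs) for t in throughputs]
    counts = [int(total_items * w) for w in weights]
    counts[-1] += total_items - sum(counts)   # give any remainder to one device
    return counts

def logistic(x, k=1.0, x0=0.0):
    """Sigmoid curve that could smooth chunk-size adaptation between steps (assumption)."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

# e.g. a GPU measured at 900 items/ms and a CPU at 100 items/ms
print(split_work(1_000_000, [900.0, 100.0]))  # [900000, 100000]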
•Most of the past and present work on GPU accelerated medical imaging is reviewed.
•Basic operations, common algorithms and modality specific applications are included.
•Registration and segmentation algorithms and the CT modality dominate the GPU usage.
•The number of publications has clearly increased since the release of CUDA.
•Future possibilities and challenges of GPU-based medical imaging are discussed.
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
A strategy to improve the performance and reduce the memory footprint of simulations on meshes with spatial reflection symmetries is presented in this work. By using an appropriate mirrored ordering of the unknowns, discrete partial differential operators are represented by matrices with a regular block structure that allows replacing the standard sparse matrix–vector product with a specialised version of the sparse matrix-matrix product, which has a significantly higher arithmetic intensity. Consequently, matrix multiplications are accelerated, whereas their memory footprint is reduced, making massive simulations more affordable. As an example of practical application, we consider the numerical simulation of turbulent incompressible flows using a low-dissipation discretisation on unstructured collocated grids. All the required matrices are classified into three sparsity patterns that correspond to the discrete Laplacian, gradient, and divergence operators. Therefore, the above-mentioned benefits of exploiting spatial reflection symmetries are tested for these three matrices on both CPU and GPU, showing up to 5.0x speed-ups and 8.0x memory savings. Finally, a roofline performance analysis of the symmetry-aware sparse matrix–vector product is presented.
•Strategy to accelerate CFD simulations on meshes with spatial reflection symmetries.
•Replacement of SpMV with a specialised version of the more compute-intensive SpMM.
•Implementation of a lighter sparse matrix storage format accounting for symmetries.
•Hierarchical multilevel MPI+OpenMP+OpenCL/CUDA parallelisation.
•Numerical tests on CPUs and GPUs show up to 5x speed-ups and 8x memory savings.
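A sketch of the symmetry idea described above, assuming a single reflection symmetry: with a mirrored ordering the full operator takes the block form A = [[B, C], [C, B]], so the matrix–vector product y = A x can be recast as sparse matrix–(dense) matrix products on the half-size blocks B and C, each reused for both halves of the mesh. The exact block structure used in the paper may differ; this is purely illustrative.

import numpy as np
import scipy.sparse as sp

n = 1000                                            # unknowns per symmetric half
B = sp.random(n, n, density=0.01, format="csr")     # intra-half coupling
C = sp.random(n, n, density=0.01, format="csr")     # cross-half coupling
A = sp.bmat([[B, C], [C, B]], format="csr")         # full mirrored operator

x = np.random.rand(2 * n)
x1, x2 = x[:n], x[n:]

y_ref = A @ x                                       # standard SpMV on the full matrix

# Symmetry-aware version: one SpMM per stored block, higher arithmetic intensity
X = np.column_stack([x1, x2])                       # pack the two halves as columns
Y = B @ X + C @ X[:, ::-1]
y_sym = np.concatenate([Y[:, 0], Y[:, 1]])

assert np.allclose(y_ref, y_sym)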
► We introduce GPU run-time code generation (RTCG) and discuss its usefulness.
► We present two open-source toolkits, PyCUDA and PyOpenCL, that make GPU RTCG possible.
► High-level scripting code is complementary to high-performance GPU code.
► C-level GPU RTCG is shown to be a good building block for higher-level abstractions.
► Successful applications support the usefulness of the approach.
High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
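A minimal PyOpenCL sketch in the spirit of RTCG as described above: the kernel source is assembled as a Python string with a parameter baked in at run time, then compiled and launched. The kernel and helper names are illustrative, not examples taken from the article.

import numpy as np
import pyopencl as cl

def make_saxpy_kernel(ctx, alpha):
    # The constant 'alpha' is specialised into the kernel text before compilation.
    src = f"""
    __kernel void saxpy(__global const float *x, __global float *y) {{
        int i = get_global_id(0);
        y[i] += {alpha}f * x[i];
    }}
    """
    return cl.Program(ctx, src).build().saxpy

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

x = np.linspace(0, 1, 1024).astype(np.float32)
y = np.zeros_like(x)
mf = cl.mem_flags
x_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

saxpy = make_saxpy_kernel(ctx, alpha=3.0)   # code generated for this alpha
saxpy(queue, x.shape, None, x_g, y_g)
cl.enqueue_copy(queue, y, y_g)
assert np.allclose(y, 3.0 * x)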
Nowadays, CFD approaches for modeling turbulent airflow and particulate matter (PM) concentration distributions are mature, although they all suffer from heavy computational demands. In particular, for indoor PM concentration modeling, the novel cellular automata (CA) approach developed in “Modeling particulate matter concentration in indoor environment with cellular automata framework” achieves almost the same accuracy as the Eulerian approach with improved efficiency. To further enhance its efficiency, this study proposes two parallelization procedures. With the mechanism parallelization, the four PM transport mechanisms (flow advection, turbulent diffusion, gravitational settling, and boundary deposition) are simulated simultaneously instead of sequentially. In addition, with the GPU-based cell parallelization, implemented with OpenCL 2.1 under Nvidia CUDA, all the PM transport mechanisms are executed in parallel on the GPU instead of sequentially on the CPU. Three parallelized CA scenarios, i.e., the parallelized CA approach with only the mechanism parallelization, with only the GPU-based cell parallelization, and with both parallelization procedures, are evaluated through two indoor PM concentration experiments. The three parallelized CA scenarios are found to maintain accuracy while enhancing efficiency by 174%–210%, 1780%–5730%, and 2427%–7695%, respectively. Thus, the GPU-based cell parallelization yields a greater efficiency enhancement than the mechanism parallelization. Furthermore, even though the simulations are performed on an i9 PC with an Intel UHD Graphics 630 graphics card, the parallelized CA approach with both parallelization procedures enhances efficiency by up to 24–77 times, demonstrating its considerable potential as a useful tool for real-time 3D indoor PM distribution modeling.
•We parallelize the CA-based indoor PM modeling approach.
•The mechanism parallelization runs the PM mechanisms in parallel instead of serially.
•The GPU-based cell parallelization executes each loop of a PM mechanism on GPUs.
•The parallelized CA approach is as accurate as the Eulerian drift-flux models.
•The parallelized CA approach achieves an efficiency enhancement of up to 22–77 times.
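A toy sketch of the mechanism parallelization idea: because each transport mechanism's increment can be computed from the same previous concentration field, the four updates are independent and can run concurrently before being combined. The mechanism functions below are placeholders, not the paper's CA rules.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def advection(c):  return -0.1 * (c - np.roll(c, 1, axis=0))                              # placeholder
def diffusion(c):  return 0.05 * (np.roll(c, 1, axis=1) + np.roll(c, -1, axis=1) - 2 * c)  # placeholder
def settling(c):   return -0.02 * c                                                        # placeholder
def deposition(c): return -0.01 * c                                                        # placeholder

def step(c):
    mechanisms = (advection, diffusion, settling, deposition)
    with ThreadPoolExecutor(max_workers=4) as pool:
        increments = list(pool.map(lambda m: m(c), mechanisms))  # computed concurrently
    return c + sum(increments)                                   # combined afterwards

c = np.random.rand(64, 64, 16)   # 3D indoor concentration field (arbitrary size)
c = step(c)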
The increasing main memory capacity and the explosion of big data have fueled the development of in-memory big data management and processing. By offering an efficient in-memory parallel execution model that can eliminate the disk I/O bottleneck, existing in-memory cluster computing platforms (e.g., Flink and Spark) have already proven to be outstanding platforms for big data processing. However, these platforms are merely CPU-based systems. This paper proposes GFlink, an in-memory computing architecture on heterogeneous CPU-GPU clusters for big data. Our proposed architecture extends the original Flink from CPU clusters to heterogeneous CPU-GPU clusters, greatly improving the computational power of Flink. Furthermore, we propose a programming framework based on Flink's abstract model, i.e., the DataSet (DST), hiding the programming complexity of GPUs behind simple and familiar high-level interfaces. To achieve high performance and good load balance, an efficient JVM-GPU communication strategy, a GPU cache scheme, and an adaptive locality-aware scheduling scheme for three-stage pipelined execution are proposed. Extensive experimental results indicate that the high computational power of GPUs can be efficiently utilized, and that the implementation on GFlink outperforms that on the original CPU-based Flink.
General-purpose computing on graphics processing units (GPGPU) is increasingly used for number-crunching tasks such as analyzing time series data. GPUs are a good fit for these tasks as they can execute many computations in parallel. To leverage this parallelism, the programmer is forced to carefully divide their input data into data blocks that are then distributed over the many GPU cores. The optimal block sizes are unrelated to the programmer's goals; instead, they depend on characteristics of the GPU used and of the input data. GPGPU programmers must additionally be wary of introducing race conditions in their programs.
We believe that GPGPU programmers should be able to express GPU transformations without worrying about splitting data or race conditions. For this, we created Gaiwan, a GPGPU programming language with a size-polymorphic type system that only features data race free operations. Programmers can declare the effects of program steps on the sizes of buffers by using affine functions (e.g. ▪). From a step sequence, Gaiwan derives a set of constraints on the size and shape of valid inputs. Gaiwan guarantees that the program will run for any input satisfying these constraints. This means that one program may analyze both a hundred data points and millions of data points, as long as the input satisfies the constraints.
We prove that our system is sound and show it works with two usage examples. Our benchmarks show that our initial OpenCL-based implementation of Gaiwan scales to handling large programs.
•A novel constraint-based size-polymorphic type system.
•A novel data-race free programming language for GPUs.
•A prototype implementation of our language called Gaiwan.
•Initial benchmarks to show the effectiveness of the approach.
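A toy sketch of the size-typing idea described above: each program step's effect on the buffer length is modelled as an affine function a*n + b, and composing the steps yields both the output size and a constraint the input size must satisfy. This mirrors the idea only; it is not Gaiwan's actual syntax or type system, and the class and function names are hypothetical.

from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class AffineSize:
    a: Fraction          # multiplicative effect on the input length n
    b: Fraction          # additive effect

    def apply(self, n):
        m = self.a * n + self.b
        if m.denominator != 1 or m < 0:
            raise ValueError(f"input size {n} violates the derived constraint")
        return int(m)

def compose(steps, n):
    """Run the size functions of a step sequence, checking validity at each step."""
    for step in steps:
        n = step.apply(n)
    return n

# e.g. pair up elements (n -> n/2), then append a summary element (m -> m + 1)
pipeline = [AffineSize(Fraction(1, 2), Fraction(0)), AffineSize(Fraction(1), Fraction(1))]
print(compose(pipeline, 100))   # 51: valid for any even input size
# compose(pipeline, 101)        # would raise: 101 is not divisible by 2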