The combined impact of new computing resources and techniques with an increasing avalanche of large datasets is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In recent years, Machine Learning, and especially its subfield Deep Learning, has seen impressive advances. Techniques developed within these two fields are now able to analyze and learn from huge amounts of real-world examples in disparate formats. While the number of Machine Learning algorithms is extensive and growing, so is the number of frameworks and libraries that implement them. Software development in this field is fast paced, with a large amount of open-source software coming from academia, industry, start-ups, and wider open-source communities. This survey presents a comprehensive overview of a recent time slice of this landscape, with comparisons of, and trends in, the development and usage of cutting-edge Artificial Intelligence software. It also provides an overview of massive parallelism support capable of scaling computation effectively and efficiently in the era of Big Data.
The efficient evaluation of Sommerfeld integrals (SIs) in planar layered media has been a long-standing bottleneck in the accurate electromagnetic analysis of modern radio-frequency circuits, chips, and devices. This work investigates the high-performance computing of SIs on modern graphics processing units (GPUs) to alleviate this difficulty. Based on a numerical integration procedure with controllable accuracy for SIs, GPU parallel schemes for SI heads and SI tails are first presented. By eliminating the redundant calculations shared across SIs at multiple frequencies, a highly efficient parallel scheme enhanced by the tensor cores of the GPU is developed, which evaluates the multiple-frequency SIs simultaneously. In addition, mixed-precision computing, which provides further acceleration, is studied and tested. Extensive numerical experiments carried out on two commercial gaming GPUs verify the performance of the proposed parallel scheme: it achieves speedups ranging from a dozen to several hundred times over two high-end CPUs with full OpenMP parallelization.
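The multi-frequency reuse idea can be illustrated with a small Python sketch (not the paper's implementation). On a fixed quadrature grid, the Bessel factor and the quadrature weights depend only on the spectral variable kρ and the radial distance ρ, so they can be precomputed once; evaluating the integral for many frequencies then collapses into a single matrix-vector product, the GEMM-like shape that maps onto tensor cores. The spectral kernel F below is a toy placeholder for the actual layered-medium spectral Green's function, and the finite head interval stands in for the full head/tail partition:

```python
import numpy as np
from scipy.special import j0

# Gauss-Legendre quadrature on a finite head interval [0, k_max]
# (real SI tails need partition/extrapolation schemes; this is a sketch).
n_quad, k_max, rho = 256, 50.0, 0.3
nodes, weights = np.polynomial.legendre.leggauss(n_quad)
k = 0.5 * k_max * (nodes + 1.0)          # map [-1, 1] -> [0, k_max]
w = 0.5 * k_max * weights

# Frequency-independent part: computed once, reused for every frequency.
bessel_part = w * j0(k * rho) * k        # weight * J0(k*rho) * k

# Toy frequency-dependent spectral kernel F(k; omega) -- a placeholder.
omegas = np.linspace(1e9, 10e9, 64)
F = np.exp(-k[None, :] / (1.0 + omegas[:, None] * 1e-10))

# All frequencies at once: one matrix-vector product (GEMM-friendly shape).
integrals = F @ bessel_part              # shape (n_freq,)
```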
Recent years have witnessed phenomenal growth in the computational capabilities and applications of GPUs. However, this trend has also led to a dramatic increase in their power consumption. This article surveys research on analyzing and improving the energy efficiency of GPUs and classifies these techniques on the basis of their main research idea. Further, it synthesizes studies that compare the energy efficiency of GPUs with that of other computing systems (e.g., FPGAs and CPUs). The aim of this survey is to provide researchers with knowledge of the state of the art in GPU power management and to motivate them to architect the highly energy-efficient GPUs of tomorrow.
Oil spills have adverse effects on the environment and economy. Near-real-time detection and response enable better management of the resources required at the incident area for clean-up and control operations. Multi-temporal remote sensing (RS) technologies are widely used to detect and monitor oil spills on the ocean surface. However, current techniques using RS data for oil spill detection are time consuming and expensive in terms of computational cost and related infrastructure. The main focus of this work is oil spill detection from voluminous multi-temporal LANDSAT-7 imagery using high-performance computing technologies, namely graphics processing units (GPUs) and the Message Passing Interface (MPI), to speed up the detection process and provide rapid response. A Kepler-architecture GPU (Tesla K40) with the Compute Unified Device Architecture (CUDA), a parallel programming model for GPUs, is used in the development of the detection algorithms. The oil spill detection techniques adapted to GPU-based processing are a band ratio and a morphological attribute profile (MAP) based on six structural and shape-description attributes, namely gray mean, standard deviation, elongation, shape complexity, solidity, and orientation. Experimental results show significant gains in the computational speed of these techniques when implemented on a GPU and with MPI. A GPU-vs.-CPU comparison shows that the proposed approach achieves a speedup of around 10× for the MAP and 14× for the band-ratio approach, including the data transfer cost. However, the MPI implementation using 64 cores outperforms the GPU and executes the time-intensive task of computing the above attributes in only 18 minutes, whereas the GPU consumes around an hour.
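As an illustration of the band-ratio stage, a minimal Numba CUDA kernel can threshold the per-pixel ratio of two bands, one thread per pixel. This is a hedged sketch, not the paper's code: the band pairing, threshold value, and array sizes below are placeholder assumptions, and the MAP attributes would be computed in separate kernels:

```python
import numpy as np
from numba import cuda

@cuda.jit
def band_ratio_kernel(band_a, band_b, threshold, mask):
    # One thread per pixel; flag pixels whose band ratio exceeds the threshold.
    i, j = cuda.grid(2)
    if i < band_a.shape[0] and j < band_a.shape[1]:
        denom = band_b[i, j]
        if denom != 0.0 and band_a[i, j] / denom > threshold:
            mask[i, j] = 1
        else:
            mask[i, j] = 0

# Hypothetical usage on two reflectance bands (data and threshold are toy).
h, w = 2048, 2048
band_a = np.random.rand(h, w).astype(np.float32)
band_b = np.random.rand(h, w).astype(np.float32) + 0.1
mask = np.zeros((h, w), dtype=np.uint8)
threads = (16, 16)
blocks = ((h + 15) // 16, (w + 15) // 16)
band_ratio_kernel[blocks, threads](band_a, band_b, np.float32(1.2), mask)
```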
To develop and validate a graphics processing unit (GPU) based superposition Monte Carlo (SMC) code for efficient and accurate dose calculation in magnetic fields.
A series of mono-energy photons ranging from 25 keV to 7.7 MeV was simulated with EGSnrc in a water phantom to generate a particle track database. The SMC physics was extended with charged particle transport in magnetic fields and subsequently programmed on the GPU as gSMC. An optimized simulation scheme was designed by combining variance reduction techniques to relieve the thread divergence issue common to GPU-MC codes and to improve calculation efficiency. The gSMC code's dose calculation accuracy and efficiency were assessed through both phantom and patient cases.
gSMC accurately calculated the dose in various phantoms for both B = 0 T and B = 1.5 T, and it matched EGSnrc well, with a root mean square error of less than 1.0% over the entire depth-dose region. Patient case validation also showed high dose agreement with EGSnrc, with 3D gamma passing rates (2%/2 mm) larger than 97% for all tested tumor sites. Combined with photon splitting and particle-track repeating techniques, gSMC resolved the thread divergence issue and showed an efficiency gain of 186-304× relative to EGSnrc with 10 CPU threads.
A GPU-superposition Monte Carlo code called gSMC was developed and validated for dose calculation in magnetic fields. The developed code's high calculation accuracy and efficiency make it suitable for dose calculation tasks in online adaptive radiotherapy with MR-LINAC.
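The magnetic-field extension of the transport can be sketched in a few lines of Python: between interactions, the direction of a repeated track step is rotated by the Lorentz deflection accumulated over the step, using the gyroradius r = p_perp / (qB). The snippet below is a simplified geometric illustration under stated assumptions (energy loss within the step is ignored, and the sign convention is schematic); it is not the gSMC transport code:

```python
import numpy as np

E_CHARGE = 1.602176634e-19   # elementary charge, C
C_LIGHT = 2.99792458e8       # speed of light, m/s

def deflect_step(direction, step_len, p_mev_c, charge, B=(0.0, 0.0, 1.5)):
    """Rotate a unit direction vector around B by the gyration angle
    accumulated over one step of length step_len (meters).
    charge is in units of e; p_mev_c is the momentum in MeV/c."""
    d = np.asarray(direction, dtype=float)
    B = np.asarray(B, dtype=float)
    b_mag = np.linalg.norm(B)
    if b_mag == 0.0:
        return d                          # B = 0 T: straight step
    b_hat = B / b_mag
    p_si = p_mev_c * 1e6 * E_CHARGE / C_LIGHT        # momentum in kg*m/s
    p_perp = p_si * np.linalg.norm(np.cross(d, b_hat))
    if p_perp == 0.0:
        return d                          # moving along B: no deflection
    r_gyro = p_perp / (abs(charge) * E_CHARGE * b_mag)
    theta = -np.sign(charge) * step_len / r_gyro     # deflection angle
    # Rodrigues rotation of d about the axis b_hat.
    return (d * np.cos(theta) + np.cross(b_hat, d) * np.sin(theta)
            + b_hat * np.dot(b_hat, d) * (1.0 - np.cos(theta)))

# Example: ~1 MeV/c electron, 1 mm step in a 1.5 T field.
new_dir = deflect_step([1.0, 0.0, 0.0], step_len=1e-3, p_mev_c=1.0, charge=-1)
```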
Computer vision and image processing algorithms form essential components of many industrial, medical, commercial, and research applications. Modern imaging systems provide high-resolution images at high frame rates and are often required to perform complex computations to process the image data. However, many applications require rapid processing, or it is important to minimise delays before analysis results are available. In these applications, central processing units (CPUs) are inadequate, as they cannot perform the calculations with sufficient speed. To reduce the computation time, algorithms can be implemented in hardware accelerators such as digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and graphics processing units (GPUs). However, the selection of a suitable hardware accelerator for a specific application is challenging. Numerous families of DSPs, FPGAs, and GPUs are available, and the technical differences between them make comparisons difficult. It is also important to know what speed can be achieved using a specific hardware accelerator for a particular algorithm, as the choice of hardware accelerator may depend on both the algorithm and the application. The technical details of hardware accelerators and their performance have been discussed in previous publications. However, many of these presentations have limitations, including inadequate technical detail for selecting a suitable hardware accelerator, comparisons of hardware accelerators at two different technological levels, and discussion of outdated technologies.
To address these issues, we introduce and discuss important considerations when selecting suitable hardware accelerators for computer vision and image processing tasks, and present a comprehensive review of hardware accelerators. We discuss the practical details of chip architectures, available tools and utilities, development time, and the relative advantages and disadvantages of using DSPs, FPGAs, and GPUs. We provide practical information about state-of-the-art DSPs, FPGAs, and GPUs as well as examples from the literature. Our goal is to enable developers to make a comprehensive comparison between various hardware accelerators, and to select a hardware accelerator that is most suitable for their specific application.
• Important considerations when selecting hardware accelerators are discussed.
• Practical information about state-of-the-art DSPs, FPGAs, and GPUs is presented.
• Relative advantages and disadvantages of DSPs, FPGAs, and GPUs are explained.
• Several recent examples from the literature are reviewed and compared.
We describe the underlying mathematics, validation, and applications of a novel Helmholtz free-energy-minimizing phase-field model solved within the framework of the lattice Boltzmann method (LBM) for efficiently simulating two-phase pore-scale flow directly on large 3D images of real rocks obtained from micro-computed tomography (micro-CT) scanning. The code implementation of the technique, coined the eLBM (energy-based LBM), is written in the CUDA programming language to take maximum advantage of accelerated computing on multinode general-purpose graphics processing units (GPGPUs). eLBM's momentum-balance solver is based on the multiple-relaxation-time (MRT) model. The Boltzmann equation is discretized in space, velocity (momentum), and time using a 3D 19-velocity grid (the D3Q19 scheme), which provides the best compromise between accuracy and computational efficiency. The benefits of the MRT model over the conventional single-relaxation-time Bhatnagar-Gross-Krook (BGK) model are (I) enhanced numerical stability, (II) independent bulk and shear viscosities, and (III) viscosity-independent, no-slip boundary conditions. The drawback of the MRT model is that it is slightly more computationally demanding than the BGK model; this minor hurdle is overcome through the GPGPU implementation of the MRT model in eLBM. eLBM is, to our knowledge, the first industrial-grade distributed-parallel implementation of an energy-based LBM taking advantage of multiple GPGPU nodes. The Cahn-Hilliard equation that governs the order-parameter distribution is fully integrated into the LBM framework, which significantly accelerates pore-scale simulation on real systems. While individual components of the eLBM simulator can be found separately in various references, our novel contributions are (1) integrating all computational and high-performance-computing components into a unified implementation and (2) providing comprehensive and definitive quantitative validation of eLBM's robustness and accuracy for a variety of flow domains, including various types of real rock images. We successfully validate and apply the eLBM on several transient two-phase flow problems of gradually increasing complexity: (1) snap-off in constricted capillary tubes; (2) Haines jumps in a micromodel (during drainage), a Ketton limestone image, and Fontainebleau and Castlegate sandstone images (during drainage and subsequent imbibition); and (3) capillary desaturation on a Berea sandstone image, including a comparison of the numerically computed residual non-wetting-phase saturations (as a function of the capillary number) with data reported in the literature. Extensive physical validation tests and applications on large 3D rock images demonstrate the reliability, robustness, and efficacy of the eLBM as a direct visco-capillary pore-scale two-phase flow simulator for digital rock physics workflows.
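For readers unfamiliar with the D3Q19 scheme, the NumPy sketch below sets up the 19 lattice velocities and weights and performs stream-and-collide steps on a periodic box. It uses the simpler BGK collision for brevity; the MRT collision used by eLBM instead relaxes the moments m = M f with a diagonal rate matrix S, i.e. f ← f − M⁻¹ S (m − m_eq). The box size, relaxation time, and initialization are illustrative assumptions:

```python
import numpy as np

# D3Q19 lattice: 1 rest velocity, 6 axis directions, 12 face diagonals.
e = np.array([[0,0,0],
              [1,0,0],[-1,0,0],[0,1,0],[0,-1,0],[0,0,1],[0,0,-1],
              [1,1,0],[-1,-1,0],[1,-1,0],[-1,1,0],
              [1,0,1],[-1,0,-1],[1,0,-1],[-1,0,1],
              [0,1,1],[0,-1,-1],[0,1,-1],[0,-1,1]])
w = np.array([1/3] + [1/18]*6 + [1/36]*12)

def equilibrium(rho, u):
    """Second-order Maxwell-Boltzmann equilibrium on the D3Q19 lattice."""
    eu = np.einsum('qd,dxyz->qxyz', e, u)           # e_i . u
    uu = np.einsum('dxyz,dxyz->xyz', u, u)          # u . u
    return w[:, None, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*uu)

def stream_and_collide(f, tau=0.8):
    """One LBM step with BGK collision (MRT would relax moments instead)."""
    rho = f.sum(axis=0)
    u = np.einsum('qd,qxyz->dxyz', e, f) / rho
    f = f + (equilibrium(rho, u) - f) / tau         # collide
    for q in range(19):                             # stream (periodic box)
        f[q] = np.roll(f[q], shift=e[q], axis=(0, 1, 2))
    return f

# Toy run: small periodic box initialized at rest with unit density.
nx = ny = nz = 16
f = np.ones((19, nx, ny, nz)) * w[:, None, None, None]
for _ in range(10):
    f = stream_and_collide(f)
```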
Landslide-induced tsunamis are a complex fluid-solid coupling process that plays a crucial role in the study of the disaster chain. To simulate the coupling behavior between fluid and solid, a graphics processing unit (GPU)-based coupled smoothed particle hydrodynamics (SPH)-discrete element method (DEM) code is developed. A series of numerical tests, based on the laboratory experiments of Koshizuka et al. (Particle method for calculating splashing of incompressible viscous fluid, 1995) and Kleefsman et al. (J Comput Phys 206:363-393, 2005), is carried out to study the influence of the parameters and to verify the accuracy of the developed SPH code. To ensure accurate SPH results, values for the diffusion term, the particle resolution (1/25 of the characteristic length), and the smoothing length (1.2 times the particle interval) are suggested. The ratio of the SPH particle size to the DEM particle diameter influences the accuracy of the coupled simulation of solid particles and water. For the coupled simulation of a single particle or a loose particle assembly (particles not in contact with each other) with fluid, this ratio should be smaller than 1/20; for a dense particle assembly, a ratio smaller than 1/6 is sufficient.
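The suggested discretization can be made concrete with a short sketch: using the common cubic-spline smoothing kernel (an assumption, since the abstract does not name the kernel) with smoothing length h = 1.2 × the particle interval, the SPH density at each particle is a kernel-weighted sum over its neighbors:

```python
import numpy as np

def cubic_spline_w(r, h):
    """Standard 3D cubic-spline SPH kernel (support radius 2h)."""
    q = r / h
    sigma = 1.0 / (np.pi * h**3)
    w = np.where(q < 1.0, 1.0 - 1.5*q**2 + 0.75*q**3,
        np.where(q < 2.0, 0.25*(2.0 - q)**3, 0.0))
    return sigma * w

def sph_density(positions, masses, h):
    """Brute-force density summation (real codes use neighbor/cell lists)."""
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    return (masses[None, :] * cubic_spline_w(r, h)).sum(axis=1)

dx = 0.004                     # particle interval, e.g. 1/25 of a 0.1 m length
h = 1.2 * dx                   # smoothing length suggested in the paper
grid = np.mgrid[0:10, 0:10, 0:10].reshape(3, -1).T * dx
masses = np.full(len(grid), 1000.0 * dx**3)   # water, rho = 1000 kg/m^3
rho = sph_density(grid, masses, h)
```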
• We convert multi-label classification into N binary classification tasks.
• We modify ResNet by introducing adaptive dropout.
• We evaluate the presented model on the NIH Chest X-ray dataset.
Recent advances in high-performance computing techniques such as graphics processing units (GPUs) enable large-scale deep learning models for medical image analytics in smart medicine. Smart medicine has made great progress by applying convolutional neural networks (CNNs) such as ResNet and VGG-16 to medical image classification. However, these CNN models achieve very limited accuracy in cases where multiple diseases are revealed in a single X-ray image. This paper presents a variant ResNet model that replaces global average pooling with adaptive dropout for medical image classification. To enable the presented model to recognize multiple diseases (i.e., multi-label classification), we convert the multi-label classification into N binary classification tasks by training the parameters of the presented model N times. Finally, experiments are conducted on a GPU cluster to evaluate the presented model on three datasets: the Montgomery County chest X-ray set, the Shenzhen X-ray set, and the NIH chest X-ray set. The results show that the presented model achieves a substantial performance improvement for medical image classification without a significant efficiency reduction compared to the traditional architecture and VGG-16.
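The N-binary-classification conversion can be sketched in PyTorch. This is a hedged illustration: the ResNet-18 backbone, optimizer, and training details below are stand-ins, and the paper's adaptive-dropout layer, which replaces global average pooling, is not reproduced here:

```python
import torch
import torch.nn as nn
from torchvision import models

def make_binary_classifier():
    """One ResNet with a single-logit head; trained once per disease label."""
    net = models.resnet18(weights=None)        # stand-in for the paper's ResNet
    net.fc = nn.Linear(net.fc.in_features, 1)  # binary head for one label
    return net

def train_one_label(net, loader, epochs=1, lr=1e-3, device="cuda"):
    """Train one binary model; y holds 0/1 targets for a single label."""
    net = net.to(device)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(net(x.to(device)).squeeze(1), y.float().to(device))
            loss.backward()
            opt.step()
    return net

# Multi-label task with N labels -> N independently trained binary models.
N = 14                                         # e.g. NIH chest X-ray label count
models_per_label = [make_binary_classifier() for _ in range(N)]
```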
Graphics Processing Units (GPUs) have become the dominant accelerators for Machine Learning (ML) and High-Performance Computing (HPC) applications due to their massive parallelism, exploited chiefly through general matrix-matrix multiplication (GEMM) kernels. However, GEMM kernels often suffer from duplicated memory requests, mainly caused by the matrix tiling used for handling large matrices. While GPUs have adopted programmable shared memory to mitigate this issue by keeping frequently reused data in shared memory, GEMM still introduces duplication in the register files. Our observations show that matrix tiling issues memory requests to the same shared memory address for neighboring threads, and this results in a substantial amount of duplicated data in the register files. Such duplication degrades GPU performance by limiting warp-level parallelism, due to the register shortage, and by issuing redundant memory requests to shared memory. We find that the data duplication can be categorized into two types that occur with fixed patterns during matrix tiling. Based on these observations, we introduce SHREG, an architecture design that enables different threads to share registers holding overlapping data from shared memory, effectively reducing duplicated data within the register files. By leveraging the duplication patterns, SHREG utilizes register sharing and improves performance with minimal hardware overhead. Our evaluation shows that SHREG improves performance by 31.4% on various ML applications over the baseline GPU.
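The duplication pattern SHREG targets is visible in an ordinary tiled GEMM kernel. In the Numba CUDA sketch below (a generic textbook tiling, not SHREG itself), every thread in the same row of a thread block reads the identical shared-memory word sA[tx, k] into its own registers, which is exactly the kind of fixed-pattern redundancy that register sharing would eliminate:

```python
import numpy as np
from numba import cuda, float32

TPB = 16  # tile width (threads per block edge)

@cuda.jit
def tiled_matmul(A, B, C):
    sA = cuda.shared.array((TPB, TPB), float32)
    sB = cuda.shared.array((TPB, TPB), float32)
    x, y = cuda.grid(2)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    acc = float32(0.0)
    for m in range((A.shape[1] + TPB - 1) // TPB):
        # Stage one tile of A and one tile of B in shared memory.
        if x < A.shape[0] and m * TPB + ty < A.shape[1]:
            sA[tx, ty] = A[x, m * TPB + ty]
        else:
            sA[tx, ty] = 0.0
        if m * TPB + tx < B.shape[0] and y < B.shape[1]:
            sB[tx, ty] = B[m * TPB + tx, y]
        else:
            sB[tx, ty] = 0.0
        cuda.syncthreads()
        for k in range(TPB):
            # All TPB threads sharing this tx read the same word sA[tx, k],
            # so one shared-memory value is replicated across TPB register
            # files -- the fixed-pattern duplication described above.
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = acc

# Toy launch on square matrices whose sizes divide the tile width.
n = 256
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.zeros((n, n), dtype=np.float32)
tiled_matmul[(n // TPB, n // TPB), (TPB, TPB)](A, B, C)
```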