Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip will not be practical due to slowing growth in transistor density, low chip yields, and photoreticle limitations. To maintain performance scalability, proposals exist to aggregate discrete GPUs into a larger virtual GPU and to decompose a single GPU into multi-chip modules with increased aggregate die area. These approaches introduce non-uniform memory access (NUMA) effects and lead to decreased performance and energy efficiency if not managed appropriately. To overcome these effects, we propose a holistic Locality-Aware Data Management (LADM) system designed to operate on massive logical GPUs composed of multiple discrete devices, which are themselves composed of chiplets. LADM has three key components: a threadblock-centric index analysis, a runtime system that performs data placement and threadblock scheduling, and an adaptive cache insertion policy. The runtime combines information from the static analysis with topology information to proactively optimize data placement, threadblock scheduling, and remote data caching, minimizing off-chip traffic. Compared to state-of-the-art multi-GPU scheduling, LADM reduces inter-chip memory traffic by 4× and improves system performance by 1.8× on a future multi-GPU system.
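As a rough, heavily simplified illustration of the locality idea in this abstract (keep data on the chip whose threadblocks use it, and schedule work where the data lives), the sketch below partitions an array across the visible GPUs and launches each partition's kernel on its owning device. It is not LADM itself: the array size, kernel, and even partitioning scheme are illustrative assumptions, and the index analysis, cache policy, and topology awareness are not modelled.

```cuda
// Minimal single-node sketch of locality-aware placement: each partition is
// allocated on the GPU that will compute on it, so kernels touch local memory.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;            // touches only the partition local to this GPU
}

int main() {
    const int total = 1 << 24;
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 0;

    int chunk = (total + ndev - 1) / ndev;
    std::vector<float*> part(ndev);

    // Placement: allocate each chunk on the device that will process it.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&part[d], chunk * sizeof(float));
        cudaMemset(part[d], 0, chunk * sizeof(float));
    }
    // Scheduling: launch each partition's work on its owning device.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        scale<<<(chunk + 255) / 256, 256>>>(part[d], chunk, 2.0f);
    }
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(part[d]);
    }
    printf("processed %d elements on %d GPUs\n", total, ndev);
    return 0;
}
```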
3D tomographic imaging requires the computation of solutions to very large inverse problems. In many applications, iterative algorithms provide superior results; however, memory limits in available computing hardware restrict the size of problems that can be solved. For this reason, iterative methods are not normally used to reconstruct typical data sets acquired with lab-based CT systems. We thus use state-of-the-art techniques such as dual buffering to develop an efficient strategy for computing the operations required for iterative reconstruction. This allows the iterative reconstruction of volumetric images of arbitrary size using any number of GPUs, each with arbitrarily small memory. Strategies for both the forward and backprojection operators are presented, along with two regularization approaches that are easily generalized to other projection types or regularizers. The proposed improvement also accelerates reconstruction of smaller images on single- or multi-GPU systems, providing faster code for time-critical applications. The resulting algorithm has been added to the TIGRE toolbox, a repository of iterative reconstruction algorithms for general CT, but this memory-saving and problem-splitting strategy can easily be adapted for use with other GPU-based tomographic reconstruction codes.
•The article presents a novel way of multi-GPU reconstruction for very large CT images.
•The large reconstructions can be executed on local workstations (instead of HPCs).
•The software tool is provided for free.
•We show one of the largest CT reconstructions with iterative methods, computed on a desktop.
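As a rough illustration of the dual-buffering strategy mentioned in the abstract above, the sketch below streams a volume through the GPU slab by slab, overlapping the transfer of the next slab with computation on the current one using two device buffers and two CUDA streams. The slab size and the per-voxel kernel are placeholder assumptions, not TIGRE's projection or backprojection operators.

```cuda
// Minimal sketch of double buffering for volumes larger than device memory:
// while slab k is processed on one stream, slab k+1 is copied in on the other.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(float* slab, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) slab[i] += 1.0f;              // stand-in for projection work
}

int main() {
    const size_t slabVoxels = 1u << 22;      // illustrative slab size
    const int numSlabs = 8;

    float* host;                             // pinned so async copies can overlap
    cudaMallocHost((void**)&host, numSlabs * slabVoxels * sizeof(float));
    for (size_t i = 0; i < numSlabs * slabVoxels; ++i) host[i] = 0.0f;

    float* dev[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&dev[b], slabVoxels * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int k = 0; k < numSlabs; ++k) {
        int b = k % 2;                       // alternate between the two buffers
        cudaMemcpyAsync(dev[b], host + k * slabVoxels,
                        slabVoxels * sizeof(float), cudaMemcpyHostToDevice, stream[b]);
        process<<<(slabVoxels + 255) / 256, 256, 0, stream[b]>>>(dev[b], slabVoxels);
        cudaMemcpyAsync(host + k * slabVoxels, dev[b],
                        slabVoxels * sizeof(float), cudaMemcpyDeviceToHost, stream[b]);
        // The other stream keeps working on the previous slab while this one fills.
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) { cudaFree(dev[b]); cudaStreamDestroy(stream[b]); }
    cudaFreeHost(host);
    printf("processed %d slabs with double buffering\n", numSlabs);
    return 0;
}
```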
•A new domain-decomposition-based parallel algorithm is constructed and implemented for the cPINN and XPINN methods.
•The proposed algorithm adds another dimension of parallelism to SciML, which is primarily driven by data and model parallelism.
•The scaling of the proposed algorithm is demonstrated on CPU and CPU+GPU architectures.
We develop a distributed framework for physics-informed neural networks (PINNs) based on two recent extensions, namely conservative PINNs (cPINNs) and extended PINNs (XPINNs), which employ domain decomposition in space and in space-time, respectively. This domain decomposition endows cPINNs and XPINNs with several advantages over vanilla PINNs, such as parallelization capacity, large representation capacity, and efficient hyperparameter tuning, and makes them particularly effective for multi-scale and multi-physics problems. Here, we present a parallel algorithm for cPINNs and XPINNs constructed with a hybrid programming model described by MPI + X, where X ∈ {CPUs, GPUs}. The main advantage of cPINNs and XPINNs over the more classical data- and model-parallel approaches is the flexibility of optimizing all hyperparameters of each neural network separately in each subdomain. We compare the performance of distributed cPINNs and XPINNs for various forward problems, using both weak and strong scaling. Our results indicate that for spatial domain decomposition, cPINNs are more efficient in terms of communication cost, but XPINNs provide greater flexibility, as they can also handle time-domain decomposition for any differential equation and can deal with arbitrarily shaped, complex subdomains. To this end, we also present an application of the parallel XPINN method for solving an inverse diffusion problem with variable conductivity on the United States map, using ten regions as subdomains.
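The following sketch only illustrates the MPI + X layout described in this abstract: one MPI rank per subdomain, each rank bound to a GPU, with interface data exchanged between neighbouring subdomains. It does not train any neural network; the interface buffer size, the ring-shaped neighbour layout, and the host-staged copies are illustrative assumptions (a CUDA-aware MPI could pass device pointers directly).

```cuda
// Minimal MPI + CUDA sketch: one subdomain per rank, one GPU per rank, and an
// exchange of interface values with a neighbour, the way subdomain networks
// would share interface predictions.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) { MPI_Finalize(); return 0; }
    cudaSetDevice(rank % ndev);              // bind this subdomain to one GPU

    const int nIface = 1024;                 // points on the shared interface
    float* dIface;
    cudaMalloc(&dIface, nIface * sizeof(float));

    // Stage interface data through pageable host buffers (portable choice).
    std::vector<float> sendBuf(nIface, (float)rank), recvBuf(nIface);
    cudaMemcpy(dIface, sendBuf.data(), nIface * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(sendBuf.data(), dIface, nIface * sizeof(float), cudaMemcpyDeviceToHost);

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Sendrecv(sendBuf.data(), nIface, MPI_FLOAT, right, 0,
                 recvBuf.data(), nIface, MPI_FLOAT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received interface value %.1f from rank %d\n",
           rank, recvBuf[0], left);
    cudaFree(dIface);
    MPI_Finalize();
    return 0;
}
```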
We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, 2013). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., 2008). The software supports short-ranged pair force and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We demonstrate equivalent or superior scaling on up to 3375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double-precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5×.
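To make the launch-parameter autotuning idea mentioned in this abstract concrete, the sketch below times a kernel at several candidate block sizes with CUDA events and keeps the fastest. The kernel and candidate set are illustrative assumptions, not HOOMD-blue's own force kernels or tuner.

```cuda
// Minimal sketch of block-size autotuning with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pairStandIn(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;    // stand-in for a pair-force kernel
}

int main() {
    const int n = 1 << 22;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    const int candidates[] = {64, 128, 256, 512, 1024};
    int best = candidates[0];
    float bestMs = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int bs : candidates) {
        int grid = (n + bs - 1) / bs;
        pairStandIn<<<grid, bs>>>(d, n);      // warm-up launch
        cudaEventRecord(start);
        pairStandIn<<<grid, bs>>>(d, n);      // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; best = bs; }
    }
    printf("autotuned block size: %d (%.3f ms)\n", best, bestMs);

    cudaEventDestroy(start); cudaEventDestroy(stop); cudaFree(d);
    return 0;
}
```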
We propose a multi-phase, multi-resolution SPH method for fluid/solid interaction with a multi-GPU implementation and dynamic load balancing that follows the movement of the refinement regions. The primary design goal of this framework is to maintain the efficiency of the single-resolution SPH model running on a single GPU. To this end, a multi-background mesh is introduced, and the domain is treated as a nested multi-domain with different resolutions. Validation using both a δ-SPH and a Riemann SPH model is shown, and the method is applied to the simulation of the water entry of a projectile at a high Froude number, with comparisons to experimental data from three challenging test cases. The results show the proposed model's ability to correctly reproduce the free-surface evolution on water entry, the motion of the projectile, and the formation and evolution of multiple cavities depending on entry angle and velocity. An analysis of the computational performance and resolutions achieved (up to 120 million particles) is also provided across several test cases.
•A multi-GPU, multi-node implementation is developed using CUDA for the GPU execution.
•The efficiencies of the multi-GPU and multi-resolution models are presented and validated.
•The two-phase flow of three water-entry cases of a slender body is simulated using the model.
•The SPH model can provide a sharp interface between air and water for water-entry problems.
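As a small structural illustration of the multi-GPU decomposition described in the abstract above, the sketch below exchanges boundary ("ghost") particle data between two GPUs that each own one half of the domain, using peer-to-peer copies where the hardware allows them. The buffer sizes and flat float buffers are illustrative assumptions; the paper's SPH data structures and its dynamic load-balancing scheme are not modelled.

```cuda
// Minimal sketch of a halo (ghost-particle) exchange between two GPUs.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) { printf("needs at least 2 GPUs\n"); return 0; }

    const int nGhost = 4096;                 // particles near the shared boundary
    float* edge[2];                          // outgoing boundary particles
    float* ghost[2];                         // incoming ghosts from the neighbour

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&edge[d], nGhost * sizeof(float));
        cudaMalloc(&ghost[d], nGhost * sizeof(float));
        cudaMemset(edge[d], 0, nGhost * sizeof(float));
    }

    // Enable direct peer access in both directions if supported;
    // cudaMemcpyPeer falls back to staging through the host otherwise.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // Exchange: GPU 0's boundary particles become GPU 1's ghosts and vice versa.
    cudaMemcpyPeer(ghost[1], 1, edge[0], 0, nGhost * sizeof(float));
    cudaMemcpyPeer(ghost[0], 0, edge[1], 1, nGhost * sizeof(float));

    for (int d = 0; d < 2; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }
    printf("exchanged %d ghost particles between 2 GPUs\n", nGhost);

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaFree(edge[d]); cudaFree(ghost[d]);
    }
    return 0;
}
```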
We present a novel parallel algorithm for cloth simulation that exploits multiple GPUs for fast computation and the handling of very high-resolution meshes. To accelerate implicit integration, we describe new parallel algorithms for sparse matrix-vector multiplication (SpMV) and for dynamic matrix assembly on a multi-GPU workstation. Our algorithms use a novel work-queue generation scheme for a fat-tree GPU interconnect topology. Furthermore, we present a novel collision-handling scheme that uses spatial hashing for discrete and continuous collision detection along with a non-linear impact zone solver. Our parallel schemes distribute the computation and storage overhead among multiple GPUs and enable almost interactive simulation of complex cloth meshes, which can hardly be handled on a single GPU due to memory limitations. We have evaluated the performance on two multi-GPU workstations (with 4 and 8 GPUs, respectively) using cloth meshes with 0.5–1.65 million triangles. Our approach reliably handles the collisions and generates vivid wrinkles and folds at 2–5 fps, which is significantly faster than prior cloth simulation systems. We observe almost linear speedups with respect to the number of GPUs.
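The sketch below shows the basic row-partitioned SpMV pattern underlying multi-GPU implicit integration: each device receives a block of CSR rows plus a full copy of x, computes its slice of y, and the host gathers the slices. The tridiagonal test matrix is an illustrative assumption, and the paper's work-queue generation and fat-tree-aware distribution are not modelled.

```cuda
// Minimal sketch of row-partitioned CSR SpMV across multiple GPUs.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>
#include <cstdio>

__global__ void spmvCsr(int nRows, const int* rowPtr, const int* colIdx,
                        const float* val, const float* x, float* y) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRows) return;
    float sum = 0.0f;
    for (int j = rowPtr[r]; j < rowPtr[r + 1]; ++j)
        sum += val[j] * x[colIdx[j]];
    y[r] = sum;
}

int main() {
    const int n = 1 << 16;
    // Host CSR for a simple tridiagonal matrix (stand-in for the cloth system).
    std::vector<int> rowPtr(n + 1), colIdx;
    std::vector<float> val, x(n, 1.0f), y(n, 0.0f);
    for (int r = 0; r < n; ++r) {
        rowPtr[r] = (int)colIdx.size();
        for (int c = r - 1; c <= r + 1; ++c)
            if (c >= 0 && c < n) { colIdx.push_back(c); val.push_back(c == r ? 2.0f : -1.0f); }
    }
    rowPtr[n] = (int)colIdx.size();

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 0;
    int rowsPer = (n + ndev - 1) / ndev;

    for (int d = 0; d < ndev; ++d) {
        int r0 = d * rowsPer, r1 = std::min(n, r0 + rowsPer), nr = r1 - r0;
        if (nr <= 0) break;
        int e0 = rowPtr[r0], nnz = rowPtr[r1] - e0;

        cudaSetDevice(d);
        int *dRow, *dCol; float *dVal, *dX, *dY;
        cudaMalloc(&dRow, (nr + 1) * sizeof(int));
        cudaMalloc(&dCol, nnz * sizeof(int));
        cudaMalloc(&dVal, nnz * sizeof(float));
        cudaMalloc(&dX, n * sizeof(float));
        cudaMalloc(&dY, nr * sizeof(float));

        // Shift row pointers so the local row block starts at zero.
        std::vector<int> localRow(nr + 1);
        for (int i = 0; i <= nr; ++i) localRow[i] = rowPtr[r0 + i] - e0;

        cudaMemcpy(dRow, localRow.data(), (nr + 1) * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dCol, colIdx.data() + e0, nnz * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dVal, val.data() + e0, nnz * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        spmvCsr<<<(nr + 255) / 256, 256>>>(nr, dRow, dCol, dVal, dX, dY);
        cudaMemcpy(y.data() + r0, dY, nr * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(dRow); cudaFree(dCol); cudaFree(dVal); cudaFree(dX); cudaFree(dY);
    }
    printf("y[0]=%.1f y[%d]=%.1f\n", y[0], n / 2, y[n / 2]);
    return 0;
}
```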
Background: The accuracy of the convolution/superposition (CS) algorithm is considered second only to that of the Monte Carlo (MC) algorithm among radiotherapy dose calculation algorithms. Although CS is much faster than MC, its speed still cannot fully meet clinical requirements. With the aid of a single graphics processing unit (GPU, Tesla C1060), the CS algorithm can be accelerated to 60 times faster than the traditional serial CPU calculation. The calculation time for a single field is then about 1 min, which is usable for some simple three-dimensional conformal radiotherapy (3DCRT) planning but does not satisfy the speed required for intensity-modulated radiation therapy (IMRT) planning. Purpose: This study aims to explore a faster CS calculation scheme for IMRT using multiple GPUs. Methods: The acceleration scheme of a CPU + multi-GPU heterogeneous model was analyzed using different numbers of GPUs. A high-end GPU, i.e., Tesla C2015, was used
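The following sketch only illustrates the CPU + multi-GPU layout this abstract describes: independent IMRT fields are distributed round-robin over the available GPUs, each device accumulates a partial dose grid, and the host sums the partial grids. The "dose" kernel is a trivial stand-in, not a convolution/superposition implementation, and the grid and field counts are assumptions.

```cuda
// Minimal sketch of distributing per-field dose work across multiple GPUs.
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

__global__ void depositField(float* dose, int nVoxels, float weight) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVoxels) dose[i] += weight;      // placeholder for per-field CS dose
}

int main() {
    const int nVoxels = 1 << 20;
    const int nFields = 9;                   // illustrative IMRT plan size

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 0;

    std::vector<float*> dDose(ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&dDose[d], nVoxels * sizeof(float));
        cudaMemset(dDose[d], 0, nVoxels * sizeof(float));
    }

    // Round-robin field assignment; launches on different devices run concurrently.
    for (int f = 0; f < nFields; ++f) {
        int d = f % ndev;
        cudaSetDevice(d);
        depositField<<<(nVoxels + 255) / 256, 256>>>(dDose[d], nVoxels, 1.0f);
    }

    // Gather and sum the per-device partial dose grids on the host.
    std::vector<float> total(nVoxels, 0.0f), partial(nVoxels);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaMemcpy(partial.data(), dDose[d], nVoxels * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < nVoxels; ++i) total[i] += partial[i];
        cudaFree(dDose[d]);
    }
    printf("total dose at voxel 0 after %d fields: %.1f\n", nFields, total[0]);
    return 0;
}
```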
Recent advances in the development of Eulerian incompressible smoothed particle hydrodynamics (EISPH), such as high-order convergence and natural coupling with Lagrangian formulations, demonstrate its potential as a meshless alternative to traditional computational fluid dynamics (CFD) methods. This work addresses one of the major outstanding limitations of EISPH, its relatively high computational cost, by providing an implementation that can be deployed on multiple graphics processing units (GPUs). To this end, a pre-existing multi-GPU version of the open-source Lagrangian weakly compressible code DualSPHysics is converted to an EISPH formulation and integrated with an open-source multi-GPU multigrid solver (AmgX) to treat the pressure Poisson equation (PPE) implicitly. The integration of AmgX within DualSPHysics presents a significant challenge, since AmgX is designed for distributed systems and therefore conflicts with the single-node shared-memory design of the multi-GPU DualSPHysics code. The present implementation is validated against well-known test cases, showing excellent agreement with benchmark solutions and demonstrating second-order convergence. Detailed profiling and performance testing are also presented to investigate memory consumption and scaling characteristics. The results show approximately 87%–95% strong-scaling efficiency and 92%–94% weak-scaling efficiency in both 2D and 3D on up to four GPUs. Large spikes in memory consumption during the initialisation of the linear solver library are found to impede full utilisation of the device memory. Nevertheless, the present implementation permits problem sizes on the order of 69.3 million (2D) and 20.8 million (3D) particles on four GPUs (128 GB total device memory), which is beyond what has previously been reported for incompressible SPH on GPUs with an implicitly treated PPE, and demonstrates its potential as an alternative to traditional CFD methods.
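To show what "treating the PPE implicitly on the GPU" means at its simplest, the sketch below performs plain Jacobi sweeps on a 2D pressure grid as a stand-in for the AmgX multigrid solve used in the paper. The grid size, zero source term, and fixed iteration count are illustrative assumptions; there is no SPH coupling or multi-GPU distribution here.

```cuda
// Minimal sketch of an implicit pressure Poisson solve via Jacobi sweeps.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void jacobiSweep(const float* pOld, float* pNew, const float* rhs,
                            int nx, int ny, float h2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    int id = j * nx + i;
    // Standard 5-point Jacobi update for  -laplacian(p) = rhs  with spacing h.
    pNew[id] = 0.25f * (pOld[id - 1] + pOld[id + 1] +
                        pOld[id - nx] + pOld[id + nx] + h2 * rhs[id]);
}

int main() {
    const int nx = 512, ny = 512, n = nx * ny;
    const float h = 1.0f / (nx - 1), h2 = h * h;

    float *p0, *p1, *rhs;
    cudaMalloc(&p0, n * sizeof(float));
    cudaMalloc(&p1, n * sizeof(float));
    cudaMalloc(&rhs, n * sizeof(float));
    cudaMemset(p0, 0, n * sizeof(float));
    cudaMemset(p1, 0, n * sizeof(float));
    cudaMemset(rhs, 0, n * sizeof(float));   // zero source; boundaries stay fixed

    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    for (int it = 0; it < 1000; ++it) {
        jacobiSweep<<<grid, block>>>(p0, p1, rhs, nx, ny, h2);
        float* tmp = p0; p0 = p1; p1 = tmp;  // ping-pong buffers
    }
    cudaDeviceSynchronize();
    printf("completed 1000 Jacobi sweeps on a %dx%d pressure grid\n", nx, ny);

    cudaFree(p0); cudaFree(p1); cudaFree(rhs);
    return 0;
}
```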
•Unified single-node curved boundary conditions based on the TLBM are proposed, with higher accuracy and parallel scalability than conventional schemes.
•A parametric combined momentum-exchange method accurately simulates the lift coefficients of airfoils in transitional flows using multi-GPU computation.
•Simulated average Nusselt numbers Nu¯ for forced convection over airfoils, similar to the experimental results, are consistent with and modify the Hilpert correlation.
Laminar separation bubbles in transitional flows over low-Reynolds-number airfoils significantly affect aerodynamic efficiency and heat-transfer characteristics, and capturing them requires numerical methods with accurate boundary treatment of complex geometry. This work proposes unified single-node curved boundary conditions based on the thermal lattice Boltzmann method (TLBM), and several benchmarks verify their higher accuracy and parallel scalability compared with conventional schemes. Multi-GPU simulations of the Eppler 61 and SD8020 airfoils are carried out, and the lift coefficients CL are accurately simulated with the parametric combined momentum-exchange method. Furthermore, time-averaged pressure coefficients Cp and local Nusselt numbers Nu on the airfoil surfaces are obtained, and the unsteady flow and heat-transfer characteristics of the laminar separation bubbles are captured. Simulated average Nusselt numbers Nu¯ for an Eppler 61 airfoil at different Reynolds numbers are consistent with, and used to modify, the Hilpert correlation. The numerical method can quickly and accurately predict transitional flows over low-Reynolds-number airfoils and guide the design and application of heat transfer in blade configurations.
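For readers unfamiliar with momentum-exchange force evaluation in LBM, the sketch below sums, over the boundary links cut by the wall, the contribution e_i (f_i + f_ī) of the population pair crossing each link, using atomics to accumulate the surface force from which CL would be obtained. This is the classic bounce-back momentum-exchange form, not the paper's unified single-node or parametric combined scheme, and the link data here are placeholder assumptions.

```cuda
// Minimal sketch of momentum-exchange force accumulation over boundary links.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void momentumExchange(int nLinks, const float* fToWall,
                                 const float* fFromWall, const float* ex,
                                 const float* ey, float* force) {
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l >= nLinks) return;
    float m = fToWall[l] + fFromWall[l];     // momentum exchanged across this link
    atomicAdd(&force[0], ex[l] * m);         // x component of the surface force
    atomicAdd(&force[1], ey[l] * m);         // y component
}

int main() {
    const int nLinks = 10000;                // boundary links cut by the airfoil
    float *dTo, *dFrom, *dEx, *dEy, *dF;
    cudaMalloc(&dTo, nLinks * sizeof(float));
    cudaMalloc(&dFrom, nLinks * sizeof(float));
    cudaMalloc(&dEx, nLinks * sizeof(float));
    cudaMalloc(&dEy, nLinks * sizeof(float));
    cudaMalloc(&dF, 2 * sizeof(float));
    cudaMemset(dTo, 0, nLinks * sizeof(float));   // placeholder populations
    cudaMemset(dFrom, 0, nLinks * sizeof(float));
    cudaMemset(dEx, 0, nLinks * sizeof(float));   // placeholder link directions
    cudaMemset(dEy, 0, nLinks * sizeof(float));
    cudaMemset(dF, 0, 2 * sizeof(float));

    momentumExchange<<<(nLinks + 255) / 256, 256>>>(nLinks, dTo, dFrom, dEx, dEy, dF);

    float F[2];
    cudaMemcpy(F, dF, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("surface force: Fx=%.3f Fy=%.3f (lift, then CL after normalisation)\n", F[0], F[1]);

    cudaFree(dTo); cudaFree(dFrom); cudaFree(dEx); cudaFree(dEy); cudaFree(dF);
    return 0;
}
```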