•Unified single-node curved boundary conditions based on the TLBM are proposed, with higher accuracy and parallel scalability than conventional schemes.•The parametric combined momentum-exchange method accurately simulates the lift coefficients of airfoils in transitional flows using multi-GPU computation.•Simulated average Nusselt numbers Nu¯ for forced convection over airfoils agree with experimental results and refine the Hilpert correlation.
Laminar separation bubbles in the transitional flows of low-Reynolds-number airfoils significantly affect aerodynamic efficiency and heat transfer characteristics, and resolving them requires numerical methods with accurate boundary treatment of complex geometries. This work proposes unified single-node curved boundary conditions based on the thermal lattice Boltzmann method, and several benchmarks verify their higher accuracy and parallel scalability over conventional schemes. Multi-GPU simulations of the Eppler 61 and SD8020 airfoils are carried out, and the lift coefficients CL are accurately simulated by the parametric combined momentum-exchange method. Furthermore, time-averaged pressure coefficients Cp and local Nusselt numbers Nu on the airfoil surfaces are obtained, and the unsteady flow and heat transfer characteristics of laminar separation bubbles are captured. Simulated average Nusselt numbers Nu¯ for an Eppler 61 airfoil at different Reynolds numbers are consistent with, and serve to refine, the Hilpert correlation. The numerical method can quickly and accurately predict transitional flows of low-Reynolds-number airfoils and guide the design and application of heat transfer with blade configurations.
A new open-source multi-GPU 2D flood model called TRITON is presented in this work. The model solves the 2D shallow water equations with source terms using a time-explicit first-order upwind scheme based on an augmented Roe solver that incorporates a careful estimation of bed slopes and a local implicit formulation of the friction terms. The scheme is demonstrated to be first-order accurate, robust, and able to solve flows under various conditions. TRITON is implemented so that the model effectively utilizes heterogeneous architectures, from single to multiple CPUs and GPUs. Different test cases illustrate the capabilities and performance of the model, showing promising runtimes for large spatial and temporal scales when leveraging the computing power of GPUs. Under this hardware configuration, communication and input/output subroutines may impact scalability. The code is developed under an open-source license and can be freely downloaded from https://code.ornl.gov/hydro/triton.
•TRITON is released as a multi-GPU open source 2D hydrodynamic flood code.•It solves the 2D shallow water equations using a first order time-explicit scheme.•It runs efficiently for realistic configurations using heterogeneous architectures.•The runoff capability is demonstrated to be convenient for flood modeling.•Communication and I/O times may represent a bottleneck for operational purposes.
Graphics Processing Units (GPUs) are nowadays widely used in all-atom molecular simulations because atom pairs can be efficiently partitioned among kernels that compute their contributions to the energy and forces, enabling the treatment of very large systems. Extension of the time and size scales of computations is also sought through the development of coarse-grained (CG) models, in which atoms are merged into extended interaction sites. Implementing CG codes on GPUs, particularly on multiple-GPU platforms, is, however, a challenge because of the more complicated potentials and the removal of the explicit solvent, which force developers to use interaction-domain rather than spatial-domain decomposition. In this paper, we propose the design of a multi-GPU coarse-grained simulator and report the implementation of the heavily coarse-grained, physics-based UNited RESidue (UNRES) model of polypeptide chains. By moving all computations to the GPUs and keeping the communication with the CPUs to a minimum, we achieved an almost 5-fold speed-up with 8 A100 GPU accelerators for systems with over 200,000 amino-acid residues, making UNRES the most scalable coarse-grained software and enabling laboratory-time millisecond-scale simulations of such cell components as tubulin within days of wall-clock time.
Program Title: Multi-GPU UNRES
CPC Library link to program files:https://doi.org/10.17632/hz9s4nwncf.1
Developer's repository link:https://projects.task.gda.pl/eurohpcpl-public/unres
Licensing provisions: GPLv3
Programming language: Fortran + C++/CUDA
Nature of problem: Physics-based simulations of protein systems at biologically relevant time and size scales are demanding and consequently require both a simplified representation of the biomolecules and substantial computational resources. UNRES (from UNited RESidue) is a physics-based reduced model of polypeptide chains for running large-scale coarse-grained simulations of protein structure and dynamics. It enables researchers to study protein folding, protein dynamics, and protein-protein interactions in a physically realistic manner and thereby unveil the mechanisms of biological processes. Examples of biological applications include studies of amyloid formation, signaling mechanisms, and the action of molecular chaperones.
Solution method: The presented Multi-GPU UNRES relies on a highly optimized GPU implementation of non-central forces using modern CUDA constructs. Fundamentally, this is made possible by the proposed efficient partitioning and assignment of the interaction domain onto GPU resources. We moved as many computations as possible to the device (GPU) side. In most cases, computations are defined and scheduled as CUDA graphs; in selected cases, scheduling kernels manually yields slightly better performance. To maximize parallelism, multiple CUDA streams are used. Furthermore, the code visibly benefits from a tree-based, shared-memory allreduce algorithm. Additionally, if supported by the hardware, peer memory access is enabled between all GPUs and the allreduce algorithm takes advantage of it. These features have made the UNRES coarse-grained protein model with implicit solvent scalable across multiple GPUs, allowing us to achieve an almost 5-fold speed-up with 8 A100 GPU accelerators for systems with over 200,000 amino-acid residues.
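The tree-based allreduce pattern described above can be sketched in plain C. This is a hypothetical, host-side illustration of the communication structure only; in the actual multi-GPU setting each buffer lives on a different device and partial sums move over NVLink peer access, and the function and variable names below are invented for illustration:

```c
#include <stddef.h>

/* Binary-tree allreduce over nprocs buffers of length n: pairwise sums
   climb the tree in ceil(log2(nprocs)) steps, then the root's result is
   broadcast back so every buffer holds the global sum. A sequential
   sketch of the pattern, not UNRES code. */
void tree_allreduce_sum(double **buf, int nprocs, size_t n) {
    for (int step = 1; step < nprocs; step *= 2)        /* reduction phase */
        for (int r = 0; r + step < nprocs; r += 2 * step)
            for (size_t i = 0; i < n; i++)
                buf[r][i] += buf[r + step][i];
    for (int r = 1; r < nprocs; r++)                    /* broadcast phase */
        for (size_t i = 0; i < n; i++)
            buf[r][i] = buf[0][i];
}
```

Compared with a flat sequential reduction, the tree shape exposes independent pairwise sums at each level, which is what allows the GPU version to overlap them.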
•UNRES coarse-grained protein model with implicit solvent made scalable for multi-GPUs.•Efficient partitioning and assignment of the interaction domain onto GPU resources.•Highly optimized GPU implementation of non-central forces using modern CUDA constructs.
Subgraph matching is an important data mining method for complex networks. In recent years, subgraph matching algorithms based on GPUs (graphics processing units) have shown clear speed advantages. However, owing to the large scale of graph data and the large number of intermediate results produced by subgraph matching, the memory capacity of a single GPU quickly becomes the main bottleneck when processing subgraph matching on large graphs. This paper therefore proposes a multi-GPU programming model for large-graph subgraph matching. Firstly, a framework for a multi-GPU subgraph matching algorithm is proposed, and the cooperative operation of subgraph matching across multiple GPUs is realized, which solves the problem of graph scale for subgraph matching on GPUs. Secondly, a dynamic adjustment technique based on the query graph is used to handle cross-partition subgraph sets, which solves the cross-partition subgraph matching problem caused by graph segmentation. Finally, based on the characteristics of S
Improving the computational efficiency of 3D FWI is a challenging task in seismic imaging. Using a multi-GPU cluster with an acceleration strategy to simulate wave propagation is an important means of improving its efficiency. We propose a multi-GPU-accelerated 3D acoustic FWI algorithm based on the FDTD method in this paper. We improved the parallelism of the single-GPU 3D wavefield simulation algorithm using a sliding 2D thread-block algorithm with three different 2D shared-memory stencils. For the multi-node implementation, we achieved bidirectional parallel data transfer between GPUs and used multiple kernels to further overlap calculation and transfer. Numerical tests verify the validity of our multi-GPU-accelerated 3D FWI algorithm. The strategies used in our algorithm bring significant improvements in most cases, and the improvement is strongly related to the model size and the number of GPUs used. In our tests, we achieve an acceleration of up to 19% in forward simulation and 25% in gradient calculation compared with a typical multi-GPU implementation.
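The sliding thread-block idea above amounts to marching a 2D working set along the slow axis so only a few planes of the wavefield are live at once. The following sequential C sketch captures that access pattern for a 7-point Laplacian (the building block of acoustic FDTD); the layout and names are illustrative assumptions, not the paper's kernels, and on a GPU the three planes would sit in shared memory or registers:

```c
#include <stddef.h>

/* March along z keeping three z-planes (below/current/above) of the input
   live at a time, computing the 7-point Laplacian with unit grid spacing
   on interior points. Sequential sketch of the sliding-window access
   pattern used by sliding 2D thread-block stencils. */
void laplacian_sliding(const double *f, double *out, int nx, int ny, int nz) {
    size_t plane = (size_t)nx * ny;
    for (int z = 1; z < nz - 1; z++) {
        const double *lo  = f + (size_t)(z - 1) * plane;  /* plane below  */
        const double *mid = f + (size_t)z * plane;        /* current plane */
        const double *hi  = f + (size_t)(z + 1) * plane;  /* plane above  */
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++) {
                size_t c = (size_t)y * nx + x;
                out[(size_t)z * plane + c] =
                    mid[c - 1] + mid[c + 1] + mid[c - nx] + mid[c + nx]
                  + lo[c] + hi[c] - 6.0 * mid[c];
            }
    }
}
```

Reusing the three planes as the window slides is what cuts redundant global-memory reads, which is the same saving the shared-memory stencils in the paper target.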
This paper presents the development of a multi-GPU version of a time-explicit finite volume solver for the Shallow-Water Equations (SWE). MPI is combined with CUDA-Fortran in order to use as many GPUs as needed, and the METIS library is leveraged to perform a domain decomposition on the 2D unstructured triangular meshes of interest. A CUDA-aware version of OpenMPI is adopted to speed up the messages between the MPI processes. A study of both speed-up and efficiency is conducted; first for a classic dam-break flow in a canal, and then for two real domains with complex bathymetries. In both cases, meshes with up to 12 million cells are used. Using 24 to 28 GPUs on these meshes leads to an efficiency of 80% and more. Finally, the multi-GPU version is compared with the pure-MPI multi-CPU version, and it is concluded that, in this particular case, about 100 CPU cores would be needed to achieve the same performance as one GPU. The developed methodology is applicable to general time-explicit Riemann solvers for conservation laws.
•Multi-GPU version of a finite volume solver for the Shallow-Water Equations using CUDA and a CUDA-Aware version of OpenMPI.•Domain decomposition of 2D unstructured meshes using METIS with a specific renumbering for efficient memory exchange.•Achievement of a 21x speed-up when using 32 GPUs compared to utilizing a single GPU.•Comparison of the Multi-GPU and Multi-CPU versions of our in-house code shows that 8 GPUs perform as well as 1024 CPU cores.
We introduce an updated library, PaScaL_TDMA 2.0, originally designed for the efficient computation of batched tridiagonal systems and now capable of exploiting multi-GPU environments. The library extends its functionality to include GPU support and minimizes CPU-GPU data transfer by utilizing device-resident memory, while retaining the original CPU-based capabilities. It employs pipeline copying with shared memory for low-latency memory access and incorporates CUDA-aware MPI for efficient multi-GPU communication. Our GPU implementation demonstrated outstanding computational performance compared with the original CPU implementation while consuming much less energy. In summary, this updated version presents a time-efficient and energy-saving approach for solving batched tridiagonal systems on modern computing platforms, both GPU and CPU.
Program Title: PaScaL_TDMA 2.0
CPC Library link to program files:https://doi.org/10.17632/49z6fh94z3.2
Developer's repository link:https://github.com/MPMC-Lab/PaScaL_TDMA
Licensing provisions: MIT
Programming language: CUDA Fortran. The program was tested using NVIDIA HPC SDK 22.7.
Journal reference of previous version: Comput. Phys. Comm. 260 (2021), 107722
Does the new version supersede the previous version?: Yes
Reasons for the new version: This version supports multi-GPU acceleration for solving batched tridiagonal systems of equations using the modified Thomas algorithm, which was originally implemented in the PaScaL_TDMA library. CUDA Fortran is used for the current implementation of PaScaL_TDMA to exploit the unique features of GPU, such as shared memory and CUDA-aware MPI.
Summary of revisions: PaScaL_TDMA 2.0 is a versatile library designed to solve many tridiagonal systems arising in multi-dimensional partial differential equations on both CPU and GPU platforms. It builds upon the original CPU version of the PaScaL_TDMA library initially proposed by Kim et al. [1] and extends its functionality to enable GPU acceleration.
Our updated library is equipped with several modifications that enhance its performance on multi-GPU platforms. First, all variables of the tridiagonal matrix algorithm (TDMA) in the GPU implementation are kept in device-resident memory, minimizing transfers between the host and the GPU devices. Second, to accelerate GPU computation, we incorporated CUDA kernels into the loop structure of the existing algorithm, utilizing pipeline-copy techniques in shared memory during the forward elimination and backward substitution steps of the TDMA. Consequently, PaScaL_TDMA 2.0 minimizes global memory access and significantly improves performance. Furthermore, the library implements CUDA-aware MPI communication, thereby increasing parallel efficiency; this technique enables fast communication on systems with direct GPU-to-GPU interconnects such as NVLink. Finally, the sequential Thomas algorithm [2] is employed instead of the PaScaL_TDMA algorithm to avoid unnecessary steps when only a single process is involved, without domain partitioning, for both the CPU and GPU codes.
We evaluated the computational performance and energy efficiency of the GPU implementation of PaScaL_TDMA 2.0 on the NEURON cluster at the Korea Institute of Science and Technology Information (KISTI). The cluster consists of two AMD EPYC 7543 processors (hosts) and eight NVLink-connected NVIDIA A100 GPUs (devices) per compute node. The results were compared with those obtained on the NURION cluster at KISTI, which features an Intel Xeon Phi 7250 Knights Landing (KNL) processor per compute node. Intel oneAPI 22.2 [3] and NVIDIA HPC SDK 22.7 [4] were used to compile PaScaL_TDMA 2.0 on the NURION and NEURON clusters, respectively. In our evaluation, we used 64 cores per CPU in the KNL configuration, whereas in the AMD configuration we used as many cores as GPUs. The cores are decomposed using the method proposed by Kim et al. [1].
Figure 1(a) presents the wall-clock time versus the number of CPUs/GPUs for two grid sizes, 512³ and 1024³; all results show strong scalability, regardless of the grid size or the CPU/GPU version. Remarkably, the computational performance of the A100 GPUs surpassed that of the KNL many-core CPUs, achieving 4.34x and 6.43x speedups on average with 512³ and 1024³ grid points, respectively. The GPU implementation exhibits strong scalability even beyond eight GPUs, which implies that it handles internode communication effectively.
Figure 1(b) shows the energy consumed by the KNL CPUs and the A100 GPUs when solving tridiagonal systems with grid sizes of 512³ and 1024³. For the A100 GPU results, the energy consumed by the AMD EPYC CPUs is also plotted. The energy consumption was evaluated as the time integral of the instantaneous power consumption, measured using the turbostat utility [5] for the CPUs and the NVIDIA Management Library (NVML) [6] for the GPUs. In the 512³ case, execution on A100 GPUs consumes only 8.5% of the energy required by the KNL CPUs, an 11.8x gain in energy efficiency. The 1024³ case is consistent: the A100 GPUs deliver 11.6x more energy-efficient execution than the KNL CPUs, requiring 8.6% of the energy. These findings highlight the benefits of this upgraded version on GPU clusters, in terms of both compute performance and energy consumption, compared with the original version.
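The energy metric above, a time integral of sampled instantaneous power, amounts to simple numerical quadrature. A minimal trapezoidal sketch (a hypothetical post-processing helper, not part of PaScaL_TDMA or the measurement tools):

```c
/* Trapezoidal integration of power samples p[i] (watts) taken at times
   t[i] (seconds) -> total energy in joules. Generic post-processing of
   power traces such as those from turbostat or NVML. */
double energy_joules(const double *t, const double *p, int n) {
    double e = 0.0;
    for (int i = 1; i < n; i++)
        e += 0.5 * (p[i - 1] + p[i]) * (t[i] - t[i - 1]);
    return e;
}
```

For example, a constant 100 W draw sampled over 10 s integrates to 1000 J regardless of the sampling interval.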
Nature of problem: This library solves batched tridiagonal systems arising in multi-dimensional partial differential equations.
Solution method: A divide-and-conquer approach is employed to solve partitioned tridiagonal systems of equations on distributed-memory systems. The modified Thomas algorithm is applied to the partitioned submatrices to transform the systems into modified forms, from which reduced tridiagonal systems are constructed through all-to-all communication. The reduced tridiagonal systems are solved using the sequential Thomas algorithm, and the solutions are then distributed to update the remaining unknowns in the partitioned systems. The detailed computational procedures are described in Kim et al. [1]; all procedures were implemented using CUDA Fortran in this updated version.
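The sequential Thomas algorithm used for the reduced systems is the standard forward-elimination / back-substitution recurrence. A minimal C sketch for a single system (the library itself implements this in CUDA Fortran over batches of systems; names here are generic):

```c
/* Thomas algorithm for a tridiagonal system:
   sub-diagonal a[1..n-1], diagonal b[0..n-1], super-diagonal c[0..n-2],
   right-hand side d[0..n-1]. Overwrites c and d; the solution ends up
   in d. Standard textbook form for a single diagonally dominant system. */
void thomas(const double *a, const double *b, double *c, double *d, int n) {
    c[0] /= b[0];                                   /* forward elimination */
    d[0] /= b[0];
    for (int i = 1; i < n; i++) {
        double m = 1.0 / (b[i] - a[i] * c[i - 1]);  /* reciprocal pivot */
        if (i < n - 1) c[i] *= m;
        d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    for (int i = n - 2; i >= 0; i--)                /* back substitution */
        d[i] -= c[i] * d[i + 1];
}
```

Both sweeps are inherently serial along the system, which is why the library parallelizes across the batch of independent systems rather than within one solve.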
[1] K.-H. Kim, J.-H. Kang, X. Pan, J.-I. Choi, Comput. Phys. Commun. 260 (2021) 107722, https://doi.org/10.1016/j.cpc.2020.107722.
[2] L.H. Thomas, Watson Sci. Comput. Lab. Rept., Columbia University, New York 1 (1949) 71.
[3] https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
[4] https://docs.nvidia.com/hpc-sdk/index.html
[5] https://github.com/torvalds/linux/tree/master/tools/power/x86/turbostat
[6] https://docs.nvidia.com/deploy/nvml-api
As progress in electronic structure theory is made, ab initio molecular dynamics (MD) based on orbital-free density functional theory (OF-DFT) is becoming increasingly successful at substituting for the traditional, very accurate but computationally costly Kohn–Sham (KS) approach in simulations of matter in the challenging warm dense matter (WDM) regime. However, despite the significant cost reduction from eliminating the dependence on the KS orbitals, OF-DFT MD runs require ~10² to 10³ CPU cores running for days, or even weeks, to simulate systems comprising 10² to 10³ atoms, depending on the thermodynamic conditions. Here we present DRAGON, a multi-GPU OF-DFT MD code for fast and efficient simulations of WDM. With a relatively small allocation of resources (4 to 8 GPU devices), it provides an order-of-magnitude speedup for simulations containing $\mathscr{O}$(10⁴) atoms and targets systems of $\mathscr{O}$(10⁵) atoms at conditions within the WDM regime, which is currently beyond the capabilities of CPU codes.
Unprecedented climate change and anthropogenic activities have induced increasing ecohydrological problems, motivating the development of large-scale hydrologic modeling for solutions. Water age/quality is as important as water quantity for understanding the terrestrial water cycle. However, scientific progress in tracking water parcels at large scale with high spatiotemporal resolution lags far behind that in simulating water balance/quantity, owing to the lack of powerful modeling tools. EcoSLIM is a particle tracking model working with ParFlow-CLM, which couples integrated surface-subsurface hydrology with land surface processes. Here, we demonstrate a parallel framework on distributed, multi-Graphics Processing Unit platforms with Compute Unified Device Architecture-aware Message Passing Interface for accelerating EcoSLIM to continental scale. In tests from catchment to regional and then to continental scale using 25 million to 1.6 billion particles, EcoSLIM shows significant speedup and excellent parallel performance. The parallel framework is portable to atmospheric and oceanic particle tracking models, where parallelization is inadequate and a standard parallel framework is also absent. The parallelized EcoSLIM is a promising tool for accelerating our understanding of the terrestrial water cycle and for upscaling subsurface hydrology to Earth System Models.
Plain Language Summary
Studies of water ages at multiple spatiotemporal scales are urgently needed to better understand the connections between different hydrologic compartments, and climate change and anthropogenic activities make this need more pressing. Lagrangian particle tracking is a powerful tool for simulating water ages; however, it is computationally demanding, which hampers its wide application. In this study, we equip a Lagrangian particle tracking model, EcoSLIM, with a novel parallel framework that enables it to handle large-scale water age simulations with high spatiotemporal resolution. This work combines the efforts of engineers and scientists from multiple disciplines and could not have been accomplished within a single discipline. To the best of our knowledge, such a modeling tool has been absent from the hydrology and Earth Surface Processes communities. In tests from catchment to regional and then to continental scale using 25 million to 1.6 billion particles, EcoSLIM shows significant speedup and excellent parallel performance. Although we take EcoSLIM as our example, the parallel framework is portable to other particle tracking models in Earth System Science, such as those in the atmospheric and oceanic disciplines. The parallelized EcoSLIM is a promising tool for the hydrologic community and for Earth System Model developers in scientific exploration.
Key Points
Numerical models for large‐scale water age/quality simulations are absent in communities of hydrology and Earth Surface Processes
A parallel framework for accelerating Lagrangian particle tracking to continental‐scale on distributed, multi‐Graphics Processing Unit platforms is established
The parallelized particle tracking model, EcoSLIM, is a promising tool to accelerate our understanding of the terrestrial water cycle