We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, 2013). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., 2008). The software supports short-ranged pair and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We demonstrate equivalent or superior scaling on up to 3375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double-precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5×.
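The bookkeeping at the heart of such a spatial decomposition can be shown with a minimal serial sketch, assuming a 1D slab layout with ghost ("halo") replication near slab boundaries. The function name and layout here are illustrative only, not HOOMD-blue's API:

```python
# Illustrative 1D slab decomposition with ghost-particle replication.
# A serial sketch of the general idea only, not HOOMD-blue's actual
# (3D, MPI-based) implementation.

def decompose(positions, box, nx, skin):
    """Assign each particle to one of `nx` equal slabs along x, and
    replicate particles lying within `skin` of a slab boundary into the
    neighbouring slab's ghost list (periodic boundaries)."""
    w = box / nx                        # slab width
    owned = [[] for _ in range(nx)]
    ghosts = [[] for _ in range(nx)]
    for p in positions:
        r = p % box                     # wrap into the periodic box
        i = min(int(r // w), nx - 1)    # index of the owning slab
        owned[i].append(r)
        if r - i * w < skin:            # close to the lower slab face
            ghosts[(i - 1) % nx].append(r)
        if (i + 1) * w - r < skin:      # close to the upper slab face
            ghosts[(i + 1) % nx].append(r)
    return owned, ghosts
```

In an MPI code, each rank would own one slab and the ghost lists would be exchanged with neighbouring ranks every step; the cutoff-plus-skin width bounds which remote particles a rank needs for short-ranged forces.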
We present the Fluid Transport Accelerated Solver, FluTAS, a scalable GPU code for multiphase flows with thermal effects. The code solves the incompressible Navier-Stokes equations for two-fluid systems, with a direct FFT-based Poisson solver for the pressure equation. The interface between the two fluids is represented with the Volume of Fluid (VoF) method, which is mass conserving and well suited for complex flows thanks to its capacity for handling topological changes. The energy equation is explicitly solved and coupled with the momentum equation through the Boussinesq approximation. The code is conceived in a modular fashion so that different numerical methods can be used independently, existing routines can be modified, and new ones can be included in a straightforward and sustainable manner. FluTAS is written in modern Fortran, parallelized using hybrid MPI/OpenMP in the CPU-only version, and accelerated with OpenACC directives in the GPU implementation. We present different benchmarks to validate the code, and two large-scale simulations of fundamental interest in turbulent multiphase flows: isothermal emulsions in homogeneous isotropic turbulence (HIT) and two-layer Rayleigh-Bénard convection. FluTAS is distributed under an MIT license and stems from a collaborative effort of several scientists, aiming to become a flexible tool to study complex multiphase flows.
Program Title: Fluid Transport Accelerated Solver, FluTAS.
CPC Library link to program files: https://doi.org/10.17632/tp6k8wky8m.1
Developer's repository link: https://github.com/Multiphysics-Flow-Solvers/FluTAS.git
Licensing provisions: MIT License.
Programming language: Fortran 90, parallelized using MPI and slab/pencil decomposition, GPU accelerated using OpenACC directives.
External libraries/routines: FFTW, cuFFT.
Nature of problem: FluTAS is a GPU-accelerated numerical code tailored to perform interface-resolved simulations of incompressible multiphase flows, optionally with heat transfer. The code combines a standard pressure-correction algorithm with an algebraic volume of fluid method, MTHINC [1].
Solution method: the code employs a second-order finite-difference discretization and solves the two-fluid Navier-Stokes equations using a projection method. It can run on both CPU and GPU architectures.
We present LBcuda, a GPU-accelerated version of LBsoft, our open-source MPI-based software for the simulation of multi-component colloidal flows. We describe the design principles, the optimizations, and the resulting performance as compared to the CPU version, using both a mid-range GPU and high-end NVIDIA GPU cards (V100 and the latest A100). The results show a substantial acceleration of the fluid solver, reaching up to 200 GLUPS (Giga Lattice Updates Per Second) on a cluster of 512 NVIDIA A100 cards simulating a grid of eight billion lattice points. These results open attractive prospects for the computational design of new materials based on colloidal particles.
Program Title: LBcuda
CPC Library link to program files: https://doi.org/10.17632/v6fvmzpcrn.1
Developer's repository link: https://github.com/copmat/LBcuda
Licensing provisions: 3-Clause BSD License
Programming language: CUDA Fortran
Nature of problem: Hydrodynamics of colloidal multi-component systems and Pickering emulsions.
Solution method: Lattice-Boltzmann method solving the Navier-Stokes equations for the fluid dynamics within an Eulerian description. A particle solver describes the colloidal particles within a Lagrangian representation coupled to the fluid solver. The numerical solution of the coupling algorithm includes the back-reaction effects for each force term, following a fluid-particle multi-scale paradigm.
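The relaxation-and-streaming structure of such a lattice-Boltzmann fluid solver can be sketched with a generic D2Q9 BGK update. This is a textbook formulation for illustration, not LBcuda's CUDA Fortran kernels, and the Lagrangian particle coupling is omitted:

```python
import numpy as np

# Generic D2Q9 BGK lattice-Boltzmann step (collision + periodic streaming).
# Textbook sketch of the Eulerian fluid solver; illustration only.

C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])        # lattice velocities
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)              # lattice weights

def equilibrium(rho, ux, uy):
    """Second-order Maxwell-Boltzmann equilibrium distributions."""
    cu = 3 * (C[:, 0, None, None] * ux + C[:, 1, None, None] * uy)
    usq = 1.5 * (ux ** 2 + uy ** 2)
    return rho * W[:, None, None] * (1 + cu + 0.5 * cu ** 2 - usq)

def step(f, tau):
    """One BGK collision with relaxation time `tau`, then streaming."""
    rho = f.sum(axis=0)                                   # density
    ux = (f * C[:, 0, None, None]).sum(axis=0) / rho      # momentum / density
    uy = (f * C[:, 1, None, None]).sum(axis=0) / rho
    f += (equilibrium(rho, ux, uy) - f) / tau             # BGK relaxation
    for i, (cx, cy) in enumerate(C):                      # periodic streaming
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f
```

The update is local (collision) plus nearest-neighbour shifts (streaming), which is exactly the memory-access pattern that maps well onto GPU lattice-update kernels.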
We review the status of the Quantum ESPRESSO software suite for electronic-structure calculations based on plane waves, pseudopotentials, and density-functional theory. We highlight the recent developments in the porting to GPUs of the main codes, using an approach based on OpenACC and CUDA Fortran offloading. We describe, in particular, the results achieved on linear-response codes, which are one of the distinctive features of the Quantum ESPRESSO suite. We also present extensive performance benchmarks on different GPU-accelerated architectures for the main codes of the suite.
The 2DECOMP&FFT library is a software framework written in modern Fortran to build large-scale parallel applications. It is designed for applications using three-dimensional structured meshes, with a particular focus on spatially implicit numerical algorithms. However, the library can easily be used with other discretisation schemes based on a structured layout where pencil decomposition applies. It is based on a general-purpose 2D pencil decomposition for data distribution and data input/output (I/O). A 1D slab decomposition is also available as a special case of the 2D pencil decomposition. The library includes a highly scalable and efficient interface to perform three-dimensional Fast Fourier Transforms (FFTs). It has been designed to be user-friendly, with a clean application programming interface hiding most communication details from application developers, and portable, with support for modern CPUs and NVIDIA GPUs (support for AMD and Intel GPUs to follow).
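The core idea of a pencil-decomposed 3D FFT, one-dimensional transforms along the locally contiguous axis interleaved with data redistributions, can be illustrated serially. In the sketch below, a plain array transpose stands in for the MPI all-to-all communication such a library performs between stages; this is an illustration of the strategy, not 2DECOMP&FFT's API:

```python
import numpy as np

# Serial illustration of the pencil-decomposition strategy for a 3D FFT:
# 1D transforms along the contiguous axis alternate with redistributions.
# Here transposes stand in for the MPI all-to-all between pencil layouts.

def fft3d_pencil(a):
    """3D FFT of an array indexed (z, y, x), one axis at a time."""
    a = np.fft.fft(a, axis=2)                     # x-pencils: transform x
    a = np.fft.fft(a.transpose(0, 2, 1), axis=2)  # redistribute, transform y
    a = np.fft.fft(a.transpose(1, 2, 0), axis=2)  # redistribute, transform z
    return a.transpose(2, 1, 0)                   # back to (z, y, x) layout
```

Because each stage only ever transforms a locally complete line of data, every rank can call an ordinary serial FFT (FFTW or cuFFT) on its own pencils, and all parallelism is confined to the redistribution steps.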
SVE (Scalable Vector Extension) is Arm's new vector instruction extension targeting high-performance workloads. SVE offers many opportunities to optimise compute-intensive workloads but, with the availability of SVE-enabled hardware still years away, we have to rely on simulation techniques to evaluate our implementations. Working with simulators can be tricky and comes with many limitations but, used properly, a simulator like Gem5 is a valuable tool for exploring the possibilities opened by this new extension. As a use case, we focus on the field of genomics, where the recent advent of high-throughput sequencing machines producing large amounts of genomic data has boosted interest in efficient approximate string matching and alignment techniques. Genomics algorithms are the key computational building blocks for downstream data analysis in resequencing projects, where hundreds of gigabytes of sequenced data are analysed against a reference genome to filter sequencing errors and detect potential genomic variation events. The computational requirements and sheer size of the input data make these applications a challenging problem. Fortunately, they also exhibit a high degree of data parallelism, making them good candidates for vectorisation techniques. In this work we explore the unique opportunities that SVE provides to exploit the parallelism present in genomics algorithms. We discuss preliminary results, our simulation strategy, some of the obstacles and limitations we faced, and how to work around them to obtain meaningful results.
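A classic example of the data parallelism these algorithms expose is bit-parallel approximate string matching, where a single machine word updates an entire column of the dynamic-programming matrix per text character. The following pure-Python sketch of the edit-distance variant follows Hyyrö's formulation of Myers' algorithm; it is generic background, not tied to any particular SVE intrinsics, and an SVE version would process many such words per instruction:

```python
# Bit-parallel Levenshtein distance (Myers' algorithm, Hyyro's formulation).
# Delta vectors pv/mv encode the +1/-1 differences down one DP column, so
# each text character costs O(1) word operations instead of O(m) cell updates.

def myers_edit_distance(p, t):
    """Edit distance between pattern `p` and text `t` (any lengths)."""
    m = len(p)
    if m == 0:
        return len(t)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    peq = {}                            # per-character match masks for p
    for i, c in enumerate(p):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m
    for c in t:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask   # positive horizontal deltas
        mh = pv & xh                    # negative horizontal deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) | 1              # carry-in: first DP row is 0,1,2,...
        pv = ((mh << 1) | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

Python's arbitrary-precision integers make the sketch work for any pattern length; in a native implementation the word width (or, with SVE, the hardware vector length) bounds the pattern block handled per lane.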
This paper assesses and reports the experience of ten teams working to port, validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems built by GIGABYTE, each equipped with a server-class Arm CPU from Ampere Computing and an A100 data center GPU from NVIDIA Corp. The systems are connected using an InfiniBand high-bandwidth, low-latency interconnect. The selected applications and mini-apps are written in several programming languages and use multiple accelerator-based programming models for GPUs, such as CUDA, OpenACC, and OpenMP offloading. Porting applications requires a robust and easy-to-access programming environment, including a variety of compilers and optimized scientific libraries. The goal of this work is to evaluate platform readiness and assess the effort required from developers to deploy well-established scientific workloads on current and future generations of Arm-based GPU-accelerated HPC systems. The reported case studies demonstrate that the current level of maturity and diversity of software and tools is already adequate for large-scale production deployments.