Harnessing the power of modern multi-GPU architectures, we present a massively parallel simulation system based on the Material Point Method (MPM) for simulating the physical behaviors of materials undergoing complex topological changes, self-collision, and large deformations. Our system makes three critical contributions. First, we introduce a new particle data structure that promotes coalesced memory access patterns on the GPU and eliminates the need for complex atomic operations on the memory hierarchy when writing particle data to the grid. Second, we propose a kernel fusion approach using a new Grid-to-Particles-to-Grid (G2P2G) scheme, which efficiently reduces GPU kernel launches, lowers latency, and significantly reduces the amount of global memory needed to store particle data. Finally, we introduce optimized algorithmic designs that allow for efficient sparse grids in a shared memory context, enabling us to best utilize modern multi-GPU computational platforms for hybrid Lagrangian-Eulerian computational patterns. We demonstrate the effectiveness of our method with extensive benchmarks, evaluations, and dynamic simulations with elastoplasticity, granular media, and fluid dynamics. In comparisons against an open-source and heavily optimized CPU-based MPM codebase [Fang et al. 2019] on an elastic-sphere collision scene with particle counts ranging from 5 to 40 million, our GPU MPM achieves over 100× per-time-step speedup on a workstation with an Intel 8086K CPU and a single Quadro P6000 GPU, exposing exciting possibilities for future MPM simulations in computer graphics and computational science. Moreover, compared to the state-of-the-art GPU MPM method [Hu et al. 2019a], we not only achieve 2× acceleration on a single GPU, but our kernel fusion strategy and Array-of-Structs-of-Arrays (AoSoA) data structure design also generalize to multi-GPU systems.
Our multi-GPU MPM exhibits near-perfect weak and strong scaling with 4 GPUs, enabling performant and large-scale simulations on a 1024³ grid with close to 100 million particles at less than 4 minutes per frame on a single 4-GPU workstation, and 134 million particles at less than 1 minute per frame on an 8-GPU workstation.
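The AoSoA layout mentioned above can be illustrated with a small indexing sketch. This is a minimal analog, not the paper's implementation: the tile size and attribute count here are hypothetical, whereas a real GPU implementation would match them to hardware warp and block sizes.

```python
TILE = 64       # particles per tile (hypothetical; GPU code would match warp/block size)
NUM_ATTRS = 3   # e.g. x, y, z position components (hypothetical)

def aosoa_index(particle_id, attr):
    """Flat index into an Array-of-Structs-of-Arrays buffer: the buffer is an
    array of tiles, and within each tile every attribute is stored as a
    contiguous mini-array (SoA) over the tile's particles."""
    tile, lane = divmod(particle_id, TILE)
    return tile * TILE * NUM_ATTRS + attr * TILE + lane

# Consecutive particles reading the same attribute touch consecutive
# addresses, which is the coalesced-access pattern a GPU rewards:
assert [aosoa_index(p, 0) for p in range(4)] == [0, 1, 2, 3]
# Different attributes of one particle sit TILE entries apart within a tile:
assert aosoa_index(0, 1) - aosoa_index(0, 0) == TILE
```

The design splits the difference between pure AoS (good locality per particle, poor coalescing) and pure SoA (good coalescing, poor locality when a kernel touches many attributes of one particle).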
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware.
We begin with the technical motivations that underlie general‐purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general‐purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general‐purpose application development on graphics hardware.
Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. We leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations. We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920×1080.
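The core lookup behind such a hash encoding can be sketched in a few lines. This is a 2-D, single-level illustration under assumed sizes (table size, feature dimension, and hashing primes are placeholders; the published method uses multiple resolution levels, 3-D inputs, and its own defaults):

```python
import numpy as np

TABLE_SIZE = 2 ** 14      # entries per level (hypothetical)
FEATURE_DIM = 2           # trainable features per entry (hypothetical)
PRIMES = (1, 2654435761)  # spatial-hashing primes for the 2-D case

rng = np.random.default_rng(0)
# The table itself is the trainable parameter set, optimized by SGD.
table = rng.normal(scale=1e-4, size=(TABLE_SIZE, FEATURE_DIM))

def hash2d(ix, iy):
    """Hash integer grid coordinates into a table slot (collisions allowed)."""
    return (ix * PRIMES[0] ^ iy * PRIMES[1]) % TABLE_SIZE

def encode(x, y, resolution):
    """Bilinearly interpolate the hashed corner features of the grid cell
    containing (x, y) at one resolution level."""
    gx, gy = x * resolution, y * resolution
    ix, iy = int(gx), int(gy)
    fx, fy = gx - ix, gy - iy
    corners = [(ix, iy), (ix + 1, iy), (ix, iy + 1), (ix + 1, iy + 1)]
    weights = [(1 - fx) * (1 - fy), fx * (1 - fy), (1 - fx) * fy, fx * fy]
    return sum(w * table[hash2d(cx, cy)] for w, (cx, cy) in zip(weights, corners))

feat = encode(0.3, 0.7, resolution=16)
assert feat.shape == (FEATURE_DIM,)
```

In the full scheme, such per-level feature vectors from many resolutions are concatenated and fed to the small MLP; the interpolation makes the encoding differentiable with respect to the table entries.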
We have developed a parallel implementation of an Elasto-Viscoplastic Fast Fourier Transform-based (EVPFFT) micromechanical solver to enable computationally efficient crystal plasticity modeling for polycrystalline materials. Our primary focus lies in achieving performance portability, allowing a single EVPFFT implementation to run optimally on various homogeneous architectures, including multi-core Central Processing Units (CPUs), as well as on heterogeneous computer architectures comprising multi-core CPUs and Graphics Processing Units (GPUs) from different vendors. To accomplish this goal, we have leveraged MATAR, a C++ software library that simplifies the creation and utilization of multidimensional dense or sparse matrix and array data structures. These data structures are designed to be portable across diverse architectures through the use of Kokkos, a performance-portable library. Additionally, we have employed the Message Passing Interface (MPI) to efficiently distribute the computational workload among processors. The heFFTe (Highly Efficient FFT for Exascale) library is used to facilitate the performance portability of the fast Fourier transforms (FFTs) computation. The computational performance of EVPFFT is evaluated and presented in terms of parallel scalability and simulation runtime on different high-performance computing (HPC) architectures. The utility of the developed framework to efficiently simulate the micro-mechanical fields in polycrystalline microstructures in engineering applications is discussed.
Program Title: EVPFFT
CPC Library link to program files: https://doi.org/10.17632/2k8579fyyv.1
Developer's repository link: https://github.com/lanl/Fierro
Licensing provisions: BSD 3-Clause License
Programming language: C++
External routines/libraries: MPI, Kokkos, MATAR, heFFTe, HDF5
Nature of problem: EVPFFT is a crystal plasticity code designed to compute micro-mechanical fields within a polycrystalline representative volume element (RVE) and predict the macroscale response of the RVE.
Solution method: EVPFFT uses the periodic Green's function method in Fourier space to solve the field equations of static stress equilibrium in a periodic spatial domain.
• MPI+X (where X can be CUDA, HIP, SYCL, OpenMP, or Pthreads) implementation of the EVPFFT model is developed.
• Achieved performance portability on diverse computing architectures, including CPUs and GPUs.
• Demonstrated parallel scalability across different representative volume element sizes using both multi-CPUs and multi-GPUs.
• Future-proofed the EVPFFT program for evolving high-performance computing platforms.
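The spectral solution strategy named above (field equations solved with a periodic Green's operator in Fourier space) can be illustrated on a much simpler stand-in problem. The sketch below solves a periodic Poisson equation −∆u = f on a unit cell with FFTs; it is only an analog of the pattern, not the EVPFFT equilibrium equations:

```python
import numpy as np

# Spectral solve of -laplacian(u) = f on a periodic unit square:
# divide by |k|^2 in Fourier space, exactly as a Green's-operator apply.
n = 64
k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)   # integer wavenumbers * 2*pi
kx, ky = np.meshgrid(k, k, indexing="ij")
k2 = kx**2 + ky**2
k2[0, 0] = 1.0                                 # avoid division by zero at k = 0

x = np.linspace(0, 1, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u_exact = np.sin(2 * np.pi * X) * np.cos(4 * np.pi * Y)
f = (4 * np.pi**2 + 16 * np.pi**2) * u_exact   # f = -laplacian(u_exact)

u_hat = np.fft.fft2(f) / k2
u_hat[0, 0] = 0.0                              # fix the zero-mean gauge
u = np.real(np.fft.ifft2(u_hat))
assert np.max(np.abs(u - u_exact)) < 1e-10     # spectrally exact for this f
```

The same structure (transform, multiply by a precomputed operator, inverse transform, iterate to handle nonlinearity) is what makes heFFTe's distributed FFTs the natural performance-portability layer for the solver.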
This paper is concerned with computational issues related to penalized quantile regression (PQR) with ultrahigh dimensional predictors. Various algorithms have been developed for PQR, but they become ineffective and/or infeasible in the presence of ultrahigh dimensional predictors due to storage and scalability limitations. The variable-updating scheme of the feature-splitting algorithm that directly applies the ordinary alternating direction method of multipliers (ADMM) to ultrahigh dimensional PQR may make the algorithm fail to converge. To tackle this hurdle, we propose an efficient and parallelizable algorithm for ultrahigh dimensional PQR based on the three-block ADMM. The compatibility of the proposed algorithm with parallel computing alleviates the storage and scalability limitations of a single machine in large-scale data processing. We establish the rate of convergence of the newly proposed algorithm. In addition, Monte Carlo simulations are conducted to compare the finite sample performance of the proposed algorithm with that of other existing algorithms. The numerical comparison implies that the proposed algorithm significantly outperforms the existing ones. We further illustrate the proposed algorithm via an empirical analysis of a real-world data set.
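A key ingredient that makes ADMM attractive for quantile regression is that the quantile check loss has a closed-form proximal operator, so the loss block of each ADMM iteration is an elementwise update. The sketch below shows that operator in isolation (a standard identity, not the paper's specific three-block scheme):

```python
import numpy as np

def prox_check(v, tau, lam):
    """Closed-form proximal operator of lam * rho_tau, where
    rho_tau(u) = u * (tau - 1{u < 0}) is the quantile check loss:
    shift-and-threshold with asymmetric thresholds lam*tau and lam*(1-tau)."""
    return np.where(v > lam * tau, v - lam * tau,
           np.where(v < -lam * (1 - tau), v + lam * (1 - tau), 0.0))

v = np.array([2.0, -2.0, 0.1])
out = prox_check(v, tau=0.5, lam=1.0)
# with tau = 0.5 both thresholds are 0.5: large values shrink, small ones vanish
assert np.allclose(out, [1.5, -1.5, 0.0])
```

For tau ≠ 0.5 the two thresholds differ, which is exactly how the operator encodes the asymmetric penalty on over- versus under-prediction residuals.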
The coupled-wave equations (CWEs) in nonlinear optics are the fundamental starting point in the study, analysis, and understanding of various frequency conversion processes in dielectric media subjected to intense laser radiation. In this work, a useful package for the modeling of optical parametric oscillators (OPOs) based on the Split-Step Fourier Method algorithm is presented. The algorithm is scripted in the CUDA programming language in order to speed up the calculations and obtain results in a relatively short time frame by using a graphics processing unit (GPU). Our results show a speedup higher than 50× for a vector size of 2¹⁴ in comparison with the analogous code scripted for running only on a CPU. The package implements the CWEs to model the propagation of light in second-order nonlinear crystals widely used in optical frequency conversion experiments. In addition, the code allows the user to adapt the cavity configuration by selecting the resonant electric fields and/or incorporating intracavity elements. The package is useful for modeling OPOs or other mathematically similar problems.
Program Title: cuOPO
CPC Library link to program files: https://doi.org/10.17632/5djxwg4fbp.1
Developer's repository link: https://github.com/alfredos84/cuOPO
Licensing provisions: MIT
Programming language: CUDA
Nature of problem: The problem that is solved in this work is that of two or three coupled differential equations that describe the propagation of light in a second order nonlinear medium, allowing the three-wave mixing process. By placing the medium in an optical cavity, an optical parametric oscillator is formed. The optical cavity is modeled by including the appropriate boundary conditions for the differential equations. As a result we obtain the electric fields of the interacting waves in the time and frequency domains.
Solution method: The coupled differential equations are solved using the well-known fixed-step Split-Step Fourier method. Due to the eventual computational demand that some problems may have, we chose to implement the coupled equations in the CUDA programming language. This allows us to significantly speed up simulations, thanks to the computing power provided by a graphics processing unit (GPU) card. The output files obtained are the interacting electric fields, which have to be analyzed during post-processing.
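The operator-splitting pattern behind the fixed-step Split-Step Fourier method can be shown on a single field. This sketch propagates one pulse with a dispersive step in the frequency domain and a Kerr-like nonlinear step in the time domain; cuOPO itself solves the coupled three-wave equations with cavity boundary conditions, so all parameters here are illustrative only:

```python
import numpy as np

# Single-field split-step Fourier sketch (illustrative parameters).
N, T = 2**10, 50.0
dt = T / N
t = (np.arange(N) - N // 2) * dt
w = 2 * np.pi * np.fft.fftfreq(N, d=dt)      # angular frequency grid

beta2, gamma, dz, steps = -1.0, 1.0, 1e-3, 100
A = np.exp(-t**2 / 2).astype(complex)        # Gaussian input pulse

for _ in range(steps):
    # linear (dispersion) half of the split: phase multiply in frequency domain
    A = np.fft.ifft(np.exp(0.5j * beta2 * w**2 * dz) * np.fft.fft(A))
    # nonlinear half: intensity-dependent phase in the time domain
    A = A * np.exp(1j * gamma * np.abs(A)**2 * dz)

# both steps are pure phase factors, so the pulse energy is conserved
energy_in = np.sum(np.abs(np.exp(-t**2 / 2))**2)
assert abs(np.sum(np.abs(A)**2) - energy_in) < 1e-8
```

Because each step is an elementwise phase multiply plus an FFT, the method maps naturally onto a GPU, which is what the CUDA implementation exploits.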
We propose a nodal stochastic generation and transmission expansion planning model that incorporates the output from high-resolution global climate models through load and generation availability scenarios. We implement our model in Pyomo and perform computational studies on a realistically-sized test case of the California electric grid in a high performance computing environment. We propose model reformulations and algorithm tuning to efficiently solve this large problem using a variant of the Progressive Hedging Algorithm. We utilize the parallelization capabilities and overall versatility of mpi-sppy, exploiting its hub-and-spoke architecture to concurrently obtain inner and outer bounds on an optimal expansion plan. Initial results show that instances with 360 representative days on a system with over 8,000 buses can be solved to within 5% of optimality in under 4 hours of wall-clock time, a first step towards solving a large-scale power system expansion planning problem across a wide range of climate-informed operational scenarios.
• Including climate projections into power system expansion plans helps resiliency.
• Joint transmission, storage and generation expansion increases size of problem.
• Stochastic optimization addresses uncertainty, but is computationally challenging.
• Decomposition via Progressive Hedging Algorithm makes this problem tractable.
• Parallel computing implementation allows quick solution times.
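The Progressive Hedging decomposition named above can be demonstrated on a toy instance: a scalar first-stage decision, two scenarios with quadratic costs, and the nonanticipativity requirement that both scenario copies of the decision agree. All data below are hypothetical, and the closed-form subproblem solve replaces the real model's large optimization:

```python
import numpy as np

# Toy Progressive Hedging: scenario objective f_s(x) = (x - xi_s)^2.
# The nonanticipative optimum is the probability-weighted mean of xi.
xi = np.array([1.0, 3.0])       # hypothetical scenario data
p = np.array([0.5, 0.5])        # scenario probabilities
rho = 1.0                       # PH penalty parameter

x = xi.copy()                   # per-scenario copies of the decision
w = np.zeros_like(xi)           # dual weights enforcing consensus
for _ in range(200):
    xbar = p @ x                # implementable (probability-averaged) solution
    w = w + rho * (x - xbar)    # dual update toward nonanticipativity
    # scenario subproblem argmin_x f_s(x) + w_s*x + (rho/2)(x - xbar)^2,
    # solved in closed form for this quadratic f_s:
    x = (2 * xi - w + rho * xbar) / (2 + rho)

assert np.allclose(x, 2.0, atol=1e-6)   # both copies agree at the weighted mean
```

The scenario subproblems are independent, which is what mpi-sppy's hub-and-spoke architecture parallelizes at scale, with separate spokes producing the inner and outer bounds.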
Because its detectors are distributed within the target medium, vertical seismic profiling (VSP) is a seismic observation method that offers high resolution and a high signal-to-noise ratio. Separating the mixed wavefields into upgoing and downgoing waves reveals clearer dynamic and kinematic characteristics of the seismic waves, guiding subsequent imaging and interpretation. The traditional separation method is mainly based on picking first breaks: flattening the global seismic data by first breaks can enhance the seismic events through techniques such as median filtering and singular value decomposition (SVD). However, this method relies on high-precision first-break picking and is limited to zero offset. To address this limitation, we introduce an improved median-filtering separation method, which separates the local dip angles into positive and negative components through multi-window scanning (MWS). Owing to the high accuracy and robustness of this approach on 2-D VSP data, we use the local dip angle of the wavefields to median filter the wavefield along positive and negative angles, yielding the upgoing and downgoing waves. The separation is further refined by iteratively identifying the directions of the seismic data, but this refinement increases the computational cost, especially under high-precision conditions. To alleviate this problem, we use multi-threaded parallel computing on a multi-core central processing unit (CPU) to improve computational efficiency. Finally, we validate the proposed method on synthetic seismic data and field VSP data. The results show that this wavefield separation method is more accurate and robust than median filtering based on first-break picking.
• The seismic attribute of the slope field is used to guide median filtering in wavefield separation, enhancing the quality of the VSP data.
• An adaptive scanning window based on the frequencies of the wavelets is proposed for obtaining the slope field.
• Field VSP data are tested with the acceleration strategy to confirm the robustness of the proposed method.
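The dip-oriented median filtering at the heart of the separation can be sketched on a synthetic 2-D gather. This is a minimal stand-in: the dip is given directly rather than estimated by multi-window scanning, and the filter length is arbitrary:

```python
import numpy as np

def dip_median(data, dip, half=2):
    """Median filter a gather data[trace, time] along a fixed dip
    (samples per trace). Events aligned with the dip are preserved;
    events with the opposite dip are attenuated."""
    n_tr, n_t = data.shape
    out = np.zeros_like(data)
    for i in range(n_tr):
        for t in range(n_t):
            samples = []
            for k in range(-half, half + 1):
                j, s = i + k, t + k * dip
                if 0 <= j < n_tr and 0 <= s < n_t:
                    samples.append(data[j, s])
            out[i, t] = np.median(samples)
    return out

# Synthetic gather: a downgoing event (arrival time increases with depth,
# dip +1) superposed with an upgoing event (dip -1).
n = 21
d = np.zeros((n, n))
for i in range(n):
    d[i, i] += 1.0            # downgoing event
    d[i, n - 1 - i] += 1.0    # upgoing event

down = dip_median(d, dip=+1)  # filter along the downgoing dip
assert np.isclose(down[5, 5], 1.0)    # downgoing event survives
assert np.isclose(down[5, 15], 0.0)   # upgoing event is suppressed
```

Running the same filter with dip = −1 isolates the upgoing wavefield; in the proposed method, the per-sample dips come from the MWS slope field instead of a single global value, and the nested loops are the part distributed across CPU threads.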