Ray-tracing rendering has long been considered a promising technology for enabling a higher level of visual experience. The democratization of ray-tracing rendering to consumer platforms, however, poses significant challenges to rendering hardware and software due to its highly irregular computing patterns. In fact, modern ray-tracing techniques typically depend on a tree-based acceleration structure to reduce the computational complexity of intersection testing between rays and graphics primitives. The traversal by a massive number of rays on a graphics processing unit (GPU) incurs a significant amount of irregular memory traffic, which turns out to be a major stumbling block for real-time performance. In this work, a scheduling mechanism, termed Agglomerative Memory and Thread Scheduling, is proposed to unleash the inherent parallelism in the ray-tracing process on GPUs. It is associated with a tile-based ray-tracing framework in which the acceleration structure (i.e., a KD-tree in this work) is partitioned into subtrees that can be loaded entirely into the on-chip L1 cache of a streaming multiprocessor. An effective scheduling mechanism collects threads according to the subtrees hit by their respective rays and regroups them into warps for dispatching. In addition, subtrees are dynamically preloaded into the L1 cache of multiprocessors in an on-demand fashion. The proposed scheduler can be integrated into today's high-end GPUs with only minor overhead. Microarchitecture simulation results show that the proposed framework significantly improves memory efficiency and outperforms a traditional GPU microarchitecture by 47.4% on average.
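The regrouping step described above — bucketing threads by the subtree their ray currently targets and packing each bucket into warps — can be sketched as follows. This is a minimal illustration of the idea, not the paper's hardware scheduler; the names, the warp size of 32, and the thread-record layout are assumptions.

```python
# Sketch of agglomerative thread regrouping: threads are bucketed by the
# subtree their ray currently targets, then packed into warps so that all
# lanes of a warp traverse the same cached subtree. Names are illustrative.
from collections import defaultdict

WARP_SIZE = 32

def regroup_into_warps(thread_ids, subtree_of):
    """Group thread ids by target subtree and pack each group into warps."""
    buckets = defaultdict(list)
    for tid in thread_ids:
        buckets[subtree_of[tid]].append(tid)
    warps = []
    for subtree, tids in sorted(buckets.items()):
        # each warp now contains only threads that need the same subtree,
        # so one L1-resident subtree serves the whole warp
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((subtree, tids[i:i + WARP_SIZE]))
    return warps
```

The payoff is memory locality: once a subtree is preloaded into L1, every lane of the dispatched warp traverses it, instead of warps issuing scattered accesses across the whole tree.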
We present a CUDA-based parallel implementation on GPU architecture of a modified version of the Smoothed Particle Hydrodynamics (SPH) method. This modified formulation exploits a strategy based on the Taylor series expansion, which simultaneously improves the approximation of a function and its derivatives with respect to the standard formulation. The improvement in accuracy comes at the cost of additional computational effort. This computational demand becomes increasingly crucial as the problem size increases, but can be addressed by employing fast summations in a parallel computational scheme. The experimental analysis showed that our parallel implementation significantly reduces the runtime compared to the CPU-based implementation.
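The core idea of a Taylor-series-based SPH correction — estimating a function and its derivative *together* from neighbour samples rather than separately — can be sketched in 1D. This is an illustrative weighted least-squares version under a simple bell-shaped weight, not the paper's formulation or kernel.

```python
# Sketch of the Taylor-series-based idea: f and f' at a point are obtained
# simultaneously by solving the weighted normal equations of the first-order
# Taylor expansion f(x_j) ~ f(x0) + f'(x0)*(x_j - x0) over the neighbours.
# The weight function below is illustrative, not the paper's kernel.

def taylor_sph_estimate(x0, xs, fs, h):
    # weights from a simple bell-shaped kernel of smoothing length h
    ws = [max(0.0, 1.0 - ((x - x0) / h) ** 2) for x in xs]
    # weighted least squares for [f(x0), f'(x0)]
    s0 = sum(ws)
    s1 = sum(w * (x - x0) for w, x in zip(ws, xs))
    s2 = sum(w * (x - x0) ** 2 for w, x in zip(ws, xs))
    t0 = sum(w * f for w, f in zip(ws, fs))
    t1 = sum(w * f * (x - x0) for w, f, x in zip(ws, fs, xs))
    det = s0 * s2 - s1 * s1
    a = (t0 * s2 - t1 * s1) / det   # estimate of f(x0)
    b = (s0 * t1 - s1 * t0) / det   # estimate of f'(x0)
    return a, b
```

Unlike the standard SPH gradient estimate, this small linear solve reproduces linear fields exactly, which is the sense in which the Taylor-based formulation improves accuracy — at the cost of a per-particle system solve, hence the paper's emphasis on parallel fast summations.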
Solvent plays an essential role in a variety of chemical, physical, and biological processes that occur in the solution phase. The reference interaction site model (RISM) and its three‐dimensional extension (3D‐RISM) serve as powerful computational tools for modeling solvation effects in chemical reactions, biological functions, and structure formations. We present the RISM integrated calculator (RISMiCal) program package, which is based on RISM and 3D‐RISM theories with fast GPU code. RISMiCal has been developed as an integrated RISM/3D‐RISM program with interfaces to external programs such as Gaussian16, GAMESS, and Tinker. Fast single‐ and multi‐GPU 3D‐RISM codes written in CUDA enhance the availability of these hybrid methods, which require many computationally expensive 3D‐RISM calculations. We expect that our package can be widely applied to chemical and biological processes in solvent. The RISMiCal package is available at https://rismical-dev.github.io.
RISMiCal: A software package to perform fast RISM/3D‐RISM calculations for evaluation of solvation thermodynamics.
The finite element time-domain (FETD) method is an appealing electromagnetic (EM) field-solving procedure for deriving high-resolution field distributions of complex geological models. However, the associated computational burden has become a main bottleneck restricting the efficient implementation of ground penetrating radar (GPR) simulation at high speed and large scale. In this research, we developed a high-efficiency EM solver that discretizes the partial differential equations on unstructured triangular meshes, using a high-order graphics processing unit (GPU)-based finite element discontinuous Galerkin time-domain (DGTD) method. By introducing a semi-discrete strong form instead of solving large stiffness matrices, and by employing a Runge-Kutta temporal integration scheme, the DGTD method effectively overcomes the problem of memory shortage and thereby resolves the issue of instability. In addition, the uniaxial perfectly matched layer (UPML) is extended to match lossy media and used as an absorbing boundary condition to simulate open space. Three models were constructed to compare the DGTD method with state-of-the-art methods (FDTD, FETD) on different grids in terms of computing accuracy, the dispersion degree of the rough interface, and memory occupied. Specifically, we explored the detailed effect of both the grid size and the order of the basis functions on the modeling accuracy of the proposed GPU-DGTD method. We finally verified the numerical solution and demonstrated its applications by simulating a complex model for tunnel geological forecasting. The experimental results reveal that the order of the basis functions N and the grid size d are closely related to the wavelength λ of the EM waves; for example, an appropriate choice is d/N ≈ λ/15.
•The unstructured-grid high-order GPU-DGTD is developed for GPR.•GPU-DGTD can be applied to complex models with high accuracy and efficiency.•d/N ≈ λ/15 is the recommended formula relating grid size and order of basis functions.•A high-order DGTD with a large grid is better than a low-order one with a small grid.
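The semi-discrete ("method of lines") formulation mentioned above reduces the PDE to an ODE system du/dt = L(u), advanced with an explicit Runge-Kutta scheme. A minimal sketch of one classical RK4 step follows; the operator L here is a toy stand-in for the DG spatial residual, and the paper's actual Runge-Kutta variant is not specified here.

```python
# Sketch of an explicit Runge-Kutta update for a semi-discrete system
# du/dt = L(t, u), as produced by a DG spatial discretization.
# Classical RK4 is shown; the DG residual L is a placeholder callable.

def rk4_step(u, t, dt, L):
    k1 = L(t, u)
    k2 = L(t + dt / 2, u + dt / 2 * k1)
    k3 = L(t + dt / 2, u + dt / 2 * k2)
    k4 = L(t + dt, u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Because each stage only evaluates the residual element-by-element, no global stiffness matrix is assembled or inverted — which is exactly why this formulation maps well onto GPU memory constraints.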
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
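The multi-column averaging step — each column emits class probabilities and the ensemble prediction is their mean — can be sketched as below. The columns here are stand-in callables, not the paper's trained networks.

```python
# Sketch of multi-column ensemble averaging: each "column" (a network trained
# on a differently preprocessed input) returns class probabilities, and the
# predicted class maximizes their arithmetic mean.

def ensemble_predict(columns, x):
    n_classes = len(columns[0](x))
    avg = [0.0] * n_classes
    for col in columns:
        probs = col(x)
        avg = [a + p / len(columns) for a, p in zip(avg, probs)]
    return max(range(n_classes), key=lambda c: avg[c])
```

Averaging over columns that saw differently preprocessed inputs is what makes the ensemble robust: individual columns err on different examples, and the mean suppresses those uncorrelated errors.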
The template-matching algorithm (TMA) has been widely adopted for improving the reliability of earthquake detection. The TMA is based on calculating the normalized cross-correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments. In realistic applications, the computational cost of the TMA is much higher than that of traditional techniques. In this study, we provide an analysis of the TMA and show how the GPU architecture provides an almost ideal environment for accelerating the TMA and NCC-based pattern recognition algorithms in general. So far, our best-performing GPU code has achieved a speedup factor of more than 800 with respect to a common sequential CPU code. We demonstrate the performance of our GPU code using seismic waveform recordings from the ML 6.6 Meinong earthquake sequence in Taiwan.
•We have developed efficient GPU code for normalized cross-correlation coefficient (NCC) calculations.•Our current GPU code has achieved more than 800 times speedup with respect to a sequential CPU code.•Our GPU-based NCC code has been applied to the template-matching algorithm (TMA) for earthquake detection.
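The NCC computation that the GPU kernels parallelize can be sketched as a reference sequential version: at each lag, the template and the aligned window of the continuous trace are correlated after removing their means and normalizing by their energies. This is a generic NCC definition for illustration, not the authors' CUDA code.

```python
# Reference (sequential) normalized cross-correlation of a template against
# every aligned window of a continuous trace. Each lag is independent, which
# is what makes the computation map so well onto GPU threads.
import math

def ncc_series(template, trace):
    n = len(template)
    tm = sum(template) / n
    t0 = [v - tm for v in template]          # demeaned template
    tnorm = math.sqrt(sum(v * v for v in t0))
    out = []
    for lag in range(len(trace) - n + 1):
        w = trace[lag:lag + n]
        wm = sum(w) / n
        w0 = [v - wm for v in w]             # demeaned window
        wnorm = math.sqrt(sum(v * v for v in w0))
        num = sum(a * b for a, b in zip(t0, w0))
        denom = tnorm * wnorm
        out.append(num / denom if denom > 0 else 0.0)
    return out
```

A detection is then declared wherever the NCC exceeds a chosen threshold; on the GPU, each thread (or block) evaluates one or more lags in parallel.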
This study focuses on the implementation of the spectral difference (SD) method on hexahedral elements on NVIDIA graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA) for aeroacoustic problems. Three issues were addressed in this implementation: optimization of the thread parallelism strategy within the GPU, management of data access patterns, and multi-GPU parallelization. Computational speed testing showed that these three factors significantly affect the efficiency of the code on the GPU. The implemented GPU solver was validated using an inviscid problem and a viscous problem. The numerical results show that the GPU solver achieves the same level of accuracy as the CPU program, with remarkable speed improvements. Specifically, compared with a single CPU core with a turbo boost frequency of 3.2 GHz (Intel Xeon Silver 4210), the inviscid case tested on an RTX 2070 Super GPU achieved acceleration of 122.4×, and the viscous case conducted on an RTX 3090 GPU achieved acceleration of 229.7×. Additionally, the GPU solver exhibits a parallel efficiency exceeding 93% when performing parallel computing on a platform with multiple RTX 3090 GPU cards. Furthermore, the GPU-accelerated computational aeroacoustics solver was applied to compute the noise from a low-speed propeller. The computed results were compared with experimental data, and the excellent agreement demonstrated the effectiveness and feasibility of the GPU implementation of the SD solver.
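For clarity on how the reported figures relate: speedup is serial time over parallel time, and multi-GPU parallel efficiency is the speedup relative to one GPU divided by the number of GPUs. The sketch below illustrates the definitions; the timing values in the test are made up, not measurements from the paper.

```python
# Definitions behind the reported 122.4x / 229.7x speedups and the >93%
# multi-GPU parallel efficiency.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def parallel_efficiency(t_one_gpu, t_n_gpus, n):
    # ideal scaling gives t_n_gpus = t_one_gpu / n, i.e. efficiency 1.0
    return speedup(t_one_gpu, t_n_gpus) / n
```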
Particle tracking simulations with space charge effects are very important for high-intensity proton rings. Since they include not only the Hamiltonian mechanics of single particles but also the construction of charge densities and the solution of Poisson equations to obtain the electromagnetic field due to the space charge, they are extremely time-consuming. We have developed a new particle tracking simulation code that runs on graphics processing units (GPUs). GPUs have strong parallel-processing capacity, so the single-particle mechanics can be computed very fast with complete parallelization. Our new code also includes the space charge effect, which requires constructing charge densities, a step that cannot be completely parallelized. For the charge density construction, we can use shared memory, which can be accessed very fast from each thread; its availability is another advantage of GPU computing. As a result of this development, our particle tracking including the space charge effect runs approximately ten times faster than with our conventional CPU code.
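The shared-memory charge-deposition pattern described above can be sketched as follows: each block of particles accumulates into a private per-block histogram (the analogue of fast on-chip shared memory), and the private copies are merged into the global grid afterwards. This is a CPU-side illustration of the access pattern, not the authors' CUDA code; the nearest-grid-point deposition and the block size are assumptions.

```python
# Sketch of block-private charge deposition: accumulating into a per-block
# "shared memory" copy of the grid first reduces contention on the global
# grid, mirroring the shared-memory strategy used on the GPU.

def deposit_charge(positions, charges, n_cells, block_size=256):
    """positions are normalized to [0, 1); nearest-grid-point deposition."""
    global_grid = [0.0] * n_cells
    for start in range(0, len(positions), block_size):
        local = [0.0] * n_cells            # per-block private histogram
        for x, q in zip(positions[start:start + block_size],
                        charges[start:start + block_size]):
            cell = min(int(x * n_cells), n_cells - 1)
            local[cell] += q               # cheap "shared memory" update
        for i, v in enumerate(local):      # one merge per block into global
            global_grid[i] += v
    return global_grid
```

On an actual GPU the local updates would be atomic adds into shared memory by the threads of a block, with a single reduction into global memory at the end, which is far cheaper than every thread issuing a global atomic per particle.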
As a typical fluid–structure interaction (FSI) problem, water entry involves violent fluid flows and changing free surfaces, which presents great challenges for numerical modeling. Smoothed Particle Hydrodynamics (SPH) is a Lagrangian particle method that has natural advantages in modeling free surfaces and moving interfaces. However, SPH is computationally expensive due to the search for particle–particle interactions, which causes great difficulties for large-scale simulations of 3D FSI problems. In this work, we present an accelerated SPH framework based on Graphics Processing Unit (GPU) techniques to study water entry problems. Multi-threading programmed with the Compute Unified Device Architecture (CUDA) is applied to enhance the computational performance in terms of efficiency and scale. Compared to the single-CPU-based strategy, the newly presented GPU-accelerated SPH method is computationally more efficient, with a speedup of several hundred times, and makes more memory available, enabling large-scale three-dimensional simulations of around ten million particles. With the GPU-accelerated SPH method, the 3D water entry of a circular cylinder is investigated, and some kinematic and dynamic characteristics are explained. The results demonstrate that the rotational characteristic of a 3D cylinder in water entry is related to the dimensionless number γ, defined as the ratio of the initial inclination angle to the initial velocity angle. The rotation of the cylinder changes from anticlockwise to clockwise with increasing γ; a transition value of γ exists between anticlockwise and clockwise rotation, lying in the range from 1.0 to 6.0. Meanwhile, the water entry of a 3D circular cylinder leads to a violent impact on the bottom of the cylinder, which causes the pressure to reach its maximum value at the early stage of the water entry.
It is also indicated that the selection of the initial inclination angle has a great effect on the maximum pressure.
•A GPU-accelerated SPH method for fluid–structure interaction problems is developed.•The GPU-accelerated SPH method can greatly improve the computational ability compared to CPU-based SPH.•The water entry characteristics of a 3D cylinder relate to the ratio of the initial inclination angle to velocity angle.•A cross-region that the trajectory of the centroid always passes through is determined by the initial velocity angle.
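The particle-interaction search that dominates SPH cost is typically made tractable with a uniform-grid (cell-list) scheme, which is also what maps well onto GPU threads: particles are binned into cells the size of the support radius, so each particle only tests neighbouring cells. The sketch below shows the idea in 2D for brevity; it is an illustration of the standard technique, not the authors' CUDA implementation.

```python
# Sketch of cell-list neighbour search: bin particles into cells of width
# equal to the support radius, then test only the 3x3 surrounding cells
# instead of all particle pairs (2D here for brevity).
from collections import defaultdict

def neighbours(points, radius):
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // radius), int(y // radius))].append(idx)
    pairs = set()
    for idx, (x, y) in enumerate(points):
        cx, cy = int(x // radius), int(y // radius)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), ()):
                    if j > idx:  # each pair reported once
                        xj, yj = points[j]
                        if (x - xj) ** 2 + (y - yj) ** 2 <= radius ** 2:
                            pairs.add((idx, j))
    return pairs
```

This reduces the interaction search from O(N²) pairwise tests to roughly O(N) for quasi-uniform particle distributions, which is what makes ten-million-particle runs feasible.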
Simulations of dendritic solidification involving melt convection and solid motion usually require a computational domain considerably larger than the dendrite size, making computation on a uniform mesh extremely inefficient. In this study, to accelerate such two-dimensional simulations using the phase-field and lattice Boltzmann (PF-LB) methods, we developed a parallel computing method with multiple graphics processing units (GPUs) for the adaptive mesh refinement (AMR) method with dynamic load balancing (parallel-GPU AMR). It was confirmed that parallel-GPU AMR simulations were faster than those with the uniform mesh when the number of grid points in the adaptive mesh was around 40% or less of that in the uniform mesh. We also demonstrate that the developed parallel-GPU AMR can greatly accelerate PF-LB simulations of dendrite growth with melt convection and solid motion.
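The refinement decision at the heart of AMR can be sketched simply: cells where the field varies sharply (e.g. near the solid–liquid interface) are flagged for refinement, so fine resolution is spent only where needed. The 1D setting and jump-threshold criterion below are illustrative simplifications, not the paper's refinement criterion.

```python
# Sketch of an AMR refinement flag: mark cells where the jump in the field
# across a cell face exceeds a threshold (e.g. near a phase-field interface),
# leaving smooth far-field regions coarse.

def flag_for_refinement(field, threshold):
    flags = [False] * len(field)
    for i in range(len(field) - 1):
        if abs(field[i + 1] - field[i]) > threshold:
            flags[i] = flags[i + 1] = True
    return flags
```

In the multi-GPU setting, dynamic load balancing then redistributes the flagged (refined) blocks across devices, since refinement concentrates work near the growing dendrite rather than spreading it evenly over the domain.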