The lattice Boltzmann method (LBM) is a fluid-flow simulation algorithm whose locality and simplicity make it well suited to GPU acceleration and to the simulation of complex flows. However, LBM simulations involving complex solid boundaries require each boundary node to know the type, fluid or solid, of every neighbor node while the boundary conditions are executed, which entails substantial data transfer between global and local memory on the GPU. These transfers account for a large share of the run time and can significantly reduce simulation efficiency. This article proposes a novel boundary processing scheme that encodes the neighbor-node information into a single integer stored on the local node. We choose two- and three-dimensional porous-medium flows to test the performance of the proposed scheme on complex boundary geometries and compare it with the usual schemes, which redundantly retrieve information from neighbors. The comparison shows that the proposed scheme can improve overall computing efficiency by up to 40% for 3D flow simulations through porous media. This improvement is achieved by reducing the time spent on data transfer.
•A novel scheme encodes neighbor nodes' information into single integers by binary encoding and stores them locally.
•The single integer includes the local and neighboring node types and the boundary condition types.
•The proposed scheme improves GPU efficiency by up to 40% for 3D simulation of flow through porous media.
•The approach can be combined with other techniques, such as indirect addressing, to enhance performance further.
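The binary encoding described above can be sketched as follows. This is a minimal illustration, not the paper's actual layout: it assumes a D2Q9 lattice (eight non-rest directions), one bit per neighbor for the solid/fluid flag, and the boundary-condition type stored in the higher bits. All names are hypothetical.

```python
# Hypothetical bit layout: bit i = solid/fluid flag of neighbor i on a
# D2Q9 lattice; bits above NEIGHBORS hold a boundary-condition type code.

NEIGHBORS = 8          # D2Q9: eight non-rest lattice directions
BC_SHIFT = NEIGHBORS   # BC type code is stored above the flag bits

def encode(neighbor_is_solid, bc_type):
    """Pack eight solid/fluid flags and a BC type into one integer."""
    code = 0
    for i, solid in enumerate(neighbor_is_solid):
        if solid:
            code |= 1 << i
    return code | (bc_type << BC_SHIFT)

def is_solid(code, i):
    """Query neighbor i's stored flag without touching that node's memory."""
    return bool((code >> i) & 1)

def bc_type(code):
    """Recover the boundary-condition type from the high bits."""
    return code >> BC_SHIFT

flags = [False, True, False, False, True, False, False, False]
code = encode(flags, bc_type=3)
assert [is_solid(code, i) for i in range(NEIGHBORS)] == flags
assert bc_type(code) == 3
```

The point of the scheme is visible in `is_solid`: a boundary node answers "is my neighbor solid?" with one shift and mask on a locally stored integer, instead of a global-memory read per neighbor.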
This paper proposes the implementation of a zero-order Takagi-Sugeno-Kang (TSK)-type fuzzy neural network (FNN) on graphics processing units (GPUs) to reduce training time. The software platform used in this study is the compute unified device architecture (CUDA). The implemented FNN uses the structure and parameter learning of a self-constructing neural fuzzy inference network because of its admirable learning performance. FNN training is conventionally implemented on a single-threaded CPU, where each input variable and fuzzy rule is processed serially. This type of training is time consuming, especially for a high-dimensional FNN that consists of a large number of rules. The GPU is capable of running a large number of threads in parallel. In a GPU-implemented FNN (GPU-FNN), blocks of threads are partitioned according to the parallel and independent properties of fuzzy rules. Large sets of input data are mapped to parallel threads in each block. For memory management, this research divides the datasets in the GPU-FNN into smaller chunks according to the fuzzy rule structures so that on-chip memory can be shared among multiple thread processors. This study applies the GPU-FNN to different problems to verify its efficiency. The results show that training an FNN with the GPU implementation achieves a speedup of more than 30 times over the CPU implementation for problems with high-dimensional attributes.
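The rule-level parallelism the abstract exploits can be seen in the inference formula of a zero-order TSK network: every rule's firing strength is computed independently, which is why rules map naturally to thread blocks. A minimal NumPy sketch, assuming Gaussian memberships and illustrative array names (not the paper's code):

```python
import numpy as np

# Zero-order TSK inference, vectorized over rules the way a GPU would map
# one thread block per rule. Shapes: m, s are (rules, inputs); c is
# (rules,). Names are illustrative assumptions.

def tsk_infer(x, m, s, c):
    """Gaussian firing strengths per rule, then a weighted average."""
    mu = np.exp(-((x - m) ** 2) / (s ** 2))   # (rules, inputs) memberships
    w = mu.prod(axis=1)                        # rule firing strengths
    return float((w * c).sum() / w.sum())      # zero-order consequent mix

rng = np.random.default_rng(0)
m = rng.normal(size=(4, 2))   # 4 rules, 2 input variables
s = np.ones((4, 2))
c = np.array([1.0, 2.0, 3.0, 4.0])
y = tsk_infer(np.zeros(2), m, s, c)
assert 1.0 <= y <= 4.0   # output is a convex combination of consequents
```

Each row of `mu` depends only on one rule's parameters, so rules can be evaluated in parallel; only the final weighted sum needs a reduction.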
Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is its large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. On the basis of these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding of the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo on a large data set of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.
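The feature hashing idea can be sketched as follows: peaks are first discretized into fine m/z bins (preserving the high resolution), and each bin index is then hashed into a fixed low-dimensional vector. The bin width, output dimension, and hash choice here are illustrative assumptions, not ANN-SoLo's actual parameters.

```python
import hashlib

# Sketch of feature hashing for mass spectra (hypothetical parameters):
# fine m/z binning keeps resolution, hashing keeps the vector small.

def hash_spectrum(peaks, bin_width=0.02, dim=400):
    """peaks: list of (mz, intensity). Returns a dim-length dense vector."""
    vec = [0.0] * dim
    for mz, intensity in peaks:
        fine_bin = int(mz / bin_width)   # high-resolution bin index
        digest = hashlib.md5(str(fine_bin).encode()).digest()
        idx = int.from_bytes(digest[:4], "little") % dim
        vec[idx] += intensity            # collide fine bins into dim slots
    return vec

# two nearly identical peaks fall into the same fine bin, hence the
# same hashed dimension, so close spectra stay close after hashing
v = hash_spectrum([(147.113, 10.0), (147.118, 5.0), (500.3, 2.0)])
assert len(v) == 400
assert abs(sum(v) - 17.0) < 1e-9
```

The trade-off is standard for feature hashing: occasional collisions between unrelated bins in exchange for a fixed, small vector dimension suitable for nearest-neighbor indexing.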
GPU Computing
Owens, John D.; Houston, Mike; Luebke, David
Proceedings of the IEEE, 05/2008, Volume 96, Issue 5
Journal Article, Peer reviewed
The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.
Simulating quantum many-body dynamics is important both for the fundamental understanding of physics and for practical applications in quantum information processing, and a variety of classical simulation methods have therefore been developed. Specifically, the Trotter-Suzuki decomposition can analyze highly complex quantum dynamics, provided the number of qubits is small enough that main memory can store the state vector. However, simulation of quantum dynamics via the Trotter-Suzuki decomposition requires a huge number of steps, each of which accesses the state vector, and hence the simulation time becomes impractically long. To settle this issue, we propose a technique to accelerate the simulation of quantum dynamics via simultaneous diagonalization of mutually commuting Pauli groups, an approach that is also attracting attention as a way to reduce the measurement overheads in quantum algorithms. We group the Hamiltonian into sets of mutually commuting Pauli strings, each of which is diagonalized in the computational basis via a Clifford transformation. Since the diagonal operators are applied to the state vector simultaneously with minimal memory access, this method successfully exploits the performance of highly parallel processors such as graphics processing units (GPUs). Compared to both CPU and GPU implementations using the fast quantum circuit simulator “qulacs,” numerical experiments show that our method provides an acceleration of a few tens of times.
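The grouping step the abstract describes can be sketched with the standard commutation rule: two Pauli strings commute if and only if they anticommute on an even number of qubit positions. The greedy grouping below is a simplified illustration of that first step only; the Clifford diagonalization itself is omitted.

```python
# Greedy grouping of Pauli strings into mutually commuting sets.
# Strings are over the alphabet I, X, Y, Z, one character per qubit.

def commute(p, q):
    """Two Pauli strings commute iff they anticommute on an even
    number of positions (both non-identity and different)."""
    anti = sum(1 for a, b in zip(p, q)
               if a != "I" and b != "I" and a != b)
    return anti % 2 == 0

def group_commuting(strings):
    """Place each string into the first group it commutes with entirely."""
    groups = []
    for s in strings:
        for g in groups:
            if all(commute(s, t) for t in g):
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

groups = group_commuting(["XX", "YY", "ZZ", "XI", "IZ"])
assert ["XX", "YY", "ZZ"] in groups   # pairwise commuting set
```

Once each group is diagonalized by a Clifford, all its terms become diagonal operators, so they can be applied to the state vector in a single pass, which is the memory-access saving the abstract attributes the GPU speedup to.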
This paper presents a new and efficient algorithm for the calculation of sub-grid distances in the context of a lattice Boltzmann method (LBM). LBMs usually operate on equidistant Cartesian grids and represent moving geometries by either using immersed boundary conditions or dynamic fill algorithms in combination with slip or no-slip boundary conditions. In order to obtain sufficiently high geometric accuracy, the sub-grid distances from Eulerian fluid nodes on uniform and structured grids to a tessellated triangular surface mesh have to be calculated. The proposed algorithm extends a previously published grid generation procedure by an efficient calculation of sub-grid distances. The algorithm is optimized for massively parallel execution on graphics processing units (GPUs). Based on a linearized representation of the obstacle surface, surface normal vectors are computed and stored, which then serve to compute the sub-grid distances. This saves GPU memory, re-uses information that is available from the surface voxelization step, and has been shown to be very accurate and efficient for implementation in a state-of-the-art LBM-GPU solver.
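The core geometric operation the abstract describes, computing a sub-grid distance from a linearized (planar) surface representation and a stored normal vector, reduces to a ray-plane intersection along each lattice link. A minimal sketch under that assumption (function and variable names are illustrative):

```python
# Sub-grid distance along a lattice link to a linearized surface:
# the surface near a boundary node is approximated by a plane through
# point p with stored unit normal n; the fluid node sits at x0 and the
# link direction is c. The returned q in [0, 1] is the link fraction
# at which the surface is crossed.

def subgrid_q(x0, c, p, n):
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    denom = dot(n, c)
    if denom == 0.0:                  # link parallel to the surface plane
        return None
    t = dot(n, [pi - xi for pi, xi in zip(p, x0)]) / denom
    return t if 0.0 <= t <= 1.0 else None   # surface not on this link

# wall plane x = 0.25 (normal +x), fluid node at origin, link along +x
q = subgrid_q([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
              [0.25, 0.0, 0.0], [1.0, 0.0, 0.0])
assert abs(q - 0.25) < 1e-12
```

Because only the stored normal and a surface point are needed, each link's distance is an independent few-flop computation, which is what makes the step GPU-friendly.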
The performance of discontinuous deformation analysis (DDA) needs to be improved for large-scale analysis. In this study, the contact detection and open-close iteration, the bottlenecks of DDA computing, are reimplemented on graphics processing units (GPUs). For contact detection, the proposed parallel method accelerates DDA computing by maintaining a high load balance and data reuse ratio on the GPU. The open-close iteration is divided into two parts, a simultaneous equations solver and an interpenetration checker. For the simultaneous equations solver, both the parallel Jacobi method and the block Jacobi preconditioned conjugate gradient (BJPCG) method are implemented on the GPU to replace the original successive overrelaxation (SOR) method. The parallel interpenetration checker is improved by optimizing conditional branches on the GPU. Two applications of the new parallel methods are introduced. The results show that the broad and narrow phases of contact detection achieve 18 and 5 times speedup, respectively; the parallel BJPCG simultaneous equations solver achieves 16 times speedup; and the parallel interpenetration checker achieves 2 times speedup. The total performance of DDA is improved by about 2 and 10 times, respectively, in the two applications using the proposed methods.
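The solver substitution the abstract describes can be illustrated with the diagonal (point Jacobi) special case of the preconditioner: applying it is one elementwise divide per row, which is exactly the kind of operation that parallelizes well on a GPU. This NumPy sketch is a generic preconditioned conjugate gradient, not the paper's code.

```python
import numpy as np

# Jacobi-preconditioned conjugate gradient for a symmetric positive
# definite system A x = b. The preconditioner solve z = M^{-1} r is an
# elementwise divide by diag(A), hence embarrassingly parallel.

def jacobi_pcg(A, b, tol=1e-10, max_iter=200):
    d = np.diag(A)                    # preconditioner M = diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = r / d                         # apply M^{-1}
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / d
        rz_new = r @ z
        p = z + (rz_new / rz) * p     # conjugate search direction
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = jacobi_pcg(A, b)
assert np.allclose(A @ x, b)
```

The block Jacobi variant used in the paper replaces the elementwise divide with small independent per-block solves, keeping the same structure and parallelism.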
Two-dimensional shallow-water schemes on Cartesian grids are amenable to graphics processing units and are thus a convenient choice for fast flood simulations. A comparison of recent schemes and validation on important use cases is essential for developers and practitioners working with flood simulation tools. In this paper, we discuss three state-of-the-art shallow-water schemes: a first-order upwind scheme, a second-order upwind scheme, and a second-order central-upwind scheme. We analyze the advantages and disadvantages of each scheme on historical Danube river floods at three regions in Austria. We study the Lobau region, a floodplain with several small channels; the Wachau region, with the meandering Danube in a steep valley; and the Marchfeld region, located at the confluence of the March and Danube rivers. The validation case studies show that the second-order schemes provide better estimates of the water levels than the first-order scheme. Still, the first-order scheme is useful because it offers fast simulations and reasonable results at higher resolutions. The best trade-off between accuracy and computational effort for simulating river floods is provided by the second-order upwind scheme.
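To make the scheme family concrete, here is a single first-order finite-volume step for the 1D shallow-water equations with a Rusanov (local Lax-Friedrichs) flux, the simplest relative of the schemes compared in the paper. It is a textbook sketch for illustration only, not one of the paper's three schemes.

```python
import math

# One first-order finite-volume step for 1D shallow water (h, hu),
# Rusanov flux; boundary cells are held fixed for simplicity.

G = 9.81  # gravitational acceleration

def flux(h, hu):
    u = hu / h
    return [hu, hu * u + 0.5 * G * h * h]

def rusanov(hL, huL, hR, huR):
    fL, fR = flux(hL, huL), flux(hR, huR)
    a = max(abs(huL / hL) + math.sqrt(G * hL),
            abs(huR / hR) + math.sqrt(G * hR))   # max local wave speed
    qL, qR = [hL, huL], [hR, huR]
    return [0.5 * (fL[i] + fR[i]) - 0.5 * a * (qR[i] - qL[i])
            for i in range(2)]

def step(h, hu, dx, dt):
    n = len(h)
    F = [rusanov(h[i], hu[i], h[i + 1], hu[i + 1]) for i in range(n - 1)]
    for i in range(1, n - 1):         # conservative update, interior cells
        h[i] -= dt / dx * (F[i][0] - F[i - 1][0])
        hu[i] -= dt / dx * (F[i][1] - F[i - 1][1])
    return h, hu

h = [2.0] * 5 + [1.0] * 5             # dam-break initial state
hu = [0.0] * 10
h, hu = step(h, hu, dx=1.0, dt=0.1)
assert h[4] < 2.0 and h[5] > 1.0      # water starts flowing across the dam
```

Every interface flux depends only on its two adjacent cells, which is why such schemes map so directly onto GPU threads; the second-order variants add a slope reconstruction per cell but keep the same local stencil.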
A graphics processing unit (GPU)-accelerated vector-form particle-element method, the finite particle method (FPM), is proposed for 3D elastoplastic contact of structures involving strong nonlinearities and computationally expensive contact calculations. A hexahedral FPM element with reduced integration and anti-hourglass control is developed to model structural elastoplastic behavior. The 3D space containing the contact surfaces is decomposed into cubic cells, and the contact search is performed between adjacent cells to improve search efficiency. A linked-list data structure is used to store the contact particles and facilitate the parallel contact search procedure. The contact constraints are enforced by explicitly applying normal and tangential contact forces to the contact particles. The proposed method is fully accelerated by GPU-based parallel computing. After verification, the performance of the proposed method is compared with the serial finite element code Abaqus/Explicit on two large-scale contact examples. The maximum speedup of the proposed method over Abaqus/Explicit is approximately 80 for the overall computation and 340 for the contact calculations. The proposed method is therefore shown to be effective and efficient.
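The cell decomposition with a linked-list particle store can be sketched with the classic head/next array pair: each cell records the index of its first particle, and a per-particle `next` array chains the rest. This is a generic cell-list sketch (names and grid parameters are illustrative), not the paper's implementation.

```python
# Cell list with head/next arrays: head[c] is the first particle in
# cell c (or -1), next_[i] chains to the next particle in the same cell.
# Neighbor searches then only visit adjacent cells instead of all pairs.

def build_cell_list(positions, cell_size, grid):
    nx, ny, nz = grid
    head = [-1] * (nx * ny * nz)
    next_ = [-1] * len(positions)
    for i, (x, y, z) in enumerate(positions):
        c = (int(x / cell_size) +
             int(y / cell_size) * nx +
             int(z / cell_size) * nx * ny)
        next_[i] = head[c]            # push particle onto the cell's chain
        head[c] = i
    return head, next_

def particles_in_cell(head, next_, c):
    """Walk the chain of cell c and collect its particle indices."""
    out, i = [], head[c]
    while i != -1:
        out.append(i)
        i = next_[i]
    return out

pos = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (1.5, 0.1, 0.1)]
head, next_ = build_cell_list(pos, cell_size=1.0, grid=(2, 2, 2))
assert sorted(particles_in_cell(head, next_, 0)) == [0, 1]
assert particles_in_cell(head, next_, 1) == [2]
```

The structure uses only two flat integer arrays, which is why it transfers to GPU memory cheaply; on the GPU the push step is typically done with an atomic exchange on `head`.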
Understanding the hydrodynamics of gas–solid flows is a grand challenge in mechanical and chemical engineering. Continuum-based two-fluid models (TFM) are currently not accurate enough to describe the multi-scale heterogeneity, while the discrete particle method (DPM), which follows the trajectory of each particle, is computationally infeasible for industrial systems. Following our previous work, we report in this article a coarse-grained DPM that accounts for the meso-scale structure via the energy-minimization multi-scale (EMMS) model; it can be orders of magnitude faster than the traditional DPM and can take full advantage of CPU–GPU (graphics processing unit) hybrid supercomputing. The size and solids concentration of the coarse-grained particles (CGPs), as well as their interactions with the gas flow (the drag), are determined by the EMMS model with a two-phase decomposition. The interactions between CGPs are determined according to the kinetic theory of granular flows (KTGF). The method is tested by simulating the onset of fluidization and the steady-state flow in lab-scale circulating fluidized bed (CFB) risers with different geometries and operating conditions, in both 2D and 3D. The results agree well with experiments and with the traditional DPM based on single particles. Finally, the prospect of this method as a higher-resolution alternative to TFM for engineering applications, and even for virtual process engineering, is discussed.
•A coarse-grained discrete particle method is proposed.
•Coarse graining is based on the energy-minimization multi-scale model.
•The restitution coefficient of coarse-grained particles is determined by kinetic theory.
•The method is validated by comparing the pressure drop with the Ergun equation.
•The method is validated by experiments on two lab-scale risers.