We propose a fast collision detection method that uses graphics hardware and targets deformable objects, including self-collisions. The method uses a layered depth image (LDI), generated by depth peeling, together with transform feedback. We modify the depth peeling method so that it peels object surfaces exactly, and we use the resulting LDI representations for fast collision detection. In addition, transform feedback makes the collision detection process very fast, since there is no need to read results back from the GPU to the CPU. The proposed method was implemented on a PC with an NVIDIA GeForce 9800 GT graphics card and applied to two objects consisting of 10,000 triangles in total at an image-space resolution of 400 × 400 pixels. The resulting collision detection time was about 24 ms with 96% detection.
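As a rough sketch of the interval test an LDI enables (each pixel stores the sorted entry/exit depths of the surfaces it sees), here is a minimal CPU version in Python. The function names and the dict-based LDI layout are illustrative assumptions; the paper performs this step on the GPU using depth peeling and transform feedback.

```python
# Minimal CPU sketch of LDI-style per-pixel interval overlap testing
# (hypothetical layout; the paper's method runs on the GPU).

def intervals_overlap(a, b):
    """Two closed depth intervals (entry, exit) overlap iff neither
    ends before the other begins."""
    return a[0] <= b[1] and b[0] <= a[1]

def detect_collision(ldi_a, ldi_b):
    """ldi_a, ldi_b: dict pixel -> list of (entry_depth, exit_depth)
    layers. Reports pixels where any layer of A penetrates any layer
    of B."""
    colliding = []
    for px in ldi_a.keys() & ldi_b.keys():
        if any(intervals_overlap(ia, ib)
               for ia in ldi_a[px] for ib in ldi_b[px]):
            colliding.append(px)
    return colliding

ldi_a = {(0, 0): [(0.1, 0.3)], (1, 0): [(0.2, 0.4)]}
ldi_b = {(0, 0): [(0.25, 0.5)], (1, 0): [(0.6, 0.8)]}
print(detect_collision(ldi_a, ldi_b))  # [(0, 0)]
```

Per-pixel independence is what makes this test map well to image-space GPU execution.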
The article presents the assumptions and architectural implementation of a cargo transport control support system, as well as preliminary results of tests of the system's operation in environmental conditions identical to those of natural operation. The aim of the article is to publish the research results collected during the design, implementation and operation of the system, relating to the proposed solutions for selected problems that emerged during the development of the system by the research and development team. The article also presents the completed development work and opportunities for improving the system's security through integration with modern data processing systems based on blockchain technology.
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip will not be practical due to slowing growth in transistor density, low chip yields, and photoreticle limitations. To maintain performance scalability, proposals exist to aggregate discrete GPUs into a larger virtual GPU and to decompose a single GPU into multi-chip modules with increased aggregate die area. These approaches introduce non-uniform memory access (NUMA) effects and lead to decreased performance and energy efficiency if not managed appropriately. To overcome these effects, we propose a holistic Locality-Aware Data Management (LADM) system designed to operate on massive logical GPUs composed of multiple discrete devices, which are themselves composed of chiplets. LADM has three key components: a threadblock-centric index analysis, a runtime system that performs data placement and threadblock scheduling, and an adaptive cache insertion policy. The runtime combines information from the static analysis with topology information to proactively optimize data placement, threadblock scheduling, and remote data caching, minimizing off-chip traffic. Compared to state-of-the-art multi-GPU scheduling, LADM reduces inter-chip memory traffic by 4× and improves system performance by 1.8× on a future multi-GPU system.
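To make the locality idea concrete, here is a toy, assumption-laden sketch (not the paper's actual LADM runtime or NVIDIA's scheduler): place each threadblock on the chiplet that owns the majority of the memory pages its index analysis predicts it will touch.

```python
# Toy illustration of locality-aware threadblock placement: majority-
# owner voting over predicted page accesses. The data structures and
# policy are hypothetical simplifications of the LADM idea.

from collections import Counter

def place_threadblocks(tb_page_accesses, page_owner):
    """tb_page_accesses: tb_id -> list of page ids it reads/writes.
    page_owner: page id -> chiplet id holding that page.
    Returns tb_id -> chiplet id chosen to minimise remote traffic."""
    placement = {}
    for tb, pages in tb_page_accesses.items():
        votes = Counter(page_owner[p] for p in pages)
        placement[tb] = votes.most_common(1)[0][0]  # majority owner
    return placement

page_owner = {0: 0, 1: 0, 2: 1, 3: 1}
accesses = {"tb0": [0, 1, 2], "tb1": [2, 3, 3]}
print(place_threadblocks(accesses, page_owner))  # {'tb0': 0, 'tb1': 1}
```

A real system would combine this with topology information and a cache insertion policy, as the abstract describes.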
GPGPU Programming for Dipolar Field Calculation Pricop, Sebastian; Ababei, Răzvan Vasile
BULETINUL INSTITUTULUI POLITEHNIC DIN IAȘI. Secția Matematica. Mecanică Teoretică. Fizică, 01/2024, Volume 70, Issue 1
Journal Article
Peer reviewed
Open access
Accelerating computational processes is paramount in numerical infrastructure development, particularly in applications such as the finite element method (FEM) and extensive calculations for simulating 3D processes in materials. In this work, we introduce a novel technique for computing the magnetostatic field of an ellipsoidal particle, leveraging CUDA on a graphics card for parallel processing. The GPU implementation resulted in a remarkable 20-fold improvement in calculation speed. This achievement not only expedites research tasks but also enables the exploration of larger and more intricate simulations, facilitating quicker model refinements and deeper insights into material behaviours under various conditions. The utilization of GPU computing aligns with the broader trend in scientific research and engineering, offering a versatile solution for diverse computational challenges beyond this specific task of magnetism. Overall, our work contributes to the ongoing effort to harness high-performance computing (HPC) technologies for accelerated and more efficient simulations in materials science and related fields.
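The title's dipolar field calculation is an embarrassingly parallel sum over sources, which is what a CUDA kernel distributes across threads. A generic point-dipole sum is sketched below in NumPy as an illustration; it is not the paper's ellipsoid computation, and the units and field formula are the standard magnetostatic dipole expression.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum permeability (T*m/A)

def dipolar_field(obs, positions, moments):
    """Magnetic field at point `obs` from many point dipoles,
    B = mu0/(4*pi) * (3(m.rhat)rhat - m)/|r|^3, vectorized over
    sources -- the per-source independence a GPU kernel exploits."""
    r = obs - positions                       # (N, 3) separations
    d = np.linalg.norm(r, axis=1)             # (N,)  distances
    rhat = r / d[:, None]
    mdotr = np.einsum('ij,ij->i', moments, rhat)
    terms = (3.0 * mdotr[:, None] * rhat - moments) / d[:, None] ** 3
    return MU0 / (4.0 * np.pi) * terms.sum(axis=0)

# Sanity check: one dipole m = z-hat at the origin, observed on the
# z axis at distance 1 m, gives B_z = mu0/(2*pi) = 2e-7 T.
B = dipolar_field(np.array([0.0, 0.0, 1.0]),
                  np.array([[0.0, 0.0, 0.0]]),
                  np.array([[0.0, 0.0, 1.0]]))
print(B[2])  # 2e-07
```

Replacing the NumPy vectorization with one CUDA thread per source (or per observation point) is the parallelization strategy the abstract alludes to.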
This paper shows a comparison between the Kepler, Maxwell and Pascal GPU architectures using CUDA Fortran, with and without dynamic calls, to efficiently solve partial differential equations. The target is to show the possibility of using affordable hardware, such as the GTX670, GTX970 and GTX1080 NVIDIA GPUs, which are commonly found in personal and portable computers, for scientific applications. For simplicity we consider a standard wave equation, where we use a second-order finite difference method for the spatial and time discretizations to obtain the numerical solution. We found that as the spatial resolution of the domain increases, so does the performance difference between the GPU and the Central Processing Unit (CPU).
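The second-order-in-space-and-time discretization described above can be sketched with a minimal 1D leapfrog update in NumPy (an illustration only, not the paper's CUDA Fortran code; the stencil is the standard one for u_tt = c^2 u_xx):

```python
import numpy as np

def wave_step(u, u_prev, courant2):
    """One leapfrog update of the 1D wave equation u_tt = c^2 u_xx
    with fixed ends, where courant2 = (c*dt/dx)**2. Every interior
    point updates independently -- the data parallelism GPU versions
    exploit."""
    u_next = np.empty_like(u)
    u_next[1:-1] = (2.0 * u[1:-1] - u_prev[1:-1]
                    + courant2 * (u[2:] - 2.0 * u[1:-1] + u[:-2]))
    u_next[0] = u_next[-1] = 0.0  # Dirichlet boundaries
    return u_next

x = np.linspace(0.0, 1.0, 101)
u0 = np.sin(np.pi * x)            # standing-mode initial condition
u = wave_step(u0, u0, 0.25)       # first step, zero initial velocity
```

In 3D the same stencil touches six neighbours instead of two, and it is the growth in interior points with resolution that widens the GPU/CPU performance gap the abstract reports.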
The Finite-Difference Time-Domain (FDTD) method is a popular numerical modelling technique in computational electromagnetics. The volumetric nature of the FDTD technique means simulations often require extensive computational resources (both processing time and memory). The simulation of Ground Penetrating Radar (GPR) is one such challenge, where the GPR transducer, subsurface/structure, and targets must all be included in the model, and must all be adequately discretised. Additionally, forward simulations of GPR can necessitate hundreds of models with different geometries (A-scans) to be executed. This is exacerbated by an order of magnitude when solving the inverse GPR problem or when using forward models to train machine learning algorithms.
We have developed one of the first open source GPU-accelerated FDTD solvers specifically focused on modelling GPR. We designed optimal kernels for GPU execution using NVIDIA’s CUDA framework. Our GPU solver achieved performance throughputs of up to 1194 Mcells/s and 3405 Mcells/s on NVIDIA Kepler and Pascal architectures, respectively. This is up to 30 times faster than the parallelised (OpenMP) CPU solver can achieve on a commonly used desktop CPU (Intel Core i7-4790K). We found the cost–performance benefit of the NVIDIA GeForce-series Pascal-based GPUs – targeted towards the gaming market – to be especially notable, potentially allowing many individuals to benefit from this work using commodity workstations. We also note that the equivalent Tesla-series P100 GPU – targeted towards data-centre usage – demonstrates significant overall performance advantages due to its use of high-bandwidth memory. The performance benefits of our GPU-accelerated solver were demonstrated in a GPR environment by running a large-scale, realistic (including dispersive media, rough surface topography, and detailed antenna model) simulation of a buried anti-personnel landmine scenario.
Program Title: gprMax
Program Files doi: http://dx.doi.org/10.17632/kjjm4z87nj.1
Licensing provisions: GPLv3
Programming language: Python, Cython, CUDA
Journal reference of previous version: Comput. Phys. Comm., 209 (2016), 163–170
Does the new version supersede the previous version?: Yes
Reasons for the new version: Performance improvements due to implementation of CUDA-based GPU engine
Summary of revisions: An FDTD solver has been written in CUDA for execution on NVIDIA GPUs. This is in addition to the existing FDTD solver, which has been parallelised using Cython/OpenMP for running on CPUs.
Nature of problem: Classical electrodynamics
Solution method: Finite-Difference Time-Domain (FDTD)
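To illustrate the leapfrog structure of the FDTD solution method named above, here is a minimal 1D Yee-scheme sketch in Python (normalized free-space units; an illustration of the per-cell update a CUDA kernel parallelises, not gprMax's actual Cython/CUDA code):

```python
import numpy as np

def fdtd_1d_step(ez, hy, ce, ch):
    """One leapfrog update of the 1D Yee scheme: H lives on a grid
    staggered half a cell from E; each is advanced from the spatial
    difference (curl) of the other. ce, ch fold in dt/dx and material
    constants (here simple scalars)."""
    hy += ch * (ez[1:] - ez[:-1])        # update H from curl of E
    ez[1:-1] += ce * (hy[1:] - hy[:-1])  # update E from curl of H
    return ez, hy

n = 200
ez, hy = np.zeros(n), np.zeros(n - 1)
ez[n // 2] = 1.0                         # point excitation at t = 0
for _ in range(50):
    ez, hy = fdtd_1d_step(ez, hy, 0.5, 0.5)
```

In 3D each of the six field components is updated the same way from four neighbours, and every cell updates independently, which is why the method maps so well onto GPUs.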
Homomorphic encryption (HE) offers great capabilities that can solve a wide range of privacy-preserving computing problems. This tool allows anyone to process encrypted data, producing encrypted results that only the decryption key’s owner can decrypt. Although HE has been realized in several public implementations, its performance is quite demanding. The reason for this is attributed to the huge amount of computation required by secure HE schemes. In this work, we present a CUDA-based implementation of the Fan and Vercauteren (FV) Somewhat Homomorphic Encryption (SHE) scheme. We employ several algebraic tools such as the Chinese Remainder Theorem (CRT), the Residue Number System (RNS) and the Discrete Galois Transform (DGT) to accelerate and facilitate FV computation on GPUs. We also show how the entire FV computation can be done on the GPU without multi-precision arithmetic. We compare our GPU implementation with two mature state-of-the-art implementations: 1) Microsoft SEAL v2.3.0-4 and 2) NFLlib-FV. Our implementation outperforms them, achieving average speedups of 5.37x, 7.37x, 22.22x, 5.11x and 13.18x (resp. 2.03x, 2.94x, 27.86x, 8.53x and 18.69x) for key generation, encryption, decryption, homomorphic addition and homomorphic multiplication against SEAL-FVRNS (resp. NFLlib-FV).
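The key trick mentioned in the abstract, avoiding multi-precision arithmetic via RNS/CRT, can be sketched with toy moduli (real schemes use machine-word-sized primes and the DGT for polynomial multiplication; this is only the integer-arithmetic idea):

```python
from math import prod

def to_rns(x, moduli):
    """Residue Number System: represent x by its residues modulo
    pairwise-coprime moduli, so large-integer operations become
    independent word-sized operations per modulus."""
    return [x % m for m in moduli]

def rns_mul(a, b, moduli):
    """Multiplication is componentwise -- no carries across moduli."""
    return [(x * y) % m for x, y, m in zip(a, b, moduli)]

def from_rns(residues, moduli):
    """CRT reconstruction: x = sum r_i * M_i * (M_i^{-1} mod m_i) mod M."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return x % M

moduli = [7, 11, 13]   # toy coprime moduli; M = 1001
a, b = 123, 456
c = from_rns(rns_mul(to_rns(a, moduli), to_rns(b, moduli), moduli), moduli)
print(c, (a * b) % prod(moduli))  # 32 32
```

Because every residue channel is independent, a GPU can assign channels (and polynomial coefficients) to separate threads with no cross-thread carries.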
This work explores the feasibility of real-time large-eddy simulations of flow over urban canopies at the neighborhood scale. The cumulant lattice Boltzmann method is employed using a single General Purpose Graphics Processing Unit (GPGPU). To demonstrate the validity and efficiency of this approach, we simulate wind flow in a neighborhood of Basel. Simulation results are validated against measurements from the Basel Urban Boundary Layer Experiment (BUBBLE) and are compared to previous CFD simulations. Turbulence statistics are found to be in agreement with corresponding tower measurements according to several validation metrics. While quantitative comparisons are limited to the six measurement locations of the field campaign, the available data supports the conjecture that real-time simulation of urban air flow is feasible at the neighborhood scale with the proposed numerical technique.
•LES of urban air flow in a neighborhood of Basel, Switzerland.
•Simulation results are validated against tower measurements.
•First application of cumulant LBM with quartic parametrization to a real-world problem.
•It is shown that LBM on GPGPUs can offer real-time simulations of urban air flows.
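For orientation, the collide-and-stream structure of the lattice Boltzmann method can be sketched with a plain D2Q9 BGK step (an illustration only; the paper uses the more elaborate cumulant collision operator with quartic parametrization, not BGK):

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their weights.
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, ux, uy):
    """Second-order Maxwellian expansion at each lattice node."""
    cu = C[:, 0, None, None] * ux + C[:, 1, None, None] * uy
    usq = ux**2 + uy**2
    return W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def lbm_step(f, tau=0.6):
    """One BGK collide-and-stream step on a periodic grid. Each node
    collides locally, then populations hop to neighbours -- the
    locality that makes LBM efficient on GPUs."""
    rho = f.sum(axis=0)
    ux = (f * C[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * C[:, 1, None, None]).sum(axis=0) / rho
    f += (equilibrium(rho, ux, uy) - f) / tau      # BGK relaxation
    for i, (cx, cy) in enumerate(C):               # periodic streaming
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f

f = equilibrium(np.ones((16, 16)), np.zeros((16, 16)), np.zeros((16, 16)))
f[:, 8, 8] *= 1.1                                  # small density bump
for _ in range(10):
    f = lbm_step(f)
```

Collision is node-local and streaming is a fixed-stencil shift, so both phases map naturally to one GPU thread per lattice node.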
Accel-sim Khairy, Mahmoud; Shen, Zhesheng; Aamodt, Tor M. ...
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA),
05/2020
Conference Proceeding
Open access
In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, researchers in the architecture space must work to ensure that their assumptions about contemporary processor design remain true.
To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process.
We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprising 11,440 kernel instances, from the machine learning benchmark suite DeepBench. DeepBench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs.
Finally, to highlight the effects of falling behind industry, this paper presents two case-studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
•A 3D GPGPU-parallelised FDEM is employed for modelling the impact fracture of glass.
•A cohesive fracture model accounting for the rupture of glass is implemented into the FDEM.
•The applicability of the FDEM for analysing glass impact fracture mechanisms is demonstrated through validated examples.
•The influence of impact velocity, boundary conditions and projectile nose shape on the fracture of glass is investigated.
Due to the brittleness and the wide use of glass in modern engineering applications, its vulnerability to impact actions and the corresponding fracture behaviour have attracted growing attention from academics and engineers. In this study, the impact fracture responses of glass have been modelled and simulated using a 3D GPGPU-parallelised hybrid finite-discrete element method, i.e., the FDEM. Glass is discretised into discrete elements in which a finite element formulation is incorporated, enabling accurate predictions of contact forces and structural deformation. A cohesive fracture model accounting for the rupture of glass is implemented, and numerical examples are presented and validated against results from the literature. The influence of impact velocity, boundary conditions and projectile nose shape on the fracture of glass has been investigated. It is found that: (i) the fracture pattern changes with velocity; (ii) a rigid boundary support can be used provided no damage occurs at the edge of the glass; (iii) under the same circumstances, a larger contact surface results in more severe damage. The GPGPU-parallelised FDEM provides a practical, efficient and robust computational approach for analysing the impact transient dynamic behaviour of glass in 3D.
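Cohesive fracture models of the kind implemented here relate the traction transmitted across a crack to its opening displacement. A generic bilinear traction-separation law is sketched below as an illustration of the model class; the paper's exact law and parameters for glass rupture are not reproduced.

```python
def cohesive_traction(delta, delta0, deltaf, tmax):
    """Generic bilinear cohesive law (illustrative, not the paper's
    calibrated model): traction ramps linearly to tmax at opening
    delta0, then softens linearly to zero at deltaf, where the
    cohesive element is fully ruptured."""
    if delta <= 0:
        return 0.0
    if delta < delta0:
        return tmax * delta / delta0                        # loading
    if delta < deltaf:
        return tmax * (deltaf - delta) / (deltaf - delta0)  # softening
    return 0.0                                              # ruptured

print(cohesive_traction(0.5, 1.0, 3.0, 10.0))  # 5.0 (loading branch)
print(cohesive_traction(2.0, 1.0, 3.0, 10.0))  # 5.0 (softening branch)
print(cohesive_traction(4.0, 1.0, 3.0, 10.0))  # 0.0 (ruptured)
```

The area under the curve is the fracture energy; evaluating such a law independently at every inter-element interface is one of the operations the GPGPU parallelisation accelerates.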