We propose a fast collision detection method that uses graphics hardware and handles both self-collision and deformable objects. The method uses a layered depth image (LDI), which is generated by depth peeling, and transform feedback. We modify the depth peeling method so that it peels object surfaces exactly, and we use the resulting LDI representation for fast collision detection. In addition, transform feedback makes the collision detection process very fast since there is no need to read back from the GPU to the CPU. The proposed method was implemented on a PC with an NVIDIA GeForce 9800 GT graphics card and applied to two objects consisting of 10,000 triangles in total at an image-space resolution of 400 × 400 pixels. As a result, the collision detection time was about 24 ms with 96% detection.
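The abstract does not give the collision test itself, so the following is only a minimal CPU sketch of one common way an LDI can be used for it, under stated assumptions: each pixel stores depth samples tagged with an object id and a front/back-face flag (the `Layer` tuple and `collisions_along_ray` are hypothetical names), and walking the samples in depth order reveals overlapping volumes. It does not reproduce the paper's modified depth peeling or its use of transform feedback.

```python
from typing import List, Set, Tuple

# One LDI sample along a pixel's view ray: (depth, object_id, is_front_face).
# This layout is an assumption made for illustration only.
Layer = Tuple[float, str, bool]


def collisions_along_ray(layers: List[Layer]) -> Set[Tuple[str, str]]:
    """Return the object pairs whose volumes overlap along one pixel ray.

    A pair (obj, obj) denotes a self-collision: the ray enters an object
    again before it has left it.
    """
    inside = {}  # object_id -> number of surfaces of that object entered but not exited
    hits = set()
    for _depth, obj, is_front in sorted(layers):
        if is_front:
            # Entering obj: it overlaps every volume the ray is currently inside.
            for other, count in inside.items():
                if count > 0:
                    hits.add(tuple(sorted((obj, other))))
            inside[obj] = inside.get(obj, 0) + 1
        else:
            # Leaving obj.
            inside[obj] = inside.get(obj, 0) - 1
    return hits
```

In an image-space method this walk would run once per pixel, which is why the resolution (400 × 400 in the abstract) bounds both cost and detection accuracy.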
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip will not be practical due to slowing growth in transistor density, low chip yields, and photoreticle limitations. To maintain performance scalability, proposals exist to aggregate discrete GPUs into a larger virtual GPU and decompose a single GPU into multiple-chip-modules with increased aggregate die area. These approaches introduce non-uniform memory access (NUMA) effects and lead to decreased performance and energy-efficiency if not managed appropriately. To overcome these effects, we propose a holistic Locality-Aware Data Management (LADM) system designed to operate on massive logical GPUs composed of multiple discrete devices, which are themselves composed of chiplets. LADM has three key components: a threadblock-centric index analysis, a runtime system that performs data placement and threadblock scheduling, and an adaptive cache insertion policy. The runtime combines information from the static analysis with topology information to proactively optimize data placement, threadblock scheduling, and remote data caching, minimizing off-chip traffic. Compared to state-of-the-art multi-GPU scheduling, LADM reduces inter-chip memory traffic by 4× and improves system performance by 1.8× on a future multi-GPU system.
GPGPU Programming for Dipolar Field Calculation Pricop, Sebastian; Ababei, Răzvan Vasile
BULETINUL INSTITUTULUI POLITEHNIC DIN IAȘI. Secția Matematica. Mecanică Teoretică. Fizică,
01/2024, Volume 70, Issue 1
Journal Article
Peer reviewed
Open access
Accelerating computational processes is paramount in numerical infrastructure development, particularly in applications such as the finite element method (FEM) and extensive calculations for simulating 3D processes in materials. In this work, we introduce a novel technique for computing the magnetostatic field of an ellipsoid particle, leveraging CUDA on a graphics card for parallel processing. The implementation on a GPU resulted in a remarkable 20-fold improvement in calculation speed. This achievement not only expedites research tasks, but also enables the exploration of larger and more intricate simulations, facilitating quicker model refinements and deeper insights into material behaviours under various conditions. The utilization of GPU computing aligns with the broader trend in scientific research and engineering, offering a versatile solution for diverse computational challenges beyond this specific task of magnetism. Overall, our work contributes to the ongoing effort to harness high-performance computing (HPC) technologies for accelerated and more efficient simulations in materials science and related fields.
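The abstract does not spell out the field formula or the kernel layout. As a hedged illustration only, the sketch below evaluates the standard point-dipole sum H(r) = (1/4π) Σ_j [3 (m_j·r_j) r_j / |r_j|^5 − m_j / |r_j|^3], with r_j the vector from dipole j to the field point. The function name `dipolar_field` and the discretisation into point dipoles are assumptions, not the authors' implementation. Each field point is independent of the others, which is the property that would let one GPU thread handle one point.

```python
import math


def dipolar_field(point, positions, moments):
    """Field at `point` from point dipoles (direct summation, SI form without units)."""
    hx = hy = hz = 0.0
    for (px, py, pz), (mx, my, mz) in zip(positions, moments):
        rx, ry, rz = point[0] - px, point[1] - py, point[2] - pz
        r2 = rx * rx + ry * ry + rz * rz
        if r2 == 0.0:
            continue  # skip the singular self-term
        r = math.sqrt(r2)
        inv_r3 = 1.0 / (r2 * r)
        # 3 (m . r) / r^5
        f = 3.0 * (mx * rx + my * ry + mz * rz) / (r2 * r2 * r)
        hx += f * rx - mx * inv_r3
        hy += f * ry - my * inv_r3
        hz += f * rz - mz * inv_r3
    k = 1.0 / (4.0 * math.pi)
    return (k * hx, k * hy, k * hz)
```

For N dipoles and N field points the cost is O(N²), which is the kind of workload where a GPU port can plausibly give a speedup of the order reported.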
The Finite-Difference Time-Domain (FDTD) method is a popular numerical modelling technique in computational electromagnetics. The volumetric nature of the FDTD technique means simulations often require extensive computational resources (both processing time and memory). The simulation of Ground Penetrating Radar (GPR) is one such challenge, where the GPR transducer, subsurface/structure, and targets must all be included in the model, and must all be adequately discretised. Additionally, forward simulations of GPR can necessitate hundreds of models with different geometries (A-scans) to be executed. This is exacerbated by an order of magnitude when solving the inverse GPR problem or when using forward models to train machine learning algorithms.
We have developed one of the first open source GPU-accelerated FDTD solvers specifically focused on modelling GPR. We designed optimal kernels for GPU execution using NVIDIA’s CUDA framework. Our GPU solver achieved performance throughputs of up to 1194 Mcells/s and 3405 Mcells/s on NVIDIA Kepler and Pascal architectures, respectively. This is up to 30 times faster than the parallelised (OpenMP) CPU solver can achieve on a commonly-used desktop CPU (Intel Core i7-4790K). We found the cost–performance benefit of the NVIDIA GeForce-series Pascal-based GPUs – targeted towards the gaming market – to be especially notable, potentially allowing many individuals to benefit from this work using commodity workstations. We also note that the equivalent Tesla-series P100 GPU – targeted towards data-centre usage – demonstrates significant overall performance advantages due to its use of high-bandwidth memory. The performance benefits of our GPU-accelerated solver were demonstrated in a GPR environment by running a large-scale, realistic (including dispersive media, rough surface topography, and detailed antenna model) simulation of a buried anti-personnel landmine scenario.
Program Title: gprMax
Program Files doi:http://dx.doi.org/10.17632/kjjm4z87nj.1
Licensing provisions: GPLv3
Programming language: Python, Cython, CUDA
Journal reference of previous version: Comput. Phys. Comm., 209 (2016), 163–170
Does the new version supersede the previous version?: Yes
Reasons for the new version: Performance improvements due to implementation of CUDA-based GPU engine
Summary of revisions: An FDTD solver has been written in CUDA for execution on NVIDIA GPUs. This is in addition to the existing FDTD solver which has been parallelised using Cython/OpenMP for running on CPUs.
Nature of problem: Classical electrodynamics
Solution method: Finite-Difference Time-Domain (FDTD)
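To make the FDTD solution method concrete, here is a minimal one-dimensional leapfrog update in normalised units. It is a sketch under stated assumptions (free space, Courant factor 0.5, a soft Gaussian source; `fdtd_1d` and its parameters are hypothetical) and is not code from gprMax, which solves the full 3D problem. Every cell is updated from its immediate neighbours only, which is why the method maps well onto one GPU thread per cell and why throughput is quoted in cells per second.

```python
import math


def fdtd_1d(n_cells=200, n_steps=20, source_cell=100):
    """Run a 1D free-space FDTD simulation and return the (ex, hy) field arrays."""
    ex = [0.0] * n_cells
    hy = [0.0] * n_cells
    for step in range(n_steps):
        # Update the electric field from the spatial difference of the magnetic field.
        for k in range(1, n_cells):
            ex[k] += 0.5 * (hy[k - 1] - hy[k])
        # Soft source: add a Gaussian pulse at one cell.
        ex[source_cell] += math.exp(-0.5 * ((step - 10) / 3.0) ** 2)
        # Update the magnetic field from the spatial difference of the electric field.
        for k in range(n_cells - 1):
            hy[k] += 0.5 * (ex[k] - ex[k + 1])
    return ex, hy
```

A throughput figure in Mcells/s corresponds to `n_cells * n_steps / elapsed_seconds / 1e6` for a run like this one.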
We introduce OpenRAND, a C++17 library aimed at facilitating reproducible scientific research by generating statistically robust yet replicable random numbers in as little as two lines of code, overcoming some of the unnecessary complexities of existing RNG libraries. OpenRAND accommodates single and multi-threaded applications on CPUs and GPUs and offers a simplified, user-friendly API that complies with the C++ standard’s random number engine interface. It is lightweight, provided as a portable, header-only library. It is statistically robust: a suite of built-in tests ensures no pattern exists within single or multiple streams. Despite its simplicity and portability, it remains performant, matching and sometimes outperforming native libraries. Our tests, including a Brownian walk simulation, affirm its reproducibility and ease of use while highlighting its computational efficiency, outperforming CUDA’s cuRAND by up to 1.8 times.
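The abstract does not describe how the generators work internally. One common route to replicable parallel streams, assumed here purely for illustration, is to derive every draw from a (seed, stream, index) triple with no shared state. The sketch below uses a BLAKE2b hash as a stand-in mixing function; `uniform` and `walk_position` are hypothetical names and this is not OpenRAND's API or algorithm.

```python
import hashlib
import struct


def uniform(seed: int, stream: int, index: int) -> float:
    """Stateless draw in [0, 1) determined only by (seed, stream, index)."""
    key = struct.pack("<QQQ", seed, stream, index)
    digest = hashlib.blake2b(key, digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    # Keep the top 53 bits so the result is an exactly representable double.
    return (value >> 11) / float(1 << 53)


def walk_position(seed: int, walker: int, n_steps: int) -> float:
    """1D random walk for one walker; each walker uses its own stream."""
    position = 0.0
    for i in range(n_steps):
        position += uniform(seed, walker, i) - 0.5
    return position
```

Because no walker reads or advances another walker's state, the result for each walker is the same whatever order, thread, or device it is computed on, which is the reproducibility property the abstract emphasises.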
Accel-sim Khairy, Mahmoud; Shen, Zhesheng; Aamodt, Tor M. ...
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA),
05/2020
Conference Proceeding
Open access
In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true.
To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process.
We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprising 11,440 kernel instances, from the machine learning benchmark suite Deepbench. Deepbench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs.
Finally, to highlight the effects of falling behind industry, this paper presents two case-studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
•A 3D GPGPU-parallelised FDEM is employed for modelling the impact fracture of glass.
•A cohesive fracture model accounting for the rupture of glass is implemented into the FDEM.
•The applicability of FDEM for analysing glass impact fracture mechanism is demonstrated through validated examples.
•The influence of impact velocity, boundary condition and projectile nose shape on the fracture of glass is investigated.
Due to the brittleness of glass and its wide use in modern engineering applications, its vulnerability to impact actions and the corresponding fracture behaviour have attracted growing attention from academics and engineers. In this study, impact fracture responses of glass have been modelled and simulated using a 3D GPGPU-parallelised hybrid finite-discrete element method, i.e., the FDEM. Glass is discretised into discrete elements where finite element formulation is incorporated, enabling accurate predictions of contact forces and structural deformation. A cohesive fracture model accounting for the rupture of glass is implemented, and numerical examples are presented and validated against results from the literature. The influence of impact velocity, boundary condition and projectile nose shape on the fracture of glass has been investigated. It is found that: (i) the fracture pattern changes with the impact velocity; (ii) a rigid boundary support can be used provided no damage occurs at the edge of the glass; (iii) under the same circumstances, a larger contact surface results in more severe damage. The GPGPU-parallelised FDEM provides a practical, efficient and robust computational approach for analysing the transient dynamic behaviour of glass under impact in 3D.
GPU-Based Concurrent Static Learning Liang, Huaxiao; Lin, Xiaoze; Lai, Liyang ...
2023 IEEE International Test Conference (ITC),
2023-Oct.-7
Conference Proceeding
Static learning is a learning algorithm for retrieving implicit logical relationships between nodes in a netlist. The learning results play an important role in improving automatic test pattern generation (ATPG), such as increasing fault coverage and reducing pattern count. In this work, we study accelerating static learning on graphics processing units (GPUs). By tailoring to the architectural features of GPUs, an algorithm of concurrent static learning is proposed. Multiple learning jobs are carried out simultaneously or concurrently in the same netlist. Moreover, the forward and backward implications of these concurrent jobs are processed as a whole, which leads to better utilization of the computing resources on GPUs. Experiments show that the algorithm can achieve up to 253x speedup against a single-threaded commercial tool and is about 1.8 times better than existing GPU-based solutions.
This study introduces the three-dimensional combined finite-discrete element method (3D FDEM) to perform cracking analysis of segmental linings under extreme conditions. Considering the complex contact interactions of segments, this study first proposes a polar-based GPGPU-parallelized contact detection algorithm to handle memory issues confronted by existing FDEM algorithms. Initially, a spatial decomposition approach based on the polar coordinate system is implemented during the broad search phase. Each tetrahedral element is positioned within suitable search cells according to its axis-aligned bounding box. Subsequently, element pairs within each search cell are iteratively traversed, and potential contact pairs are identified using the judge cell criteria. After the broad search, a narrow search phase is executed to determine all real contacts. Following the above implementation, a load-structure model encompassing soil spring and external pressure calculations is then proposed. Based on the shape functions and coordinate transformation, the formulations of soil springs and external loads are derived and parallelized in the 3D FDEM framework. Three numerical tests are presented to validate the effectiveness of the proposed approach. Simulation results confirm the suitability of the proposed method for cracking analysis of segmental linings. Compared to the existing methods, it reduces GPU memory usage by 56–76% without increasing computation time, which enhances the computational scale of 3D FDEM simulations. Furthermore, the proposed method is applied to two engineering scenarios, i.e., straight and curved segmental linings, both considering the absence and presence of contact defects between segments. The results show that contact defects can significantly reduce the resistance of the structural system comprising bolts and concrete segments for straight and curved scenarios.
In recent years, various methods have been developed to estimate high-resolution solar potential in urban areas by simulating solar irradiance over surface models that originate from remote sensing data. In general, this requires discretisation of solar irradiance models that estimate direct, reflective, and diffuse irradiances. The latter is most accurately estimated by an anisotropic model, where the hemispherical sky dome from an arbitrary surface’s viewpoint consists of the horizon, circumsolar, and sky regions. Such a model can be modified to incorporate the effects of shadowing from obstructions with a view factor for each sky region. However, state-of-the-art methods that use such models for estimating solar potential in urban areas consider only the sky view factor, and not the circumsolar view factor, due to the high computational load. In this paper, a novel parallelisation of solar potential estimation is proposed by using General Purpose computing on Graphics Processing Units (GPGPU). A modified anisotropic Perez model is used, considering diffuse shadowing with all three sky view factors. Moreover, we provide validation based on a sensitivity analysis of the method’s accuracy against independent meteorological measurements, by changing the circumsolar sky region’s half-angle and the resolution of the hemispherical sky dome. The presented GPGPU method was compared to a multithreaded Central Processing Unit (CPU) approach, achieving a 70x computational speedup on average. Finally, the proposed method was applied over a ∼21 km² urban area, obtained from Light Detection And Ranging (LiDAR) data, where the computation of solar potential was performed in a reasonable time.
•A new method for high-resolution solar potential estimation using GPGPU.
•Novel incorporation of the modified Perez anisotropic diffuse irradiance model.
•The method considers sky, circumsolar, and horizon view factors with shadowing.
•In comparison to a multithreaded CPU approach, a 70x speedup can be achieved.
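As a rough illustration of what a view factor over a discretised sky dome involves, the sketch below computes the plain isotropic sky view factor of a horizontal surface, SVF = (1/π) ∫ cos θ sin θ dθ dφ over the unobstructed part of the hemisphere. This is not the modified Perez model from the paper. The function name `sky_view_factor`, the per-azimuth horizon elevation input, and the midpoint discretisation are all assumptions. Since each surface cell is evaluated independently, the same loop is a natural fit for one GPU thread per cell.

```python
import math


def sky_view_factor(horizon_deg, n_zenith=90):
    """Sky view factor of a horizontal surface.

    horizon_deg: obstruction elevation angle in degrees for each azimuth sector
    (hypothetical input; 0 means an unobstructed horizon in that direction).
    """
    n_azimuth = len(horizon_deg)
    d_zen = (math.pi / 2.0) / n_zenith
    d_az = 2.0 * math.pi / n_azimuth
    total = 0.0
    for a in range(n_azimuth):
        for z in range(n_zenith):
            zen = (z + 0.5) * d_zen  # zenith angle at the patch centre
            elevation_deg = 90.0 - math.degrees(zen)
            if elevation_deg > horizon_deg[a]:
                # Visible patch: cosine-weighted solid angle.
                total += math.cos(zen) * math.sin(zen) * d_zen * d_az
    return total / math.pi
```

With a uniform horizon at elevation h the analytic value is cos²(h), which gives a quick sanity check on the chosen dome resolution.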