The Alternating Direction Method of Multipliers (ADMM) has proven to be a useful alternative to the popular gradient-based optimizers and has been successfully applied to training DNN models. However, existing ADMM-based approaches generally do not achieve a good trade-off between rapid convergence and fast training, nor do they support parallel DNN training with multiple GPUs. These drawbacks seriously hinder them from effectively training DNN models on modern GPU computing platforms, which are typically equipped with multiple GPUs. In this paper, we propose pdlADMM, which can effectively train DNNs in a data-parallel manner. The key insight of pdlADMM is that it derives efficient solutions for each sub-problem by comprehensively considering three main factors: computational complexity, convergence, and suitability for parallel computing. As the number of GPUs grows, pdlADMM maintains rapid convergence while the computational load on each GPU tends to decline. Extensive experiments demonstrate the effectiveness of our proposal: compared to two other state-of-the-art ADMM-based approaches, pdlADMM converges significantly faster, obtains better accuracy, and achieves very competitive training speed at the same time.
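A minimal sketch of the consensus step that a data-parallel ADMM scheme of this kind needs, assuming per-GPU weight blocks and a scaled dual variable; the sub-problem solvers that form pdlADMM's actual contribution are not shown, and all names (w_d, u_d, z_d) are illustrative.

```cpp
// Illustrative sketch only (not pdlADMM's solver): the consensus update of
// data-parallel ADMM. Each GPU g holds a local weight block w_g and a scaled
// dual u_g; the global variable z is their average and is broadcast back so
// each GPU can solve its next local sub-problem against a common z.
#include <cuda_runtime.h>
#include <vector>

void consensus_update(const std::vector<float*>& w_d,  // per-GPU weights (device)
                      const std::vector<float*>& u_d,  // per-GPU duals (device)
                      const std::vector<float*>& z_d,  // per-GPU copy of z (device)
                      int n, int num_gpus) {
    std::vector<float> z(n, 0.0f), buf(n);
    // z-update: average (w_g + u_g) over all GPUs on the host.
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        cudaMemcpy(buf.data(), w_d[g], n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) z[i] += buf[i] / num_gpus;
        cudaMemcpy(buf.data(), u_d[g], n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) z[i] += buf[i] / num_gpus;
    }
    // Broadcast z to every GPU for the next round of local sub-problems.
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        cudaMemcpy(z_d[g], z.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    }
}
```

In practice the averaging would likely go through NCCL or peer-to-peer copies rather than the host, but the round trip shown is the logical step that keeps all GPUs consistent between sub-problem solves.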
This paper describes the most advanced results obtained in the context of fluid dynamic simulations of high-enthalpy flows using detailed state-to-state air kinetics. Thermochemical non-equilibrium, typical of supersonic and hypersonic flows, was modeled using both the accurate state-to-state approach and the multi-temperature model proposed by Park. The accuracy of the two thermochemical non-equilibrium models was assessed by comparing their results with experimental findings, with the state-to-state approach providing better predictions. To overcome the huge computational cost of the state-to-state model, a multi-node GPU implementation based on an MPI-CUDA approach was employed, and a comprehensive code performance analysis is presented. Both the pure MPI-CPU and the MPI-CUDA implementations exhibit excellent scalability. GPUs outperform CPUs especially when the state-to-state approach is employed, showing speed-ups of a single GPU over a single CPU core larger than 100 for both one MPI process and multiple MPI processes.
This article presents a solution to path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across up to 16 GPUs with minimal effect on performance. The key concept is that the parts of the scene with the highest number of memory accesses are replicated on all GPUs.
We propose two methods for maximizing the performance of path tracing when working with partially distributed scene data. Both methods work at the memory-management level, so path tracer data structures do not have to be redesigned, making our approach applicable to other path tracers with only minor changes to their code. As a proof of concept, we have enhanced the open-source Blender Cycles path tracer.
The approach was validated on scenes of sizes up to 169 GB. We show that only 1–5% of the scene data needs to be replicated to all machines for such large scenes. On smaller scenes we have verified that the performance is very close to rendering a fully replicated scene. In terms of scalability we have achieved a parallel efficiency of over 94% using up to 16 GPUs.
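One hypothetical way to realize this replicate-the-hot-parts policy purely at the memory-management level, consistent with the claim that the path tracer's data structures stay untouched, is CUDA unified memory with access hints; the sketch below is an illustration under that assumption, not the actual Cycles code, and the buffer names are invented.

```cpp
// Sketch of hot/cold scene placement with unified-memory hints (illustrative,
// not the actual Cycles implementation). Frequently accessed data (e.g. BVH
// nodes) is marked read-mostly so the driver replicates it on every GPU that
// reads it; bulk data (e.g. large textures) is pinned to one owning GPU and
// accessed remotely by the others.
#include <cuda_runtime.h>

void place_scene(void** hot, size_t hot_bytes,
                 void** cold, size_t cold_bytes,
                 int owner_gpu, int num_gpus) {
    cudaMallocManaged(hot, hot_bytes);
    cudaMallocManaged(cold, cold_bytes);

    // Read-mostly: each GPU gets its own cached copy on first access.
    cudaMemAdvise(*hot, hot_bytes, cudaMemAdviseSetReadMostly, 0);

    // Cold data lives on one GPU; other GPUs reach it over NVLink/PCIe.
    cudaMemAdvise(*cold, cold_bytes,
                  cudaMemAdviseSetPreferredLocation, owner_gpu);
    for (int g = 0; g < num_gpus; ++g)
        cudaMemAdvise(*cold, cold_bytes, cudaMemAdviseSetAccessedBy, g);
}
```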
•A new multi-GPU-based spectral element formulation for ultrasonic wave propagation.
•Novel direct GPU-to-GPU message exchange strategies enabled by CUDA-aware MPI.
•Significant accelerations over a multi-CPU core-based counterpart formulation.
•Achievable accelerations and model sizes scalable with the number of GPUs.
•Practicality shown through simulating real-life ultrasonic inspection scenarios.
In this paper, we introduce a new multi-GPU-based spectral element (SE) formulation for simulating ultrasonic wave propagation in solids. To maximize communication efficiency, we developed, based on CUDA-aware MPI, two novel message exchange strategies that allow the common nodal forces of different subdomains to be shared between GPUs directly, rather than via CPU hosts, during central-difference-based time integration steps. The new multi-GPU, CUDA-aware MPI-based formulation is benchmarked against a multi-CPU core, classical MPI-based counterpart, demonstrating a remarkable acceleration in every stage of the computation of ultrasonic wave propagation, namely matrix assembly, time integration, and message exchange. More importantly, both the computational efficiency and the degree-of-freedom limit of the new formulation scale with the number of GPUs used, potentially allowing larger structures to be computed and higher computational speeds to be realized. Finally, the new formulation was used to simulate the interaction between Lamb waves and randomly shaped thickness-loss defects in plates, showing its potential to become an efficient, accurate, and robust technique for addressing the propagation of ultrasonic waves in realistic engineering structures.
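The direct GPU-to-GPU sharing the authors describe rests on CUDA-aware MPI accepting device pointers in ordinary MPI calls; a minimal halo-style exchange of shared nodal forces might look like the sketch below (buffer names, tags, and neighbor ranks are illustrative, not the paper's code).

```cpp
// Sketch of a direct GPU-to-GPU nodal-force exchange with CUDA-aware MPI
// (illustrative names; assumes an MPI build that accepts device pointers).
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_shared_forces(float* d_send, float* d_recv, int count,
                            int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];
    // Device pointers go directly into MPI; no cudaMemcpy to a host staging
    // buffer is needed, which is the point of the CUDA-aware path.
    MPI_Irecv(d_recv,         count, MPI_FLOAT, left,  0, comm, &reqs[0]);
    MPI_Irecv(d_recv + count, count, MPI_FLOAT, right, 1, comm, &reqs[1]);
    MPI_Isend(d_send,         count, MPI_FLOAT, left,  1, comm, &reqs[2]);
    MPI_Isend(d_send + count, count, MPI_FLOAT, right, 0, comm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}
```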
We present a multi-GPU design, implementation, and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data-parallelism approach and uses partitioning methods to distribute the workload of FV primitives evenly across the available GPUs. The design addresses the space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures and includes efficient GPU-to-GPU data exchange protocols. Moreover, it is user-friendly, as no user intervention is required for task decomposition, scheduling, or load balancing. We implement and evaluate the performance of our design on two homogeneous and heterogeneous NVIDIA GPU clusters: K80, and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation using two homomorphic circuits as workloads: vector addition and multiplication. Moreover, we use our multi-GPU levelled FHE to implement the inference circuits of two Convolutional Neural Networks (CNNs) to perform image classification homomorphically on encrypted images from the MNIST and CIFAR-10 datasets. Our implementation provides 1 to 3 orders of magnitude speedup over the CPU implementation on vector operations. In terms of scalability, our design shows reasonable scalability curves when the GPUs are fully connected.
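Reduced to its simplest form, the even data-parallel partitioning can be pictured as follows: a batch of ciphertext polynomials is split into equal chunks, and the same primitive runs on each GPU's chunk concurrently. The placeholder he_add kernel below stands in for a real FV primitive and assumes the modulus q is below 2^63; none of this is the paper's actual implementation.

```cpp
// Sketch of even workload partitioning across GPUs (illustrative): a batch is
// split into one chunk per GPU and a placeholder homomorphic-addition kernel
// runs on each chunk; kernel launches are asynchronous, so GPUs work in parallel.
#include <cuda_runtime.h>

// Placeholder: a real FV ciphertext is a set of RNS polynomials, not one array.
__global__ void he_add(const unsigned long long* a, const unsigned long long* b,
                       unsigned long long* c, unsigned long long q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = (a[i] + b[i]) % q;  // coefficient-wise add; assumes q < 2^63
}

void batched_he_add(unsigned long long** a, unsigned long long** b,
                    unsigned long long** c, unsigned long long q,
                    int n_per_gpu, int num_gpus) {
    for (int g = 0; g < num_gpus; ++g) {   // one even chunk per GPU
        cudaSetDevice(g);
        int threads = 256, blocks = (n_per_gpu + threads - 1) / threads;
        he_add<<<blocks, threads>>>(a[g], b[g], c[g], q, n_per_gpu);
    }
    for (int g = 0; g < num_gpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```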
To meet the rapidly growing computation requirements in the big data and artificial intelligence areas, CPU-GPU heterogeneous clusters can provide more powerful computing capacity than CPU clusters. The highly parallel computing capabilities of GPUs greatly accelerate computation-intensive applications, and the number of GPUs per computing node is scalable, which greatly improves the computing capacity of a cluster of limited size. However, effective load-balancing scheduling models for multi-GPU hardware environments are lacking. This article proposes AEML, an acceleration engine for multi-GPU load balancing in distributed heterogeneous environments. AEML can effectively integrate GPUs into a distributed processing framework and achieve good load balance among multiple heterogeneous GPUs. We propose a heterogeneous task execution model based on multiple GPUs and multiple streams (MGMS), which can effectively balance the workload across GPUs. The MGMS model utilizes four core techniques: a fine-grained task mapping mechanism, a unified device resource management scheme, a novel resource-aware GPU task scheduling strategy, and a feedback-based stream adjustment scheme. The AEML system is implemented on Spark 3.0.0 and NVIDIA CUDA 10.0. We comprehensively evaluate the performance of AEML with multiple typical benchmarks. Experimental results show that AEML can fully exploit the computing power of GPUs and achieve good load balance among multiple heterogeneous GPUs.
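To give a flavor of the multiple-GPU, multiple-stream execution model, the sketch below deals independent tasks round-robin onto (GPU, stream) slots so their kernels can overlap; AEML's Spark integration, resource-aware scheduling, and feedback adjustment are deliberately not modeled, and all names are illustrative.

```cpp
// Illustrative multi-GPU, multi-stream task mapping in the spirit of MGMS.
// Assumes each task's device buffer was allocated on the GPU its round-robin
// slot maps to (the deterministic mapping makes that possible ahead of time).
#include <cuda_runtime.h>
#include <vector>

struct Task { float* d_data; int n; };

__global__ void process(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;  // stand-in for a real workload
}

void run_tasks(std::vector<Task>& tasks, int num_gpus, int streams_per_gpu) {
    std::vector<cudaStream_t> streams(num_gpus * streams_per_gpu);
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        for (int s = 0; s < streams_per_gpu; ++s)
            cudaStreamCreate(&streams[g * streams_per_gpu + s]);
    }
    for (size_t t = 0; t < tasks.size(); ++t) {
        int slot = t % streams.size();   // round-robin (gpu, stream) slot
        int g = slot / streams_per_gpu;
        cudaSetDevice(g);
        int threads = 256, blocks = (tasks[t].n + threads - 1) / threads;
        process<<<blocks, threads, 0, streams[slot]>>>(tasks[t].d_data, tasks[t].n);
    }
    for (int g = 0; g < num_gpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```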
The acceleration of CFD solvers on GPUs allows for large speedups, which enable more complex and detailed simulations. However, not all numerical methods can be efficiently executed in a massively parallel fashion. The Runge–Kutta (RK) and Lower-Upper Symmetric Gauss–Seidel (LU-SGS) time-integration methods are widely used on GPUs and multicore CPUs, respectively; however, the RK method suffers from low convergence speed, while the LU-SGS method is not suited to a many-core environment. In this paper, we propose to accelerate the matrix-free version of the Data-Parallel Lower-Upper Relaxation (DP-LUR) time-integration method in the unstructured solver FaSTAR, developed by JAXA. The DP-LUR method outperformed the RK and LU-SGS methods on an NVIDIA Tesla V100 GPU when tested on the ONERA M6 and NASA CRM geometries, reaching up to 3.83% of the peak performance of one device.
•DP-LUR time-integration method for unstructured meshes is accelerated on GPU.
•The DP-LUR method is faster than LU-SGS and Runge–Kutta on single and multiple GPUs.
•The DP-LUR method is slower than LU-SGS and faster than Runge–Kutta on CPU.
•The DP-LUR method on single or multiple GPUs is faster than LU-SGS on CPU and Runge–Kutta on single or multiple GPUs.
•The DP-LUR method reaches 3.83% of single-GPU peak performance.
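What makes DP-LUR GPU-friendly is that, unlike LU-SGS's sequential sweeps, each relaxation sweep reads neighbor values only from the previous sweep, so every cell updates independently. Below is a scalar, assumption-heavy sketch of that pattern; FaSTAR's matrix-free block Jacobians are not reproduced, and all names are invented.

```cpp
// Jacobi-style point relaxation in the spirit of DP-LUR (illustrative).
// A real solver relaxes 5x5 block systems with on-the-fly flux Jacobians;
// here a scalar system with CSR-style adjacency stands in for that.
#include <cuda_runtime.h>

__global__ void dplur_sweep(const float* d_inv,                 // 1/diagonal per cell
                            const float* off,                   // off-diagonal coeffs
                            const int* nbr, const int* nbr_off, // CSR adjacency
                            const float* rhs, const float* dq_old,
                            float* dq_new, int n_cells) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_cells) return;
    float acc = rhs[c];
    for (int k = nbr_off[c]; k < nbr_off[c + 1]; ++k)
        acc -= off[k] * dq_old[nbr[k]];   // neighbors from the PREVIOUS sweep
    dq_new[c] = d_inv[c] * acc;           // independent per-cell update
}

// Host side: a fixed, small number of sweeps with buffer ping-pong.
void dplur(const float* d_inv, const float* off, const int* nbr,
           const int* nbr_off, const float* rhs,
           float* dq_a, float* dq_b, int n_cells, int n_sweeps) {
    int threads = 256, blocks = (n_cells + threads - 1) / threads;
    for (int s = 0; s < n_sweeps; ++s) {
        dplur_sweep<<<blocks, threads>>>(d_inv, off, nbr, nbr_off,
                                         rhs, dq_a, dq_b, n_cells);
        float* t = dq_a; dq_a = dq_b; dq_b = t;  // swap old/new
    }
}
```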
GPU computing can shorten the execution time of applications previously considered slow by exploiting parallel programming, and using multiple GPUs makes it possible to increase the performance of such parallel programs further. However, this does not hold for every program. One application that requires parallel programming is image denoising. To determine whether using multiple GPUs can increase the performance of an image denoising program, an analysis of how limiting the number of threads affects execution time, power usage, and GPU memory can be performed on a single GPU. If limiting the threads significantly decreases the performance of the image denoising program, then using multiple GPUs becomes a candidate for achieving higher performance. In this study, thread limiting was implemented with a looping process in the kernel of an image denoising program that uses the KNN (K-Nearest Neighbors) and NLM (Non-Local Means) filters. The results show that using fewer than 5% of the threads with the KNN filter and 0.01% of the threads with the NLM filter yields optimal power consumption and execution time. In addition, above 5% thread usage no significant performance change was found for either the KNN or the NLM filter, so the performance gain from adopting a multi-GPU system for the image denoising program would be minimal.
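The kernel-side looping described above matches the standard grid-stride-loop pattern, sketched below with the actual KNN/NLM filter arithmetic replaced by a placeholder; launching a fraction of the pixel count in threads and striding over the image is what allows the thread budget to be varied from 100% down to 0.01%.

```cpp
// Grid-stride loop for thread-limited denoising (illustrative; the real
// KNN/NLM weighted-average math is omitted). The same kernel is correct at
// any thread budget because each thread covers multiple pixels.
#include <cuda_runtime.h>

__global__ void denoise_strided(const float* in, float* out, int n_pixels) {
    int stride = gridDim.x * blockDim.x;  // total threads launched
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_pixels; i += stride)
        out[i] = in[i];                   // placeholder for the KNN/NLM filter
}

// Launch with a thread budget that is a fraction of the pixel count.
void launch_limited(const float* in, float* out, int n_pixels, float fraction) {
    int total = (int)(n_pixels * fraction);
    if (total < 1) total = 1;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    denoise_strided<<<blocks, threads>>>(in, out, n_pixels);
}
```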
A multi-GPGPU development for mesoscale simulations using the Dissipative Particle Dynamics (DPD) method is presented. This distributed GPU acceleration development is an extension of the DL_MESO package to MPI+CUDA, in order to exploit the computational power of the latest NVIDIA cards on hybrid CPU–GPU architectures. Details of the broadly applicable algorithm implementation and memory-coalescing data structures are presented. The key algorithmic optimizations, covering the nearest-neighbour list search for particle pairs in short-range forces, data exchange, and the overlap of computation and communication, are also given. We have carried out strong and weak scaling performance analyses with up to 4096 GPUs. A two-phase mixture separation test case with 1.8 billion particles has been run on the Piz Daint supercomputer at the Swiss National Supercomputing Centre. With CUDA-aware MPI, proper GPU affinity, and communication/computation overlap optimizations in the multi-GPU version, the final optimized code demonstrated more than 94% efficiency for weak scaling and more than 80% efficiency for strong scaling. As far as we know, this is the first report in the literature of DPD simulations run on such a large number of GPUs. The remaining challenges and future work are also discussed at the end of the paper.
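The communication/computation overlap mentioned above typically takes the following shape: interior forces, which need no remote data, are computed while halo particle data is in flight via CUDA-aware MPI, and boundary forces follow once the exchange completes. The sketch uses stand-in kernels and invented names; DL_MESO's real force loops and halo layout differ.

```cpp
// Illustrative overlap of halo exchange and interior force computation.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void forces_interior(float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 1.0f;  // stand-in for pairwise DPD force sums
}
__global__ void forces_boundary(float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 1.0f;  // stand-in: would use freshly received halos
}

void dpd_step(float* d_f, float* d_halo_send, float* d_halo_recv,
              int n_int, int n_bnd, int halo_count,
              int nbr_rank, MPI_Comm comm, cudaStream_t compute) {
    MPI_Request reqs[2];
    // CUDA-aware MPI: exchange halo particles directly from device memory...
    MPI_Irecv(d_halo_recv, halo_count, MPI_FLOAT, nbr_rank, 0, comm, &reqs[0]);
    MPI_Isend(d_halo_send, halo_count, MPI_FLOAT, nbr_rank, 0, comm, &reqs[1]);

    // ...while interior forces, which need no halo data, run concurrently.
    forces_interior<<<(n_int + 255) / 256, 256, 0, compute>>>(d_f, n_int);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  // halos have arrived
    forces_boundary<<<(n_bnd + 255) / 256, 256, 0, compute>>>(d_f, n_bnd);
    cudaStreamSynchronize(compute);
}
```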
Emerging graph neural networks (GNNs) have extended the successes of deep learning techniques from datasets like images and texts to more complex graph-structured data. By leveraging GPU accelerators, existing frameworks combine mini-batching and sampling for effective and efficient model training on large graphs. However, this setup faces a scalability issue, since loading rich vertex features from CPU to GPU through a limited-bandwidth link usually dominates the training cycle. In this article, we propose PaGraph, a novel, efficient data loader that supports general and efficient sampling-based GNN training on a single server with multiple GPUs. PaGraph significantly reduces data loading time by exploiting available GPU resources to keep frequently accessed graph data in a cache. It also embodies a lightweight yet effective caching policy that simultaneously takes into account graph structural information and the data access patterns of sampling-based GNN training. Furthermore, to scale out on multiple GPUs, PaGraph develops a fast GNN-computation-aware partition algorithm to avoid cross-partition access during data-parallel training and achieve better cache efficiency. Finally, it overlaps data loading and GNN computation to further hide loading costs. Evaluations on two representative GNN models, GCN and GraphSAGE, using two sampling methods, Neighbor and Layer-wise, show that PaGraph can eliminate data loading time from the GNN training pipeline and achieve up to a 4.8× performance speedup over state-of-the-art baselines. Together with a preprocessing optimization, PaGraph further delivers up to a 16.0× end-to-end speedup.
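A hypothetical reduction of PaGraph's structure-aware caching idea: treat out-degree as a proxy for how often a vertex is sampled, pin the feature rows of the highest-degree vertices in GPU memory, and keep a host-side slot map for lookups. PaGraph's real policy also folds in the observed access patterns of the sampler; the names below are invented.

```cpp
// Illustrative static feature cache keyed on vertex out-degree.
#include <cuda_runtime.h>
#include <algorithm>
#include <numeric>
#include <vector>

struct FeatureCache {
    float* d_rows = nullptr;   // cached feature rows resident on the GPU
    std::vector<int> slot_of;  // vertex id -> cache slot, -1 if not cached
};

FeatureCache build_cache(const float* h_feats, const std::vector<int>& out_deg,
                         int n_vertices, int feat_dim, int capacity) {
    std::vector<int> order(n_vertices);
    std::iota(order.begin(), order.end(), 0);
    // Rank vertices by out-degree: high-degree vertices are sampled most often.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return out_deg[a] > out_deg[b]; });

    FeatureCache c;
    c.slot_of.assign(n_vertices, -1);
    cudaMalloc(&c.d_rows, (size_t)capacity * feat_dim * sizeof(float));
    // Copy the hottest rows into GPU memory once, before training starts;
    // misses at training time fall back to a CPU-to-GPU fetch.
    for (int s = 0; s < capacity && s < n_vertices; ++s) {
        int v = order[s];
        c.slot_of[v] = s;
        cudaMemcpy(c.d_rows + (size_t)s * feat_dim,
                   h_feats + (size_t)v * feat_dim,
                   feat_dim * sizeof(float), cudaMemcpyHostToDevice);
    }
    return c;
}
```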