The energy efficiency of GPUs has facilitated their use in many complex scientific applications. Nodes combining multiple GPUs with multi-core CPUs are common in today's HPC landscape, giving the flexibility to utilize CPUs, accelerators, or both according to workload characteristics. Since it is not possible to measure power and energy accurately in all cases, an alternative approach is to estimate them using statistical methods. Apart from saving time and money, a reasonable prediction of power/energy enables power-saving optimizations for certain applications without compromising performance. In this paper we employ parametric and non-parametric regression analysis to model the power and energy consumption of some common high-performance kernels (DGEMM, FFT, PRNG, and FD stencils) on a multi-GPU platform. Our experiments show that, using a minimal set of hardware counters and performance attributes, the average error between the measured and predicted values of power and energy is only ~4%.
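The regression approach described above can be sketched with ordinary least squares on a handful of counter readings. The counter names and all numbers below are illustrative assumptions, not the paper's feature set or measurements:

```python
import numpy as np

# Hypothetical normalized hardware-counter readings for six kernel runs.
# Columns: [flop_rate, dram_reads, dram_writes, sm_occupancy]
X = np.array([
    [0.90, 0.30, 0.10, 0.95],
    [0.10, 0.80, 0.60, 0.40],
    [0.50, 0.50, 0.30, 0.70],
    [0.70, 0.20, 0.20, 0.85],
    [0.20, 0.90, 0.70, 0.35],
    [0.40, 0.60, 0.40, 0.60],
])
y = np.array([230.0, 180.0, 200.0, 215.0, 185.0, 195.0])  # measured power (W)

# Parametric model: power ~ b0 + b . counters, fitted by ordinary least squares.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ coef
rel_err = np.abs(pred - y) / y  # per-run relative prediction error
```

A non-parametric alternative (e.g. a regression tree or kernel smoother) would replace the linear fit while keeping the same counter features.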
A multi-GPU based semi-Lagrangian fluid solver Liu, Youquan; Shi, Kai; Deng, Heng ...
Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry
12/2011
Conference Proceeding
This paper presents a multi-GPU platform which solves the incompressible 2D Navier-Stokes equations efficiently. We design a tile structure to distribute the computation domain evenly across multiple GPUs; each GPU solves the equations only in a limited area while maintaining the whole computation area within its own device memory. Even though the semi-Lagrangian method may trace particles across different graphics cards, the memory update does not constitute a bottleneck for overall performance. In our experiments, one, two, and three GTX 260 graphics cards are installed to construct the platform with CUDA 2.3. With a triple-GPU configuration, we achieve about a 270× speedup over the corresponding CPU version (excluding memory copy cost), and this configuration is also about 2.1× faster than the single-GPU version.
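The core of a semi-Lagrangian step, tracing each grid point backwards along the velocity field and interpolating, can be sketched in NumPy. This is a single-process sketch of the general method under assumed unit grid spacing and bilinear interpolation; the paper's tiled multi-GPU distribution is omitted:

```python
import numpy as np

def semi_lagrangian_advect(q, u, v, dt):
    """Advect scalar field q by velocity (u, v) with one semi-Lagrangian step.

    For every grid point, trace a particle backwards along the velocity
    field and sample q at the departure point (bilinear interpolation).
    Grid spacing is assumed to be 1.
    """
    n, m = q.shape
    i, j = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    # Backtrace departure points, clamped to the domain boundary.
    x = np.clip(i - dt * u, 0, n - 1)
    y = np.clip(j - dt * v, 0, m - 1)
    i0, j0 = np.floor(x).astype(int), np.floor(y).astype(int)
    i1, j1 = np.minimum(i0 + 1, n - 1), np.minimum(j0 + 1, m - 1)
    s, t = x - i0, y - j0
    return ((1 - s) * (1 - t) * q[i0, j0] + s * (1 - t) * q[i1, j0]
            + (1 - s) * t * q[i0, j1] + s * t * q[i1, j1])

# Unit velocity in the i-direction moves a spike one cell per step.
q = np.zeros((8, 8)); q[4, 4] = 1.0
u = np.ones((8, 8)); v = np.zeros((8, 8))
q_new = semi_lagrangian_advect(q, u, v, dt=1.0)
```

Because the departure point of a cell near a tile edge may lie in a neighboring tile, a multi-GPU version must keep a halo of neighboring data resident, which is why the abstract stresses that the cross-card memory update is not the bottleneck.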
Markovian models can generate very large sparse matrices, which are difficult to store and solve. A useful method for finding transient probabilities in Markovian models is uniformization. The aim of this paper is to show that the performance of uniformization can be improved using a multi-GPU architecture. We propose a partitioning scheme for the HYB sparse matrix storage format, together with optimization techniques adjusted to minimize communication between GPUs during the iterative sparse matrix-vector multiplication, which is the most time-consuming step. The results of our experiments show that on multiple GPUs we can solve larger matrices than on a single device and accelerate computations in comparison to a multithreaded CPU. Computational tests have been carried out in double precision for wireless network models. Using multiple GPUs we were able to solve a model described by a matrix of size 3.6×10^7.
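Uniformization itself can be sketched in a few lines: given a CTMC generator Q, choose a rate Λ ≥ max|q_ii|, set P = I + Q/Λ, and sum Poisson-weighted vector-matrix products. The dense NumPy loop below stands in for the paper's HYB-partitioned multi-GPU SpMV, and the two-state model is invented for illustration:

```python
import numpy as np
from math import exp

def transient_probs(Q, pi0, t, K=80):
    """Transient distribution pi(t) of a CTMC via uniformization:
    pi(t) = sum_k Poisson(k; Lambda*t) * pi0 @ P^k,  P = I + Q/Lambda.
    The repeated v @ P is the sparse matrix-vector product that the
    paper distributes across GPUs; K truncates the Poisson sum.
    """
    Lam = -Q.diagonal().min()            # uniformization rate >= max |q_ii|
    P = np.eye(Q.shape[0]) + Q / Lam
    v = pi0.copy()
    out = np.zeros_like(pi0)
    w = exp(-Lam * t)                    # Poisson weight for k = 0
    for k in range(K + 1):
        out += w * v
        v = v @ P                        # the SpMV step
        w *= Lam * t / (k + 1)           # Poisson recurrence for next weight
    return out

# Two-state birth-death chain: rate 1.0 for 0->1, rate 2.0 for 1->0.
Q = np.array([[-1.0, 1.0],
              [ 2.0, -2.0]])
pi = transient_probs(Q, np.array([1.0, 0.0]), t=5.0)
```

For this chain the stationary distribution is (2/3, 1/3), and by t = 5 the transient term has decayed to numerical noise, so the computed vector is effectively stationary.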
Support Vector Machine (SVM) is one of the most popular tools for solving general classification and regression problems because of its high predictive accuracy. However, the training phase of a nonlinear-kernel SVM is computationally expensive, especially for large datasets. In this paper, we propose an intelligent system for solving large classification problems based on a parallel SVM. The system utilizes the latest powerful GPU devices to improve the speed of the SVM training and predicting phases. The memory constraints imposed by large datasets are addressed through either data reduction or data chunking techniques. The complete system comprises multiple executable modules, all managed through a main script, which reduces implementation difficulty and offers platform portability. Empirical results show that our system achieves an order-of-magnitude speedup compared to the classic SVM tool LIBSVM. The speed is further improved to two orders of magnitude by slightly compromising predictive accuracy.
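The data-chunking idea can be illustrated with a toy cascade: train on each chunk, keep only that chunk's margin points (its "support vectors"), and retrain on their union. The minimal Pegasos-style linear SVM below is a stand-in assumption used to make the sketch self-contained; it is not the paper's GPU kernel-SVM implementation:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal Pegasos-style linear SVM trained by sub-gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1]); t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1; eta = 1.0 / (lam * t)
            w *= (1 - eta * lam)                 # regularization shrink
            if y[i] * (X[i] @ w) < 1:            # hinge-loss violator
                w += eta * y[i] * X[i]
    return w

def cascade_train(X, y, n_chunks=4):
    """Chunking: fit each chunk, keep its margin points, retrain on the union."""
    keep_X, keep_y = [], []
    for Xc, yc in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)):
        w = train_linear_svm(Xc, yc)
        mask = yc * (Xc @ w) < 1.0 + 1e-9        # margin points of this chunk
        if not mask.any():
            mask = np.ones(len(yc), bool)        # safeguard: keep whole chunk
        keep_X.append(Xc[mask]); keep_y.append(yc[mask])
    return train_linear_svm(np.concatenate(keep_X), np.concatenate(keep_y))

# Synthetic, nearly separable data: label by the sign of x1 + x2.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X += 0.05 * rng.normal(size=X.shape)             # small noise near the boundary
w = cascade_train(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

Each chunk fits in memory on its own, and only the (usually small) set of margin points flows to the next stage, which is what makes the technique attractive for datasets larger than device memory.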
Heterogeneous clusters with multiple sockets and multicore processors, accelerated by dedicated coprocessors such as GPUs, the Cell BE, or FPGAs, nowadays provide unrivaled computing power in terms of floating-point operations. The specific capabilities of additional processor technologies enable dedicated exploitation with respect to particular application and data characteristics. However, resource utilization, programmability, and scalability of applications across heterogeneous platforms are major concerns. In the framework of the HiFlow finite element software package we have developed a portable software approach that implements efficient parallel solvers for partial differential equations by means of unified and modular user interfaces across a variety of heterogeneous platforms, in particular on GPU-accelerated clusters. We detail our concept and provide performance analysis for various test scenarios that demonstrate performance capabilities, scalability, viability, and user friendliness.
Heterogeneous system architectures are becoming more and more of a commodity in the scientific community. While it remains challenging to fully exploit such architectures, the benefits in performance and hybrid speed-up, obtained by using a host processor and accelerators in parallel in a non-monolithic manner, are significant. At the same time, energy efficiency is becoming an increasingly critical challenge for future high-performance computing (HPC) systems, which aim to exceed the Exascale barrier with several competing architecture concepts, ranging from high-performance CPUs combined with GPUs acting as floating-point accelerators to computationally weak CPUs paired with dedicated, highly performant FPGA-based accelerators. In this paper, we realize and evaluate a hybrid computing approach based on a two-dimensional seismic streaming algorithm on several heterogeneous system architectures, including conventional HPC approaches based on powerful CPUs and GPUs. Furthermore, we evaluate an embedded system platform claiming to be a "mini supercomputer". Several CPU and accelerator combinations are utilized in a manual work-sharing manner with the aim of achieving significant performance speed-ups, together with a detailed energy-efficiency study. Based on roofline models and experimental evaluations, the paper shows that hybrid computing is, in most cases, unconditionally beneficial for balanced systems regarding both performance and energy efficiency, aiding the programmer in deciding whether costly, manually tuned, homogeneous implementations are worthwhile.
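The roofline model referred to above reduces to a single formula: attainable performance is the minimum of peak compute throughput and memory bandwidth times arithmetic intensity. The device numbers below are illustrative assumptions, not measurements from the paper:

```python
def roofline(peak_gflops, bw_gbs, intensity_flops_per_byte):
    """Attainable performance (GFLOP/s) under the roofline model:
    min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * intensity_flops_per_byte)

# A low-intensity streaming stencil (0.5 flops/byte) on two made-up devices:
gpu = roofline(4000.0, 200.0, 0.5)  # bandwidth-bound: 200 * 0.5 = 100 GFLOP/s
cpu = roofline(500.0, 50.0, 0.5)    # bandwidth-bound:  50 * 0.5 =  25 GFLOP/s
```

At such low arithmetic intensity both devices are bandwidth-bound, so the GPU's large compute advantage shrinks to its bandwidth ratio, which is precisely why balance between host and accelerator matters for hybrid work sharing.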
Range Query Processing in a Multi-GPU Environment Barrientos, R. J.; Gomez, J. I.; Tenllado, C. ...
2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications
2012-July
Conference Proceeding
Similarity search has been widely studied in recent years, as it can be applied to several fields such as content-based search in multimedia objects, text retrieval, and computational biology. These applications usually work on very large databases that are often indexed off-line to accelerate on-line searches. However, to maintain an acceptable throughput, it is essential to exploit the intrinsic parallelism of the algorithms used for on-line query solving, even with indexed databases. Therefore, many strategies have been proposed in the literature to parallelize these algorithms on both shared- and distributed-memory multiprocessor systems. Lately, GPUs have also been used to implement brute-force approaches instead of indexing structures, due to the difficulties the index introduces in efficiently exploiting GPU resources. In this work we propose a multi-GPU metric-space technique that efficiently exploits index data structures for similarity search in large databases, and show how it outperforms previous OpenMP and GPU brute-force strategies. Furthermore, our analysis covers the effects of the database size and its nature.
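A pivot-based metric index of the general kind exploited here can be sketched as follows: distances from every object to a few pivots are precomputed off-line, and at query time the triangle inequality discards candidates without computing their distance to the query. This sequential Python sketch only illustrates the pruning logic, not the paper's multi-GPU scheme:

```python
import numpy as np

def build_pivot_index(db, pivots):
    """Off-line: precompute d(o, p) for every object o and pivot p."""
    return np.array([[np.linalg.norm(o - p) for p in pivots] for o in db])

def range_query(db, index, pivots, q, r):
    """Return objects within distance r of q. A candidate o is pruned when
    |d(q, p) - d(o, p)| > r for some pivot p, since the triangle
    inequality then guarantees d(q, o) > r."""
    dq = np.array([np.linalg.norm(q - p) for p in pivots])
    hits = []
    for o, row in zip(db, index):
        if np.any(np.abs(dq - row) > r):
            continue                       # pruned without computing d(q, o)
        if np.linalg.norm(q - o) <= r:     # exact check for survivors
            hits.append(o)
    return hits

rng = np.random.default_rng(0)
db = rng.random((200, 2))                  # toy database of 2D points
pivots = db[:3]
index = build_pivot_index(db, pivots)
q = np.array([0.5, 0.5]); r = 0.1
hits = range_query(db, index, pivots, q, r)
```

Pruning never discards a true answer, so the result matches a brute-force scan; the GPU-unfriendly part is the irregular, data-dependent control flow this index introduces, which the paper's multi-GPU technique is designed to handle.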
Computing and presenting crowd simulation in real time requires intensive processing, since the behavior and rendering of each entity must be computed. The advent of GPU computing has enabled the development of many strategies for accelerating these simulations. In this paper we propose a multi-GPU architecture for crowd simulation that allows a massive number of entities to be processed and rendered in real time. We also implement a representative case study based on the behavior of a crowd during the street carnival of Rio de Janeiro, on which we run benchmarks and compare the benefits achieved with the presented architecture.
Application programming for GPUs (Graphics Processing Units) is complex and error-prone, because the popular approaches, CUDA and OpenCL, are intrinsically low-level and offer no special support for systems consisting of multiple GPUs. The SkelCL library presented in this paper is built on top of the OpenCL standard and offers pre-implemented recurring computation and communication patterns (skeletons) which greatly simplify programming for multi-GPU systems. The library also provides an abstract vector data type and a high-level data (re)distribution mechanism to shield the programmer from low-level data transfers between the system's main memory and multiple GPUs. In this paper, we focus on SkelCL's specific support for systems with multiple GPUs and use a real-world application study from the area of medical imaging to demonstrate the reduced programming effort and competitive performance of SkelCL compared to OpenCL and CUDA. In addition, we illustrate how SkelCL adapts to large-scale, distributed heterogeneous systems in order to simplify their programming.
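The skeleton idea, a pre-implemented parallel pattern parameterized by a user function over a distributed vector, can be sketched conceptually. This is a Python toy, not SkelCL's actual C++/OpenCL API; a thread pool over list blocks stands in for multiple GPU devices:

```python
from concurrent.futures import ThreadPoolExecutor

class Vector:
    """Toy stand-in for an abstract distributed vector: the data is split
    into one block per (simulated) device."""
    def __init__(self, data, devices=2):
        n = len(data)
        self.blocks = [data[i * n // devices:(i + 1) * n // devices]
                       for i in range(devices)]

def map_skeleton(f):
    """'Map' skeleton: build a function that applies f to every element,
    launching one task per device block (the device kernels in SkelCL)."""
    def run(vec):
        with ThreadPoolExecutor() as pool:
            blocks = list(pool.map(lambda b: [f(x) for x in b], vec.blocks))
        out = Vector([], devices=len(blocks))
        out.blocks = blocks
        return out
    return run

# The user supplies only the per-element function; distribution is implicit.
square = map_skeleton(lambda x: x * x)
v = square(Vector([1, 2, 3, 4, 5, 6], devices=3))
flat = [x for block in v.blocks for x in block]
```

The point of the pattern is that the user writes `lambda x: x * x` once, while block placement, per-device launch, and result gathering live inside the skeleton, which is what removes the multi-GPU boilerplate the abstract describes.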