As the price-performance ratio of GPUs improves, many systems can accommodate more than one GPU per node, each capable of powerful rendering. It is therefore important to organize the parallel rendering pipeline so that all compute units of the system are fully exploited. However, many parallel rendering systems couple the hardware rendering stage with the composition stage in the display thread, which frequently causes GPU stalls. In this paper, we describe a decoupled parallel rendering approach that enables the two stages to execute in parallel. With the frame buffer held in main memory, the total image rendering time is determined solely by GPU rendering capacity when the rendering task is large enough. Both theoretical analysis and experimental results show that our method performs much better than the coupled parallel rendering method. We also test the scalability of the approach and observe linear speedup with the number of GPUs when the rendering task is large enough. The approach is easy to implement, and any parallel rendering application can benefit from it.
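A minimal sketch of the decoupling idea, assuming a simple producer-consumer design (all names here are illustrative, not the paper's API): the rendering stage deposits frames into a bounded queue backed by main memory while the composition stage consumes them concurrently, so the renderer never blocks on composition.

```python
# Decoupled two-stage pipeline sketch: a render stage and a composition
# stage run concurrently, connected by a bounded queue that plays the role
# of the frame buffer in main memory. Illustrative only.
import queue
import threading

def run_pipeline(num_frames, render, compose):
    """Run render and composition stages concurrently via a bounded queue."""
    frames = queue.Queue(maxsize=4)   # "frame buffer" in main memory
    composed = []

    def render_stage():
        for i in range(num_frames):
            frames.put(render(i))     # renderer proceeds as soon as a slot frees
        frames.put(None)              # sentinel: no more frames

    def composition_stage():
        while True:
            frame = frames.get()
            if frame is None:
                break
            composed.append(compose(frame))

    t = threading.Thread(target=render_stage)
    t.start()
    composition_stage()
    t.join()
    return composed
```

With a queue deeper than one frame, a slow composition step delays only the consumer side; in the coupled design both stages would share one thread and the renderer would stall for the full composition time of every frame.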
In this paper, a novel implementation of the distributed 3D Fast Fourier Transform (FFT) on a multi-GPU platform using CUDA is presented. The 3D FFT is the core of many simulation methods, thus its fast calculation is critical. The main bottleneck of the distributed 3D FFT is the global data exchange that must be performed. The latest version of CUDA introduces direct GPU-to-GPU transfers using a Unified Virtual Address space (UVA), which provides new possibilities for optimising the communication part of the FFT. Here, we propose different implementations of the distributed 3D FFT, investigate their behaviour, and compare their performance with the single-GPU CUFFT and CPU-based FFTW libraries. In particular, we demonstrate the advantage of direct GPU-to-GPU transfers over data exchanges via host main memory. Our preliminary results show that running the distributed 3D FFT with four GPUs can bring a 12% speedup over the single-GPU CUFFT while also enabling the calculation of 3D FFTs of larger datasets. Replacing the global data exchange via host memory with direct GPU-to-GPU transfers reduces the execution time by up to 49%. This clearly shows that direct GPU-to-GPU transfers are the key factor in obtaining good performance on multi-GPU systems.
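The global exchange mentioned above can be pictured as an all-to-all transpose: with a 1-D slab decomposition, each device first transforms its locally complete dimensions, then the grid is re-partitioned so the remaining dimension becomes local. A toy sketch of that re-partitioning step on plain Python lists (the decomposition shape and function name are illustrative; in the real system each block move is a host-mediated or direct GPU-to-GPU copy):

```python
# Sketch of the global data exchange in a slab-decomposed distributed FFT:
# device d initially owns a contiguous block of rows; after the all-to-all
# it owns a contiguous block of columns of every row, so the second FFT
# dimension is now local. Illustrative only, no FFT math is performed here.

def all_to_all_transpose(slabs):
    """Re-partition row slabs into column slabs (the FFT's global exchange)."""
    P = len(slabs)                       # number of devices
    rows_per = len(slabs[0])             # rows held per device
    cols = len(slabs[0][0])
    cols_per = cols // P                 # columns each device will own
    out = []
    for d in range(P):
        slab = []
        for src in range(P):             # gather one block from every peer
            for r in range(rows_per):    # (a direct GPU-to-GPU copy in the
                row = slabs[src][r]      #  UVA-based variant)
                slab.append(row[d * cols_per:(d + 1) * cols_per])
        out.append(slab)
    return out
```

Each device sends P-1 blocks and receives P-1 blocks; whether those block copies bounce through host memory or go peer-to-peer is exactly the difference the abstract's 49% figure measures.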
This paper analyses several parallel approaches for the development of a physical model of non-linear ODT for application in velocimetry techniques. The main benefits of its application in HPIV are high accuracy with non-damaging radiation and the imaging capability to recover information from the vessel wall of the flow. ODT-HPIV is therefore suitable for microfluidic devices and biofluidic applications. Our physical model is based on an iterative method that uses double-precision complex numbers and therefore has a high computational cost. As a result, High Performance Computing is necessary for both implementation and validation of the model. Concretely, the model has been parallelized on different architectures: shared-memory multiprocessors and graphics processing units (GPUs) using CUDA.
Auto-Tuning techniques have been used in the design of routines in recent years. The goal is to develop routines that automatically adapt to the conditions of the computational system, so that efficient executions are obtained independently of the user's experience. This paper explores programming routines that can be automatically adapted to the conditions of the computational system, making it possible to apply the Auto-Tuning methodology to the representation of landform attributes on multicore and multi-GPU systems.
We consider a parallel eigensolver for generalized eigenvalue problems on distributed GPU systems. In this paper, we propose a distributed parallel implementation of the Sakurai-Sugiura (SS) eigenvalue solver for generalized eigenvalue problems with real symmetric matrices, built on GPU linear algebra libraries. In the SS method, the target subspace is constructed from the solutions of linear systems, and solving these linear systems dominates the cost of the method. By assigning independent linear systems to each GPU, coarse-grained parallelism is obtained and high scalability can be expected. We also propose a performance model of this implementation and evaluate its parallel performance using numerical examples that involve medium-size dense matrices.
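The coarse-grained parallelism described above can be sketched in a few lines, assuming the simplest possible setting: the independent linear systems are farmed out to workers, one per device in the real implementation (plain threads and 2x2 Cramer's-rule solves here; `solve2x2`, `solve_all` and the sample systems are illustrative, not the paper's code).

```python
# Coarse-grained parallelism sketch: each worker solves a disjoint subset of
# independent linear systems, with no communication until the solutions are
# gathered to build the target subspace. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

def solve2x2(system):
    """Solve ((a, b), (c, d)) x = (e, f) by Cramer's rule (stand-in for a
    real per-GPU linear solver)."""
    (a, b), (c, d) = system[0]
    e, f = system[1]
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - e * c) / det)

def solve_all(systems, num_workers):
    # One worker per "GPU"; systems are independent, so no synchronization
    # is needed inside the map.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(solve2x2, systems))
```

Because the systems are independent, the speedup of this stage is limited mainly by load balance across devices, which is what makes the high scalability mentioned above plausible.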
This paper presents a multi-GPU accelerated fast integral equation solver for efficient modeling of antennas mounted on complex and large platforms. The antenna system with a surface-wire configuration is characterized using the surface-wire integral equation (SWIE). The SWIE is solved using the Precorrected-FFT (P-FFT) method on multiple GPUs. Special mapping schemes are proposed to efficiently map the P-FFT algorithm to the multi-GPU platform. The very good performance achieved validates the proposed multi-GPU mapping schemes.
A dramatic improvement in energy efficiency is mandatory for sustainable supercomputing and has been identified as a major challenge. Affordable energy solutions continue to be of great concern in the development of the next generation of supercomputers. Low-power processors, dynamic control of processor frequency and heterogeneous systems have been proposed to mitigate energy costs. However, the entire software stack must be re-examined with respect to its ability to improve efficiency in terms of energy as well as performance. To address this need, a better understanding of the energy behavior of applications is essential. In this paper we explore the energy efficiency of common kernels used in high performance computing on a multi-GPU platform and compare our results with multicore CPUs. We implement these kernels using optimized libraries such as FFTW, CUBLAS and MKL. Our experiments demonstrate a relationship between energy consumption and the computation-communication characteristics of certain application kernels. In particular, we observe a correlation of 0.73 between energy consumption and GPU global memory accesses, and of 0.84 between power consumption and operations per unit time, signifying a strong positive relationship in both cases. We believe that our results will assist the HPC community in understanding the power/energy behavior of scientific kernels on multi-GPU platforms.
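The correlation coefficients quoted above are standard Pearson coefficients between a measured energy (or power) series and a hardware-counter series. A self-contained sketch of that computation, assuming nothing about the paper's actual data (the inputs in the usage example are invented):

```python
# Pearson correlation between two equal-length series, e.g. per-run energy
# consumption vs. GPU global memory accesses. Illustrative helper, not the
# paper's tooling.
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of xs and ys."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Values near +1 (such as the reported 0.73 and 0.84) indicate that the two quantities rise and fall together across the kernel runs.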
Offloading computations to multiple GPUs is not an easy task: it requires decomposing data, distributing computations and handling communication manually. Drop-in GPU libraries have made it easy to offload computations to multiple GPUs by hiding this complexity inside library calls, but such encapsulation prevents the reuse of data between successive kernel invocations, resulting in redundant communication. This limitation exists in multi-GPU libraries such as CUBLASXT. In this paper, we introduce SemCache++, a semantics-aware GPU cache that automatically manages communication between the CPU and multiple GPUs and optimizes it by eliminating redundant transfers through caching. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. Our caching technique is efficient; it uses a two-level caching directory to track matrices and sub-matrices. Experimental results show that our system eliminates redundant communication and delivers significant performance improvements over multi-GPU libraries such as CUBLASXT.
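A toy sketch of the caching idea, under the assumption of a simple validity-tracking directory (the names `CacheDirectory` and `ensure_on_gpu` are illustrative, not the SemCache++ API, and the real system adds a second directory level for sub-matrices): a transfer to a GPU is issued only when that GPU's copy is stale, so back-to-back library calls that reuse an operand pay for its transfer once.

```python
# Semantics-aware cache directory sketch: record which devices hold a fresh
# copy of each matrix, copy only on a miss, and invalidate GPU copies when
# the CPU writes. Illustrative only.

class CacheDirectory:
    def __init__(self):
        self.valid_on = {}        # matrix id -> set of devices with a fresh copy
        self.transfers = 0        # count of actual CPU<->GPU copies issued

    def ensure_on_gpu(self, mat, gpu):
        """Make `mat` available on `gpu`, copying only if its copy is stale."""
        holders = self.valid_on.setdefault(mat, {"cpu"})
        if gpu not in holders:    # cache miss: issue a real transfer
            self.transfers += 1
            holders.add(gpu)

    def write_on_cpu(self, mat):
        """A CPU write invalidates every cached GPU copy of `mat`."""
        self.valid_on[mat] = {"cpu"}
```

An encapsulated library without such a directory would behave as if every `ensure_on_gpu` call missed, re-sending unchanged operands on every invocation; that is the redundant communication the abstract describes.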
Great interest has been shown in the Nonnegative Matrix Factorization (NMF) technique due to its ability to extract highly interpretable parts from data sets. Gene expression analysis is one of the most popular applications of NMF in bioinformatics. Nonetheless, its usage is hindered by its computational complexity when processing large data sets. In this paper, we present two parallel implementations of NMF. The first uses CUDA on a single Graphics Processing Unit (GPU); large input matrices are transferred and processed blockwise in an iterative fashion. The second distributes data among multiple GPUs synchronized through MPI (Message Passing Interface). When analyzing large data sets with two and four GPUs, it performs 2.3 and 4.13 times faster than the single-GPU version, respectively, which is about 120 times faster than a conventional CPU. These super-linear speedups are achieved when the data portion assigned to each GPU is small enough to be transferred only once.
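For context, the iteration such implementations parallelize is typically the standard multiplicative-update NMF of Lee and Seung, minimizing the Frobenius error of V ≈ WH. A compact pure-Python sketch of one update step (the real versions run these updates blockwise on one or more GPUs; this toy uses nested lists and is not the paper's code):

```python
# Multiplicative-update NMF sketch on plain Python lists:
#   H <- H * (W^T V) / (W^T W H),  then  W <- W * (V H^T) / (W H H^T).
# eps guards against division by zero. Illustrative only.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf_step(V, W, H, eps=1e-9):
    """One multiplicative update of the factors W and H."""
    Wt = transpose(W)
    num = matmul(Wt, V)
    den = matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
         for hr, nr, dr in zip(H, num, den)]
    Ht = transpose(H)
    num = matmul(V, Ht)
    den = matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
         for wr, nr, dr in zip(W, num, den)]
    return W, H

def frob_err(V, W, H):
    """Squared Frobenius error of the approximation V ~ W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for vr, whr in zip(V, WH)
               for v, wh in zip(vr, whr))
```

Because every element of H is updated from W^T V and W^T W H (and symmetrically for W), the dominant cost is dense matrix products, which is what makes the blockwise GPU and MPI distributions in the abstract effective.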