Offloading computations to multiple GPUs is not an easy task: it requires decomposing data, distributing computations, and handling communication manually. Drop-in GPU libraries have made it easy to offload computations to multiple GPUs by hiding this complexity inside library calls, but such encapsulation prevents data reuse between successive kernel invocations and results in redundant communication. This limitation exists in multi-GPU libraries like CUBLASXT. In this paper, we introduce SemCache++, a semantics-aware GPU cache that automatically manages communication between the CPU and multiple GPUs and optimizes it by using caching to eliminate redundant transfers. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. Our caching technique is efficient: it uses a two-level caching directory to track matrices and sub-matrices. Experimental results show that our system eliminates redundant communication and delivers significant performance improvements over multi-GPU libraries like CUBLASXT.
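A minimal host-side sketch of the caching idea behind such a drop-in library (hypothetical names; this is not the actual SemCache++ directory, which additionally tracks sub-matrices and uses virtual-memory protection to detect stale copies):

#include <cuda_runtime.h>
#include <unordered_map>

// One directory entry per host matrix: where its device copy lives
// and whether that copy is still valid.
struct CacheEntry {
    void*  dev_ptr;
    size_t bytes;
    bool   valid;   // cleared when the CPU modifies the host copy
};

static std::unordered_map<const void*, CacheEntry> directory;

// Return a device copy of `host`, transferring only on a cache miss.
void* get_or_transfer(const void* host, size_t bytes) {
    auto it = directory.find(host);
    if (it != directory.end() && it->second.valid)
        return it->second.dev_ptr;               // hit: skip the PCIe transfer
    void* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    directory[host] = {dev, bytes, true};
    return dev;
}

// Called whenever the CPU writes the matrix, so the cached copy is stale.
void invalidate(const void* host) {
    auto it = directory.find(host);
    if (it != directory.end()) it->second.valid = false;
}

Successive library calls that reuse the same input matrix then hit in the directory and reuse the resident device copy instead of re-sending it over PCIe.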
Great interest has been given to the Nonnegative Matrix Factorization (NMF) technique due to its ability to extract highly interpretable parts from data sets. Gene expression analysis is one of the most popular applications of NMF in bioinformatics. Nonetheless, its usage is hindered by its computational complexity when processing large data sets. In this paper, we present two parallel implementations of NMF. The first version uses CUDA on a Graphics Processing Unit (GPU); large input matrices are transferred and processed iteratively, block by block. The second implementation distributes data among multiple GPUs synchronized through MPI (Message Passing Interface). When analyzing large data sets with two and four GPUs, it performs 2.3 and 4.13 times faster, respectively, than the single-GPU version; this represents a speedup of about 120 times over a conventional CPU. These super-linear speedups are achieved when the data portions assigned to each GPU are small enough to be transferred only once.
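For context, the multiplicative update rules that such implementations parallelize (Lee and Seung's, for V ≈ WH) reduce to matrix products plus an element-wise step. A hedged CUDA sketch of the element-wise update for H; the numerator W^T V and denominator W^T W H are assumed to have been formed beforehand with cublasSgemm:

#include <cuda_runtime.h>

// Element-wise part of the multiplicative update H <- H * (W^T V) / (W^T W H).
// `num` holds W^T V and `den` holds W^T W H, both of size n = k * m.
__global__ void multiplicative_update(float* H, const float* num,
                                      const float* den, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        H[i] *= num[i] / (den[i] + 1e-9f);  // small epsilon avoids division by zero
}

The symmetric update W <- W * (V H^T) / (W H H^T) reuses the same kernel with different operands.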
Solving Combinatorial Optimization Problems (COPs) exactly with a Branch-and-Bound (B&B) algorithm requires a huge amount of computational resources. We therefore recently investigated designing B&B algorithms on top of graphics processing units (GPUs) using a parallel bounding model, which parallelizes the evaluation of the lower bounds over pools of sub-problems. The results demonstrated that the size of the evaluated pool has a significant impact on the performance of B&B and that it depends strongly on the problem instance being solved. In this paper, we design an adaptive parallel B&B algorithm for solving permutation-based combinatorial optimization problems such as the Flow-shop Scheduling Problem (FSP) on GPU accelerators. To do so, we propose a dynamic heuristic for parameter auto-tuning at runtime. Another challenge of this pioneering work is to exploit larger degrees of parallelism by combining the computational power of multiple GPU devices. The approach has been applied to the permutation flow-shop problem. Extensive experiments have been carried out on well-known FSP benchmarks using an Nvidia Tesla S1070 Computing System equipped with two Tesla T10 GPUs. Compared to a CPU-based execution, speedups of up to ×105 are achieved for large problem instances.
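One plausible shape for such a runtime auto-tuning heuristic (a sketch under assumptions, not the paper's actual heuristic): time each bounding launch with CUDA events and keep growing the pool while the measured throughput improves.

#include <cuda_runtime.h>

// Grow the pool size while sub-problems-per-millisecond improves, and
// stop at the first size where throughput drops. `evaluate_bounds` is a
// placeholder for launching the bounding kernel on `pool_size` sub-problems.
size_t tune_pool_size(size_t pool_size, void (*evaluate_bounds)(size_t)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float  best_rate = 0.0f;
    size_t best_size = pool_size;
    for (int trial = 0; trial < 8; ++trial) {
        cudaEventRecord(start);
        evaluate_bounds(pool_size);          // bound `pool_size` sub-problems
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        float rate = pool_size / ms;         // sub-problems bounded per ms
        if (rate <= best_rate) break;        // throughput dropped: stop growing
        best_rate = rate;
        best_size = pool_size;
        pool_size *= 2;                      // still improving: try a larger pool
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best_size;
}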
Communication overhead is one of the main challenges of the exascale era, in which millions of compute cores are expected to collaborate on solving complex jobs. However, many algorithms will not scale because they require complex global communication and synchronisation. In order to perform the communication as fast as possible, contention, blocking, and deadlocks must be avoided. Recently, we developed an evolutionary tool that produces fast and safe communication schedules reaching the lower bound of the theoretical time complexity. Unfortunately, the execution time of the evolution process rises to tens of hours, even when run on a multi-core processor. In this paper, we propose a revised implementation accelerated by a single Graphics Processing Unit (GPU), delivering a speed-up of 5 compared to a quad-core CPU. Subsequently, we introduce an extended version employing up to 8 GPUs in a shared-memory environment, offering a speed-up of almost 30. This significantly extends the range of interconnection topologies we can cover.
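A common shared-memory pattern for driving several GPUs from one process, of the kind such an extension implies (a sketch, not the authors' code; `evaluate_fitness` is a hypothetical stand-in for the schedule-evaluation kernel): one OpenMP host thread per device, each evaluating its own slice of the population.

#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical per-device kernel: score one slice of candidate schedules.
__global__ void evaluate_fitness(const int* schedules, float* fitness, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fitness[i] = 0.0f;            // placeholder for the real scoring
}

void evaluate_population(const int* schedules, float* fitness,
                         int population, int genome_len) {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    // One host thread per GPU; each thread owns one device and one slice.
    #pragma omp parallel num_threads(num_gpus)
    {
        int gpu   = omp_get_thread_num();
        int chunk = (population + num_gpus - 1) / num_gpus;
        int first = gpu * chunk;
        int count = first + chunk > population ? population - first : chunk;
        if (count > 0) {
            cudaSetDevice(gpu);
            int* d_sched; float* d_fit;
            cudaMalloc(&d_sched, (size_t)count * genome_len * sizeof(int));
            cudaMalloc(&d_fit, count * sizeof(float));
            cudaMemcpy(d_sched, schedules + (size_t)first * genome_len,
                       (size_t)count * genome_len * sizeof(int),
                       cudaMemcpyHostToDevice);
            evaluate_fitness<<<(count + 255) / 256, 256>>>(d_sched, d_fit, count);
            cudaMemcpy(fitness + first, d_fit, count * sizeof(float),
                       cudaMemcpyDeviceToHost);
            cudaFree(d_sched);
            cudaFree(d_fit);
        }
    }
}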
Due to its high performance/cost ratio, a single PC equipped with multiple GPUs is an attractive platform for large-scale scene rendering and visualization. In this paper, we present a compositeless parallel rendering algorithm for shared-memory multi-GPU systems. Our algorithm is based on a hybrid sort-first and sort-last rendering mode. By utilizing the asynchronous DMA transfers of modern video cards, we implement asynchronous image read-back and implicit image compositing. The compositeless algorithm removes the image-compositing stage of parallel rendering entirely, in contrast with traditional parallel rendering methods. Theoretical analysis and experiments demonstrate that our algorithm is practical and scalable for large-scale scene rendering and high-resolution display.
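The asynchronous-DMA idea generalizes beyond graphics APIs; purely as an illustration (the paper itself works at the graphics-API level, e.g. with pixel buffer objects), a CUDA sketch of double-buffered asynchronous image read-back using pinned host memory and streams:

#include <cuda_runtime.h>

// While frame f is being copied back over DMA in one stream, the GPU is
// free to produce frame f+1 in the other; read-back overlaps rendering.
void readback_loop(float* const d_frame[2], size_t frame_bytes, int frames) {
    float* h_frame[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost(&h_frame[b], frame_bytes);  // pinned memory enables async DMA
        cudaStreamCreate(&stream[b]);
    }
    for (int f = 0; f < frames; ++f) {
        int b = f & 1;
        cudaStreamSynchronize(stream[b]);          // wait until this buffer is free
        // render_frame<<<grid, block, 0, stream[b]>>>(d_frame[b]);  // produce frame f
        cudaMemcpyAsync(h_frame[b], d_frame[b], frame_bytes,
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaFreeHost(h_frame[b]);
        cudaStreamDestroy(stream[b]);
    }
}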
Recently, the Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. To improve the simulation efficiency of complex flow phenomena in computational fluid dynamics, a CUDA-based large eddy simulation algorithm using multiple GPUs is proposed. Our implementation adopts the "collision after propagation" scheme and performs the propagation step through global-memory read transactions. For simplicity, the working set is split into equal sub-domains, one assigned to each GPU. Using recently released hardware, up to four GPUs can be controlled by a single CPU thread and run in parallel. The results show that our multi-GPU implementation can perform simulations on a rather large scale (10240×10240 meshes) even with double-precision floating-point arithmetic, and achieves a 190× speedup over the sequential CPU implementation.
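To illustrate the "collision after propagation" (pull) scheme named above, a deliberately minimal one-dimensional D1Q3 lattice-Boltzmann kernel (an illustrative stand-in; the paper's solver is a full 2D large eddy simulation): each thread first pulls post-streaming distributions from neighbouring cells via global-memory reads, then applies the BGK collision locally.

#include <cuda_runtime.h>

// D1Q3 lattice: velocities c = {0, +1, -1}, weights w = {2/3, 1/6, 1/6}.
// Pull scheme: read neighbours' populations from the previous time step
// (the propagation), then collide entirely in registers.
__global__ void lbm_step(const double* f0, const double* f1, const double* f2,
                         double* g0, double* g1, double* g2,
                         int n, double omega) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= n) return;

    // Propagation by global-memory reads (periodic boundaries).
    double p0 = f0[x];                 // rest population stays in place
    double p1 = f1[(x - 1 + n) % n];   // right-moving: arrives from x-1
    double p2 = f2[(x + 1) % n];       // left-moving: arrives from x+1

    // Macroscopic density and velocity of the propagated state.
    double rho = p0 + p1 + p2;
    double u   = (p1 - p2) / rho;
    double u2  = u * u;

    // BGK collision: relax each population toward its local equilibrium.
    double eq0 = (2.0 / 3.0) * rho * (1.0 - 1.5 * u2);
    double eq1 = (1.0 / 6.0) * rho * (1.0 + 3.0 * u + 3.0 * u2);
    double eq2 = (1.0 / 6.0) * rho * (1.0 - 3.0 * u + 3.0 * u2);
    g0[x] = p0 + omega * (eq0 - p0);
    g1[x] = p1 + omega * (eq1 - p1);
    g2[x] = p2 + omega * (eq2 - p2);
}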
For the efficient simulation of fluid flows governed by a wide range of scales, a wavelet-based adaptive multi-resolution solver on heterogeneous parallel architectures is proposed for computational fluid dynamics. Both data- and task-based parallelism are exploited on multi-core and multi-GPU architectures to optimize the efficiency of a high-order wavelet-based multi-resolution adaptive scheme, combined with a 6th-order adaptive central-upwind weighted essentially non-oscillatory scheme for the discretization of the governing equations. A modified grid-block data structure and a new boundary reconstruction method are introduced. A new approach for detecting small scales without using buffer levels is introduced to obtain additional speed-up by minimizing the number of required blocks. Validation simulations are performed for a double Mach reflection with different refinement criteria, demonstrating the accuracy and computational performance of the solver.
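The core of such wavelet-based adaptivity is a detail-coefficient test: a region stays refined only where interpolation from the coarser level fails to reproduce the fine-level data within a threshold. A hedged 1D sketch of this criterion (greatly simplified; the solver's actual transform is higher order and multi-dimensional):

#include <cuda_runtime.h>
#include <math.h>

// The detail coefficient at an odd fine-grid point is the difference
// between the stored value and its prediction by linear interpolation
// from the even (coarse) points. The region is flagged for refinement
// wherever |detail| exceeds the threshold eps.
__global__ void flag_refinement(const double* fine, int n,
                                double eps, int* needs_refine) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int odd = 2 * i + 1;                     // odd points carry the details
    if (odd + 1 >= n) return;

    double predicted = 0.5 * (fine[odd - 1] + fine[odd + 1]);
    double detail    = fine[odd] - predicted;
    if (fabs(detail) > eps)
        atomicOr(needs_refine, 1);           // one shared flag for the region
}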
Dense Dynamic Programming on Multi GPU. Boyer, Vincent; El Baz, Didier; Elkihel, Moussa. 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, February 2011.
The implementation via CUDA of a hybrid dense dynamic programming method for knapsack problems on a multi-GPU architecture is considered. Tests are carried out on a Bull cluster with Tesla S1070 computing systems. A first series of computational results shows substantial speedup; the speedup factor is close to 28 with two GPUs.
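The dense dynamic programming recurrence for the 0/1 knapsack, f_k(c) = max(f_{k-1}(c), f_{k-1}(c - w_k) + p_k), parallelizes naturally over capacities. A minimal single-GPU sketch of this recurrence (the paper's hybrid multi-GPU method further partitions the table across devices):

#include <cuda_runtime.h>

// One stage of the dense 0/1 knapsack recurrence: each thread updates
// the entry for one capacity c, reading only the previous stage, so all
// capacities can be processed in parallel.
__global__ void knapsack_stage(const int* prev, int* next,
                               int capacity, int w, int p) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c > capacity) return;
    int best = prev[c];                       // skip item k
    if (c >= w && prev[c - w] + p > best)     // or take item k
        best = prev[c - w] + p;
    next[c] = best;
}

// Host loop: double-buffer the table and iterate over the n items.
void knapsack(const int* w, const int* p, int n, int capacity, int* result) {
    int *d_a, *d_b;
    size_t bytes = (capacity + 1) * sizeof(int);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);                // f_0(c) = 0 for all c
    for (int k = 0; k < n; ++k) {
        knapsack_stage<<<(capacity + 256) / 256, 256>>>(d_a, d_b,
                                                        capacity, w[k], p[k]);
        int* tmp = d_a; d_a = d_b; d_b = tmp; // swap stages
    }
    cudaMemcpy(result, d_a + capacity, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
}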
Due to its interactive, high-quality rendering abilities, the GPU ray-casting volume rendering method is very popular for post-processing in scientific and engineering computing applications. The method, however, is limited by GPU memory: it fails when facing big-data applications. This problem can be addressed with massively parallel approaches, but the complex architecture of current massively parallel machines makes it difficult to implement algorithms that are both adaptable and scalable in parallel. Owing to this dual complexity of computing environments and software architecture, developing high-performance algorithms is becoming increasingly difficult. In this paper, we present a distributed multi-node, GPU-accelerated parallel rendering scheme that seamlessly couples low-level computing environments with high-level visualization software. Experimental results show that our scheme offers stable and efficient run-time support for our multi-GPU ray-casting volume renderer on a visualization cluster. When using 8 GPU nodes to visualize 17 GB of scientific data in a single time step, interactive high-quality volume rendering requires less than one second per frame, an order of magnitude faster than the traditional parallel ray-casting method running on 512 processor cores.
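For reference, the per-ray inner loop that such renderers distribute is a front-to-back compositing march through the volume. A deliberately minimal single-GPU sketch with orthographic rays along z and a hypothetical greyscale transfer function (a real distributed renderer splits the volume into bricks per node and composites the partial images in depth order):

#include <cuda_runtime.h>

// Minimal ray caster: one thread per pixel marches an orthographic ray
// along z through a dense nx*ny*nz density volume, compositing samples
// front to back until the ray is (nearly) opaque.
__global__ void raycast(const float* volume, int nx, int ny, int nz,
                        float* image) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float color = 0.0f, alpha = 0.0f;
    for (int z = 0; z < nz && alpha < 0.99f; ++z) {   // early ray termination
        float density = volume[((size_t)z * ny + y) * nx + x];
        float a = density * 0.05f;                    // toy transfer function
        color += (1.0f - alpha) * a * density;        // front-to-back compositing
        alpha += (1.0f - alpha) * a;
    }
    image[(size_t)y * nx + x] = color;
}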