Ab initio study of the toluene dimer Rogers, David M.; Hirst, Jonathan D.; Lee, Edmond P.F. ...
Chemical Physics Letters,
08/2006, Volume: 427, Issue: 4
Journal Article
Peer reviewed
Figure caption: Non-counterpoise-corrected MP2/6-31++G** optimized geometries of the three toluene dimers. The CCSD(T) energies suggest the energy ordering is unchanged, with the lowest-energy isomer on the left.
We study different conformers of the toluene dimer using unconstrained geometry optimizations at the MP2 level of theory. We reoptimize these employing counterpoise-corrected MP2 gradients, and subsequently perform single-point counterpoise-corrected CCSD(T) interaction energy calculations. An antiparallel-stacked structure is found to be the most stable of the three isomers and has an interaction energy that is narrowly below that of a cross structure; a parallel-stacked structure is the least stable of the three isomers. We find no evidence for a stable T-shaped isomer, that is, no minimum on the potential energy surface corresponding to this structure.
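The counterpoise-corrected interaction energies mentioned in this abstract follow the standard Boys–Bernardi scheme, in which each monomer energy is evaluated in the full dimer basis (ghost atoms included). As a sketch (the notation below is ours, not taken from the paper), with subscripts denoting the system and superscripts the basis set:

```latex
E_{\mathrm{int}}^{\mathrm{CP}} \;=\; E_{AB}^{AB} \;-\; E_{A}^{AB} \;-\; E_{B}^{AB}
```

Here $E_{A}^{AB}$ is the energy of monomer A computed in the presence of the ghost basis functions of B, which removes the basis-set superposition error that would otherwise artificially stabilize the dimer.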
Contemporary Graphics Processing Units (GPUs) are used to accelerate highly parallel compute workloads. For the last decade, researchers in academia and industry have used cycle-level GPU architecture simulators to evaluate future designs. This paper performs an in-depth analysis of commonly accepted GPU simulation methodology, examining the effect both the workload and the choice of instruction set architecture have on the accuracy of a widely-used simulation infrastructure, GPGPU-Sim. We analyze numerous aspects of the architecture, validating the simulation results against real hardware. Based on a characterized set of over 1700 GPU kernels, we demonstrate that while the relative accuracy of compute-intensive workloads is high, inaccuracies in modeling the memory system result in much higher error when memory performance is critical. We then perform a case study using a recently proposed GPU architecture modification, demonstrating that the cross-product of workload characteristics and instruction set architecture choice can have an effect on the predicted efficacy of the technique.
Pagoda Yeh, Tsung Tai; Sabne, Amit; Sakdhnagool, Putt ...
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
01/2017
Conference Proceeding
Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU.
GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization.
This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
Judging a type by its pointer: optimizing GPU virtual functions Zhang, Mengchi; Alawneh, Ahmad; Rogers, Timothy G.
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems,
04/2021
Conference Proceeding
Open access
Programmable accelerators aim to provide the flexibility of traditional CPUs with significantly improved performance. A well-known impediment to the widespread adoption of programmable accelerators, like GPUs, is the software engineering overhead involved in porting the code. Existing support for C++ on GPUs allows programmers to port polymorphic code with little effort. However, the overhead from the virtual functions introduced by polymorphic code has not been well studied or mitigated on GPUs.
To alleviate the performance cost of virtual functions, we propose two novel techniques that determine an object’s type based only on the object’s address, without accessing the object’s embedded virtual table pointer. The first technique, Coordinated Object Allocation and function Lookup (COAL), is a software-only solution that allocates objects by type and uses the compiler and runtime to find the object’s vTable without accessing an embedded pointer. COAL improves performance by 80%, 47%, and 6% over contemporary CUDA, prior research, and our newly-proposed type-based allocator, respectively. The second solution, TypePointer, introduces a hardware modification that allows unused bits in the object pointer to encode the object’s type, improving performance by 90%, 56%, and 12% over CUDA, prior work, and our new allocator. TypePointer can also be used with the default CUDA allocator to achieve an 18% performance improvement without modifying object allocation.
Characterizing Massively Parallel Polymorphism Zhang, Mengchi; Alawneh, Ahmad; Rogers, Timothy G.
2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),
03/2021
Conference Proceeding
GPU computing has matured to include advanced C++ programming features. As a result, complex applications can potentially benefit from the continued performance improvements made to contemporary GPUs with each new generation. Tighter integration between the CPU and GPU, including a shared virtual memory space, increases the usability of productive programming paradigms traditionally reserved for CPUs, like object-oriented programming. Programmers are no longer forced to restructure both their code and data for GPU acceleration. However, the implementation and performance implications of advanced C++ on massively multithreaded accelerators have not been well studied. In this paper, we study the effects of runtime polymorphism on GPUs. We first detail the implementation of virtual function calls in contemporary GPUs using microbenchmarking. We then propose Parapoly, the first open-source polymorphic GPU benchmark suite. Using Parapoly, we further characterize the overhead caused by executing dynamic dispatch on GPUs using massively scaled CPU workloads. Our characterization demonstrates that the optimization space for runtime polymorphism on GPUs is fundamentally different than for CPUs. Where indirect branch prediction and ILP extraction strategies have dominated the work on CPU polymorphism, GPUs are fundamentally limited by excessive memory system contention caused by virtual function lookup and register spilling. Using the results of our study, we enumerate several pitfalls when writing polymorphic code for GPUs and suggest several new areas of system and architecture research that can help alleviate overhead.
Machine learning (ML) has recently emerged as an important application driving future architecture design. Traditionally, architecture research has used detailed simulators to model and measure the impact of proposed changes. However, current open-source, publicly available simulators lack support for running a full ML stack like PyTorch. High-confidence, cycle-accurate simulations are crucial for architecture research and without them, it is difficult to rapidly prototype new ideas. In this paper, we describe changes we made to GPGPU-Sim, a popular, widely used GPU simulator, to run ML applications that use cuDNN and PyTorch, two widely used frameworks for running Deep Neural Networks (DNNs). This work has the potential to enable significant microarchitectural research into GPUs for DNNs. Our results show that the modified simulator, which has been made publicly available with this paper (source code available at https://github.com/gpgpu-sim/gpgpu-sim_distribution, dev branch), provides execution time results within 18% of real hardware. We further use it to study other ML workloads and demonstrate how the simulator identifies opportunities for architectural optimization that prior tools are unable to provide.