Contemporary data center servers process thousands of similar, independent requests per minute. In the interest of programmer productivity and ease of scaling, workloads in data centers have shifted from single monolithic processes toward micro- and nanoservice software architectures. As a result, single servers are now packed with many threads executing the same, relatively small task on different data.
State-of-the-art data centers run these microservices on multi-core CPUs. However, the flexibility offered by traditional CPUs comes at an energy-efficiency cost. The Multiple Instruction Multiple Data execution model misses opportunities to aggregate the similarity in contemporary microservices. We observe that the Single Instruction Multiple Thread execution model, employed by GPUs, provides better thread scaling and has the potential to reduce frontend and memory system energy consumption. However, contemporary GPUs are ill-suited for the latency-sensitive microservice space.
To exploit the similarity in contemporary microservices, while maintaining acceptable latency, we propose the Request Processing Unit (RPU). The RPU combines elements of out-of-order CPUs with lockstep thread aggregation mechanisms found in GPUs to execute microservices in a Single Instruction Multiple Request (SIMR) fashion. To complement the RPU, we also propose a SIMR-aware software stack that uses novel mechanisms to batch requests based on their predicted control-flow, split batches based on predicted latency divergence and map per-request memory allocations to maximize coalescing opportunities. Our resulting RPU system processes 5.7× more requests/joule than multi-core CPUs, while increasing single thread latency by only 1.44×.
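The abstract describes the SIMR software stack's batching policy only at a high level. The sketch below is a minimal, hypothetical illustration of that idea, assuming each request carries a predicted control-flow signature and a predicted latency (the field names `predicted_path` and `predicted_latency`, the batch size, and the split ratio are all assumptions, not details from the paper):

```python
from collections import defaultdict

def form_simr_batches(requests, batch_size=32, latency_split_ratio=2.0):
    """Group requests by predicted control-flow signature, then split any
    batch whose predicted latencies diverge too much (hypothetical policy)."""
    by_path = defaultdict(list)
    for req in requests:
        by_path[req["predicted_path"]].append(req)

    batches = []
    for path_reqs in by_path.values():
        # Sorting by predicted latency keeps similar-latency requests together,
        # so one slow outlier does not stall an otherwise fast batch.
        path_reqs.sort(key=lambda r: r["predicted_latency"])
        batch = []
        for req in path_reqs:
            if batch and req["predicted_latency"] > \
                    latency_split_ratio * batch[0]["predicted_latency"]:
                batches.append(batch)  # split on predicted latency divergence
                batch = []
            batch.append(req)
            if len(batch) == batch_size:
                batches.append(batch)
                batch = []
        if batch:
            batches.append(batch)
    return batches
```

Grouping by predicted path keeps lockstep lanes on the same instructions, while the latency split keeps a long-running request from holding an entire batch past its deadline.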
Observational studies of high-frequency oscillatory ventilation in adults with the acute respiratory distress syndrome have demonstrated improvements in oxygenation. We designed a multicenter, randomized, controlled trial comparing the safety and effectiveness of high-frequency oscillatory ventilation with conventional ventilation in adults with acute respiratory distress syndrome; 148 adults with acute respiratory distress syndrome (Pa(O2)/fraction of inspired oxygen ≤ 200 mm Hg on 10 or more cm H2O positive end-expiratory pressure) were randomized to high-frequency oscillatory ventilation (n = 75) or conventional ventilation (n = 73). Applied mean airway pressure was significantly higher in the high-frequency oscillation group compared with the conventional ventilation group throughout the first 72 hours (p = 0.0001). The high-frequency oscillation group showed early (less than 16 hours) improvement in Pa(O2)/fraction of inspired oxygen compared with the conventional ventilation group (p = 0.008); however, this difference did not persist beyond 24 hours. Oxygenation index decreased similarly over the first 72 hours in both groups. Thirty-day mortality was 37% in the high-frequency oscillation group and was 52% in the conventional ventilation group (p = 0.102). The percentage of patients alive without mechanical ventilation at Day 30 was 36% and 31% in the high-frequency oscillation and conventional ventilation groups, respectively (p = 0.686). There were no significant differences in hemodynamic variables, oxygenation failure, ventilation failure, barotrauma, or mucus plugging between treatment groups. We conclude that high-frequency oscillation is a safe and effective mode of ventilation for the treatment of acute respiratory distress syndrome in adults.
Deadline-Aware Offloading for High-Throughput Accelerators Yeh, Tsung Tai; Sinclair, Matthew D.; Beckmann, Bradford M. ...
2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA),
02/2021
Conference Proceeding
Contemporary GPUs are widely used for throughput-oriented data-parallel workloads and increasingly are being considered for latency-sensitive applications in datacenters. Examples include recurrent neural network (RNN) inference, network packet processing, and intelligent personal assistants. These data-parallel applications have both high throughput demands and real-time deadlines (40μs-7ms). Moreover, the kernels in these applications have relatively few threads that do not fully utilize the device unless a large batch size is used. However, batching forces jobs to wait, which increases their latency, especially when realistic job arrival times are considered.

Previously, programmers have managed the tradeoffs associated with concurrent, latency-sensitive jobs by using a combination of GPU streams and advanced scheduling algorithms running on the CPU host. Although GPU streams allow the accelerator to execute multiple jobs concurrently, prior state-of-the-art solutions use the relatively distant CPU host to prioritize latency-sensitive GPU tasks. Thus, these approaches are forced to operate at a coarse granularity and cannot quickly adapt to rapidly changing program behavior.

We observe that fine-grain, device-integrated kernel schedulers efficiently meet the deadlines of concurrent, latency-sensitive GPU jobs. To overcome the limitations of software-only, CPU-side approaches, we extend the GPU queue scheduler to manage real-time deadlines. We propose a novel laxity-aware scheduler (LAX) that uses information collected within the GPU to dynamically vary job priority based on how much laxity jobs have before their deadline. Compared to contemporary GPUs, 3 state-of-the-art CPU-side schedulers, and 6 other advanced GPU-side schedulers, LAX meets the deadlines of 1.7X-5.0X more jobs and provides better energy-efficiency, throughput, and 99th-percentile tail latency.
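LAX's in-hardware mechanisms are not detailed in the abstract; the toy sketch below only illustrates the laxity metric itself (slack = absolute deadline minus current time minus remaining work), with made-up job names and numbers:

```python
def pick_next_job(jobs, now):
    """Laxity-aware pick: dispatch the job with the least slack.
    laxity = deadline - current time - remaining work;
    negative laxity means the deadline can no longer be met."""
    return min(jobs, key=lambda job: job["deadline"] - now - job["remaining"])

# Hypothetical job mix (units are arbitrary time ticks):
jobs = [
    {"name": "rnn_infer", "deadline": 100, "remaining": 30},  # laxity 70 at t=0
    {"name": "pkt_proc",  "deadline": 50,  "remaining": 45},  # laxity 5
    {"name": "assistant", "deadline": 200, "remaining": 20},  # laxity 180
]
print(pick_next_job(jobs, now=0)["name"])  # -> pkt_proc
```

Note how this differs from earliest-deadline-first: a job with a distant deadline but a lot of remaining work can still have less slack than a job due sooner.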
Accel-Sim Khairy, Mahmoud; Shen, Zhesheng; Aamodt, Tor M. ...
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA),
05/2020
Conference Proceeding
Open Access
In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true.
To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process.
We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprised of 11,440 kernel instances, from the machine learning benchmark suite DeepBench. DeepBench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs.
Finally, to highlight the effects of falling behind industry, this paper presents two case-studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
A variable warp size architecture Rogers, Timothy G.; Johnson, Daniel R.; O'Connor, Mike ...
2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA),
06/2015
Conference Proceeding
Open Access
This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS) which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications. Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.
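Why a smaller base warp size helps divergent code can be shown with a toy lane-utilization model (this ignores memory convergence slip, scheduling costs, and everything else VWS actually has to balance; the thread pattern is invented for illustration):

```python
def simt_efficiency(branch_outcomes, warp_size):
    """Average fraction of useful lanes when threads with differing branch
    outcomes inside one warp must execute each path serially."""
    total_slots = 0
    useful = 0
    for base in range(0, len(branch_outcomes), warp_size):
        warp = branch_outcomes[base:base + warp_size]
        # Each distinct outcome present in the warp costs one full-width pass.
        for outcome in set(warp):
            total_slots += len(warp)
            useful += warp.count(outcome)
    return useful / total_slots

# 64 threads whose branch outcome alternates in blocks of 4:
outcomes = [(i // 4) % 2 == 0 for i in range(64)]
print(simt_efficiency(outcomes, warp_size=4))   # -> 1.0 (warps stay uniform)
print(simt_efficiency(outcomes, warp_size=32))  # -> 0.5 (every warp diverges)
```

With warps of 4, each warp falls entirely on one side of the branch and no lanes idle; with warps of 32, every warp contains both outcomes and half the issue slots are wasted, which is the gap VWS targets.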
Cache-Conscious Wavefront Scheduling Rogers, Timothy G.; O'Connor, Mike; Aamodt, Tor M.
2012 45th Annual IEEE/ACM International Symposium on Microarchitecture,
12/2012
Conference Proceeding
Open Access
This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average 25% fewer L1 data cache misses which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
Highly multithreaded architectures introduce another dimension to fine-grained hardware cache management. The order in which the system's threads issue instructions can significantly impact the access stream seen by the caching system. This article studies a set of economically important server applications and presents the cache-conscious wavefront scheduling (CCWS) hardware mechanism, which uses feedback from the memory system to guide the issue-level thread scheduler and shape the access pattern seen by the first-level cache.
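The CCWS hardware is only summarized above; the sketch below is a loose software analogue of its feedback loop under simplifying assumptions (a small per-wavefront victim-tag list stands in for the locality detector, and a fixed score threshold stands in for its adaptive throttling):

```python
from collections import deque

class LocalityScorer:
    """Per-wavefront lost-locality detector (simplified): a hit in the
    victim-tag list means a line this wavefront recently lost from the L1
    was re-referenced, i.e. less contention could have kept it cached."""
    def __init__(self, victim_depth=8):
        self.victims = {}   # wavefront id -> deque of recently evicted tags
        self.scores = {}    # wavefront id -> lost-locality score
        self.depth = victim_depth

    def on_eviction(self, wid, tag):
        self.victims.setdefault(wid, deque(maxlen=self.depth)).append(tag)

    def on_miss(self, wid, tag):
        self.scores.setdefault(wid, 0)
        if tag in self.victims.get(wid, ()):
            self.scores[wid] += 1   # intra-wavefront locality was lost

    def active_wavefronts(self, all_wids, max_score=2):
        """Throttle: only wavefronts below the score threshold may issue,
        shrinking the working set competing for the L1."""
        allowed = [w for w in all_wids if self.scores.get(w, 0) <= max_score]
        return allowed or all_wids[:1]   # never starve the core entirely
```

A wavefront that keeps missing on tags it recently evicted accumulates score and is withheld from issue, which is the access-pattern shaping the abstract describes (the real mechanism adapts thresholds in hardware rather than using a constant).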
OBJECTIVE:Evidence-based practice recommendations abound, but implementation is often unstructured and poorly audited. We assessed the ability of a peer network to implement an evidence-based best practice protocol and to measure patient outcomes.
DESIGN:Consensus definition of spontaneous breathing trial followed by implementation in eight academic medical centers.
SETTING:Six medical, two surgical, and two combined medical/surgical adult intensive care units among eight academic medical centers.
STUDY POPULATION:Patients initiating mechanical ventilation through an endotracheal tube during a 12-wk interval formed the study population.
INTERVENTIONS:Adoption and implementation of a common spontaneous breathing trial protocol across multiple intensive care units.
MEASUREMENTS AND MAIN RESULTS:Seven hundred five patients had 3,486 safety screens for conducting a spontaneous breathing trial; 2,072 (59%) patients failed the safety screen. Another 379 (11%) patients failed a 2-min tolerance screen and 1,122 (34%) patients had a full 30-120 min spontaneous breathing trial performed. Seventy percent of eligible patients were enrolled. Only 55% of passing spontaneous breathing trials resulted in liberation from mechanical ventilatory support before another spontaneous breathing trial was performed.
CONCLUSIONS:Peer networks can be effective in promoting and implementing evidence-based best practices. Implementation of a best practice (spontaneous breathing trial) may be necessary for, but by itself insufficient to achieve, consistent and timely liberation from ventilator support.
Mitigating GPU Core Partitioning Performance Effects Barnes, Aaron; Shen, Fangjia; Rogers, Timothy G.
2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA),
2023-Feb.
Conference Proceeding
Modern GPU Streaming Multiprocessors (SMs) have several warp schedulers, execution units, and register file banks. To reduce area and energy consumption, recent generations divide SMs into sub-cores. Each sub-core contains a distinct warp scheduler, register file, and execution units, sharing L1 memory and scratchpad resources with sub-cores in the same SM. Although partitioning the SM into sub-cores decreases the area and energy demands of larger SMs, it comes at a performance cost. Warps assigned to a sub-core have access to only a fraction of the SM's resources, resulting in contention and imbalance issues. In this paper, we examine the effect SM sub-division has on performance and propose novel mechanisms to mitigate the negative impacts. We identify four orthogonal effects caused by sub-dividing SMs and demonstrate that two of these effects have a significant impact on performance in practice. Based on these findings, we propose register-bank-aware warp scheduling to avoid bank conflicts that arise when instruction operands are placed in the limited number of register file banks available to each sub-core, and randomly hashed sub-core assignment to mitigate imbalance issues. Our intelligent scheduling mechanisms result in an average 11.2% speedup across a diverse set of applications capturing 81% of the performance lost to SM sub-division.
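The register-bank-aware scheduling idea can be sketched in a few lines, assuming a simple bank mapping (register number modulo the bank count) and a bank count chosen purely for illustration; the paper's actual mapping and microarchitectural details are not reproduced here:

```python
NUM_BANKS = 2  # banks visible to one sub-core (illustrative, not the real count)

def has_bank_conflict(src_regs, num_banks=NUM_BANKS):
    """Two source operands mapped to the same register-file bank must be
    read over extra cycles; assume bank = register number mod num_banks."""
    banks = [r % num_banks for r in src_regs]
    return len(banks) != len(set(banks))

def pick_warp(ready_warps):
    """Greedy register-bank-aware issue: prefer a ready warp whose next
    instruction reads its operands from distinct banks."""
    for warp in ready_warps:  # ready_warps assumed ordered oldest-first
        if not has_bank_conflict(warp["next_src_regs"]):
            return warp
    return ready_warps[0]  # every candidate conflicts: fall back to oldest
```

The scheduler cannot change where the compiler placed operands, but it can reorder which warp issues so that conflicting reads overlap with conflict-free ones, which is the intuition behind the proposed mechanism.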
A SIMT Analyzer for Multi-Threaded CPU Applications Alawneh, Ahmad; Khairy, Mahmoud; Rogers, Timothy G.
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Conference Proceeding
The use of GPUs for general-purpose applications has drastically increased. However, the performance gain from porting multithreaded CPU workloads to massively parallel SIMT-based accelerators, like GPUs, is often unpredictable. Even with enough parallelism, programmers do not know if their CPU code will run well on a GPU without first investing the effort to refactor it into a GPGPU programming language. Most of this unpredictability stems from two key side-effects of the GPU's energy-efficient SIMT hardware: control-flow and memory divergence.

To alleviate this issue, we propose SIMTec, an analysis tool that computes the control-flow and memory divergence of arbitrary pre-compiled CPU binaries. The tool constructs and analyzes a dynamic control flow graph of the application, batches threads into warps and emulates the operation of a SIMT stack for each warp to compute the projected SIMT efficiency. Given each warp's execution mask, memory coalescing is computed using the addresses accessed by memory instructions from parallel threads. The tool reports the SIMT efficiency and memory divergence characteristics.

We validate SIMTec using a suite of 11 applications with both x86 CPU and CUDA GPU implementations on an NVIDIA Volta V100, demonstrating that SIMTec has a correlation factor of 1.00 and 0.98 for SIMT efficiency and memory divergence, respectively. To demonstrate the predictive power of SIMTec, we explore another 16 CPU workloads for which there is no 1:1 GPU implementation. We perform case studies on these applications that range from compute-intensive thread-parallel workloads to cloud-based request-parallel microservices. Using SIMTec, we demonstrate that many of these CPU-only workloads are amenable to SIMT acceleration as-is.
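The coalescing computation the abstract describes, counting memory transactions from the addresses of parallel threads under an execution mask, can be sketched as follows (the 128-byte granularity is a typical GPU cache-line size assumed for illustration; SIMTec's exact model is not given in the abstract):

```python
LINE_BYTES = 128  # assumed coalescing granularity (one L1 cache line)

def memory_transactions(addresses, active_mask):
    """Number of cache-line transactions one warp-wide load generates:
    active lanes touching the same 128B line coalesce into one access."""
    lines = {addr // LINE_BYTES
             for addr, active in zip(addresses, active_mask) if active}
    return len(lines)

# Fully coalesced: 32 consecutive 4-byte accesses hit one 128B line.
unit_stride = [i * 4 for i in range(32)]
# Fully divergent: a 128-byte stride puts every active lane on its own line.
big_stride = [i * 128 for i in range(32)]
print(memory_transactions(unit_stride, [True] * 32))  # -> 1
print(memory_transactions(big_stride, [True] * 32))   # -> 32
```

The execution mask matters because lanes disabled by control-flow divergence generate no accesses, so control-flow and memory divergence interact; that interaction is exactly what makes hand estimation of GPU suitability hard.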