Understanding GPU Power. Bridges, Robert A; Imam, Neena; Mintz, Tiffany M
ACM Computing Surveys, 12/2016, Volume 49, Issue 3
Journal Article
Peer reviewed
Open access
Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high-throughput applications. Although GPUs consume large amounts of power, their use for high-throughput applications facilitates state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. This work is a survey of GPU power modeling and profiling methods, with increased detail on noteworthy efforts. As direct measurement of GPU power is necessary for model evaluation and parameter initialization, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, correlate strongly with power use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling are discussed. Often building on the counter-based models, research efforts in GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulation for GPUs are included, along with their performance or functional simulator counterparts when appropriate. Last, possible directions for future research are discussed.
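The counter-based modeling idea the survey describes can be sketched as an ordinary least-squares fit of measured power against hardware-counter readings. This is an illustrative sketch only: the counter names and all values below are invented (synthetic readings generated from an assumed linear relation), not taken from any paper in the survey.

```python
import numpy as np

# Hypothetical per-sample counter readings (rows = samples; columns =
# counters, e.g. instructions issued and DRAM accesses). Values are
# synthetic, generated from an assumed linear power relation.
counters = np.array([
    [1.0e6, 2.0e4],
    [2.0e6, 1.0e4],
    [3.0e6, 5.0e4],
    [4.0e6, 3.0e4],
], dtype=float)
# "Measured" power in watts, consistent with p = 40 + 1e-5*c1 + 5e-4*c2.
measured_power_w = np.array([60.0, 65.0, 95.0, 95.0])

# Fit power ~ w0 + w1*c1 + w2*c2 by least squares.
X = np.column_stack([np.ones(len(counters)), counters])
weights, *_ = np.linalg.lstsq(X, measured_power_w, rcond=None)

def predict_power(sample):
    """Predict power draw (W) for one vector of counter readings."""
    return weights[0] + weights[1] * sample[0] + weights[2] * sample[1]
```

Real counter-based models in the literature differ in which counters they select and whether the relationship is linear, but the fit-then-predict structure is the common core.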
Programmer-guided reliability for extreme-scale applications. Bernholdt, David E; Elwasif, Wael R; Kartsaklis, Christos ...
The International Journal of High Performance Computing Applications, 09/2018, Volume 32, Issue 5
Journal Article
Peer reviewed
Open access
We present “programmer-guided reliability” (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumenting the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge of and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during execution. Finally, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment to make them more easily usable by the programmer.
In this paper, we describe an FPGA-based coprocessor architecture that performs a high-throughput branch-and-bound search of the space of phylogenetic trees corresponding to the number of input taxa. Our coprocessor architecture is designed to accelerate maximum-parsimony phylogeny reconstruction for gene-order and sequence data and is amenable to both exhaustive and heuristic tree searches. Our architecture exposes coarse-grain parallelism by dividing the search space among parallel processing elements (PEs), and each PE exposes fine-grain memory parallelism for its lower-bound computation, the kernel computation performed by each PE. Inter-PE communication is performed entirely on-chip. When using this coprocessor for maximum-parsimony reconstruction on gene-order data, it achieves a 40X improvement over software in search throughput, corresponding to a 14X end-to-end application improvement when all communication and system overheads are included.
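The lower-bound pruning that each PE performs is an instance of generic branch-and-bound. A minimal software sketch of that control structure (with a toy cost problem standing in for tree scoring; none of this is the paper's hardware design) looks like:

```python
def branch_and_bound(root, extend, lower_bound, is_complete, cost):
    """Depth-first branch-and-bound: explore children of each partial
    solution, pruning any branch whose lower bound cannot beat the best
    complete solution found so far."""
    best = [float("inf"), None]  # [best cost, best solution]

    def dfs(node):
        if is_complete(node):
            c = cost(node)
            if c < best[0]:
                best[0], best[1] = c, node
            return
        for child in extend(node):
            # Prune: a lower bound >= current best can never win.
            if lower_bound(child) < best[0]:
                dfs(child)

    dfs(root)
    return best[1], best[0]

# Toy problem: pick one number per position to minimize the sum.
options = [[3, 1], [2, 5], [4, 2]]
sol, c = branch_and_bound(
    [],
    extend=lambda n: [n + [x] for x in options[len(n)]],
    lower_bound=sum,              # remaining choices cost >= 0
    is_complete=lambda n: len(n) == 3,
    cost=sum,
)
```

In the coprocessor, the search space is split among PEs up front, and the quality of the lower bound determines how much of the tree space each PE can prune.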
QCOR. Mintz, Tiffany M.; McCaskey, Alexander J.; Dumitrescu, Eugene F. ...
ACM Journal on Emerging Technologies in Computing Systems, 04/2020, Volume 16, Issue 2
Journal Article
Peer reviewed
Open access
Quantum computing (QC) is an emerging computational paradigm that leverages the laws of quantum mechanics to perform elementary logic operations. Existing programming models for QC were designed with fault-tolerant hardware in mind, envisioning stand-alone applications. However, the susceptibility of near-term quantum computers to noise limits their stand-alone utility. To better leverage the limited computational strengths of noisy quantum devices, hybrid algorithms have been suggested whereby quantum computers are used in tandem with their classical counterparts in a heterogeneous fashion. This modus operandi calls out for a programming model and a high-level programming language that natively and seamlessly supports heterogeneous quantum-classical hardware architectures in a single-source-code paradigm. Motivated by the lack of such a model, we introduce a language extension specification, called QCOR, which enables single-source quantum-classical programming. Programs written using the QCOR library–based language extensions can be compiled to produce functional hybrid binary executables. After defining QCOR’s programming model, memory model, and execution model, we discuss how QCOR enables variational, iterative, and feed-forward QC. QCOR approaches quantum-classical computation in a hardware-agnostic heterogeneous fashion and strives to build on best practices of high-performance computing. The high level of abstraction in the language extension is intended to accelerate the adoption of QC by researchers familiar with classical high-performance computing.
This work addresses how to naturally adopt the l2-norm cosine similarity in a neuromemristive system and studies its unsupervised learning performance on handwritten digit image recognition. The proposed architecture is a two-layer fully connected neural network with a hard winner-take-all (WTA) learning module. For the input layer, we propose a single-spike temporal code that transforms input stimuli into a set of single spikes with different latencies and voltage levels. For the synapse model, we employ a compound memristor in which stochastically switching binary-state memristors are connected in parallel, offering a reliable and scalable multi-state solution for synaptic weight storage. A hardware-friendly synaptic adaptation mechanism is proposed to realize spike-timing-dependent plasticity learning. Input spikes are sent through these memristive synapses to every integrate-and-fire neuron in the fully connected output layer, where the hard WTA network motif introduces competition based on cosine similarity for the given input stimuli. Finally, we present 92.64% accuracy on unsupervised digit recognition with only single-epoch MNIST training via high-level simulations, including extensive analysis of the impact of system parameters.
In this paper we consider graph algorithms and graphical analysis as a new application for neuromorphic computing platforms. We demonstrate how the nonlinear dynamics of spiking neurons can be used to implement low-level graph operations. Our results are hardware agnostic, and we present multiple versions of routines that can utilize static synapses or require synapse plasticity.
In the heterogeneous computing execution model, one or more general-purpose processors are accelerated using one or more co-processors. In this model, general-purpose CPUs are generally assigned portions of the software that either do not map well to the available co-processor microarchitectures or whose low execution time does not warrant the extra effort required to adapt the code to the co-processor’s programming model. The co-processors, on the other hand, are assigned the most computationally expensive portions of the software, and this code is adapted to the co-processor’s specialized programming model. For legacy code to take advantage of a heterogeneous computer, a programmer must partition the code, selecting which portions to map to the co-processor. The selection criteria typically involve finding the application’s most expensive computations, or kernels. However, this methodology considers only execution time while ignoring memory behavior. In heterogeneous systems where the general-purpose processor and co-processor have disjoint memory spaces, exchanging data between processors incurs a penalty, so it is important to minimize communication cost. We refer to this category of heterogeneous systems as “Disjoint Memory Co-Processor Accelerated Computing (DiMCAC).” The partitioning procedure is typically performed in an ad hoc manner due to the lack of automation tools designed for this task. The tools that do exist are not specifically designed for the DiMCAC model, or require manual analysis that can take considerable time when the programmer is not familiar with the application. To address this issue, this research presents a Partitioning Analysis Tool for Heterogeneous Systems (PATHS).
PATHS is an analysis toolchain that performs a fully automated behavioral analysis of applications to be partitioned for execution on computing platforms that correspond to the DiMCAC model. Making effective partitioning decisions may require optimizing against competing constraints. In this dissertation, we describe new instrumentation, measurement, presentation, and selection components implemented in PATHS to support a systematic partitioning methodology. PATHS’ primary contributions are the development of (1) a novel methodology for instrumentation and runtime data collection that monitors execution time and data movement at the loop level, (2) an objective function for determining the fitness of an arbitrary set of assignments, and (3) a heuristic search technique for finding an effective solution. In an experimental evaluation of five computationally intensive applications, PATHS provides a top-ranked candidate accelerator for each application with a fitness evaluation higher than that of candidate accelerators selected from application profiles produced by GNU gprof.
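The trade-off at the heart of such an objective function, gain from offloading hot loops versus the cost of moving data across disjoint memories, can be illustrated with a toy fitness function. The function shape, the assumed uniform speedup, the transfer-cost constant, and the kernel records below are all hypothetical; they are not PATHS' actual objective.

```python
def fitness(kernels, selected, transfer_cost_per_byte=1e-9, speedup=10.0):
    """Hypothetical partitioning objective in the spirit described above:
    reward CPU time saved by offloading the selected loops, penalize the
    host<->co-processor data movement those loops would require."""
    time_saved = sum(k["cpu_time_s"] * (1.0 - 1.0 / speedup)
                     for k in kernels if k["name"] in selected)
    comm_cost = sum(k["bytes_moved"] * transfer_cost_per_byte
                    for k in kernels if k["name"] in selected)
    return time_saved - comm_cost

# Two invented loop profiles: A is compute-heavy, B is transfer-heavy.
kernels = [
    {"name": "A", "cpu_time_s": 10.0, "bytes_moved": 1.0e9},
    {"name": "B", "cpu_time_s": 1.0,  "bytes_moved": 5.0e10},
]
f_a = fitness(kernels, {"A"})  # large time saved, modest transfer cost
f_b = fitness(kernels, {"B"})  # transfer cost swamps the small gain
```

A time-only profiler would still rank B as a candidate if it were the hottest loop; folding data movement into the objective is what lets a tool reject such transfer-bound kernels.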