HPC@Green IT Gruber, Ralf; Strohmaier, Erich; Keller, Vincent
2010
eBook
The authors of this book offer guidelines on how to improve existing applications in a company with the goal of reducing computer energy consumption. Examples of how to optimize algorithms on single-node and parallel RISC architectures are discussed.
The memory wall between the peak performance of microprocessors and their memory performance has become the prominent performance bottleneck for many scientific application codes. New benchmarks measuring data access speeds locally and globally in a variety of different ways are needed to explore the ever-increasing diversity of architectures for high-performance computing. In this paper, we introduce a novel benchmark, APEX-Map, which focuses on global data movement and measures how fast global data can be fed into computational units. APEX-Map is a parameterized, synthetic performance probe and integrates concepts for temporal and spatial locality into its design. Our first parallel implementation in MPI and various results obtained with it are discussed in detail. By measuring the APEX-Map performance with parameter sweeps for a whole range of temporal and spatial localities, performance surfaces can be generated. These surfaces are ideally suited to study the characteristics of the computational platforms and are useful for performance comparison. Results on a global-memory vector platform and distributed-memory superscalar platforms clearly reflect the design differences between these architectures. Published in 2007 by John Wiley & Sons, Ltd.
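The abstract describes APEX-Map as a synthetic probe whose accesses are steered by temporal- and spatial-locality parameters. The following minimal sketch, written in Python rather than the original MPI/C implementation, only illustrates that idea: start indices are drawn from a power-law distribution controlled by a temporal-locality parameter alpha, and each access touches a contiguous block of length L. The parameter names, the index formula, and the sizes are assumptions, not the benchmark's actual design.

```python
# Minimal sketch of an APEX-Map-style synthetic memory probe (not the original
# APEX-Map code). alpha controls temporal locality, block_len spatial locality.
import random
import time

def run_probe(mem_size=1 << 22, n_accesses=1 << 20, alpha=0.5, block_len=16):
    """Return achieved access rate (elements/s) for one (alpha, L) point."""
    data = [1.0] * mem_size
    checksum = 0.0
    start = time.perf_counter()
    for _ in range(n_accesses // block_len):
        # Power-law distributed start index: small alpha -> accesses cluster
        # near the front of the array (high temporal locality / heavy reuse).
        idx = int((mem_size - block_len) * random.random() ** (1.0 / alpha))
        for j in range(block_len):          # contiguous block -> spatial locality
            checksum += data[idx + j]
    elapsed = time.perf_counter() - start
    return n_accesses / elapsed, checksum

if __name__ == "__main__":
    # Sweeping alpha and block_len over a grid would yield the kind of
    # "performance surface" described in the abstract above.
    for a in (0.01, 0.1, 1.0):
        rate, _ = run_probe(alpha=a)
        print(f"alpha={a:5.2f}  rate={rate:,.0f} accesses/s")
```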
In a recent paper, Gordon Bell and Jim Gray (2002) put forth a view of the past, present, and future of high-performance computing (HPC) that is both insightful and thought provoking. Identifying key trends with a grace and candor rarely encountered in a single work, the authors describe an evolutionary past drawn from their vast experience and project an enticing and compelling vision of HPC's future. Yet, the underlying assumptions implicit in their treatment, particularly those related to terminology and dominant trends, conflict with our own experience, common practices, and shared view of HPC's future directions. Taken from our vantage points of the Top500 list, the Lawrence Berkeley National Laboratory NERSC computer center, Beowulf-class computing, and research in petaflops-scale computing architectures, we offer an alternate perspective on several key issues in the form of a constructive counterpoint. One objective of this article is to restore the strength and value of the term "cluster" by narrowing its applicability to a restricted subset of parallel computers. We'll further consider this class in conjunction with its complementary terms constellation, Beowulf class, and massively parallel processing systems (MPPs), based on the classification used by the Top500 list, which has tracked the HPC field for more than a decade.
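The article's terminological point, that cluster, constellation, and MPP should name distinct classes, can be made concrete with a toy classifier. This is only a hedged illustration of the Top500-style rule of thumb (a constellation has more processors inside a node than it has nodes); the cluster/MPP split via a commodity_interconnect flag is a simplification introduced here, not the authors' definition.

```python
# Hedged, informal sketch of the Top500-style taxonomy discussed above.
def classify(nodes: int, procs_per_node: int, commodity_interconnect: bool) -> str:
    if procs_per_node > nodes:
        return "constellation"
    return "cluster (Beowulf class)" if commodity_interconnect else "MPP"

print(classify(nodes=1024, procs_per_node=2, commodity_interconnect=True))   # cluster
print(classify(nodes=64, procs_per_node=128, commodity_interconnect=False))  # constellation
```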
Automating the workload characterization process is increasingly important in hardware design. Although compiler tools can automatically collect profiling data and predict performance behaviors, the process has to be repeated for each potential design. This challenge is exacerbated by the fast-growing body of applications and input problems. We propose an alternative approach based on code similarity learning. The application is decomposed into small kernels that can be mapped to known patterns, so the behavior of a pattern measured on a hardware setup can be reused. To enable this technology, we propose a new code representation and similarity metric, and we automate the detection process using compiler and machine learning (ML) methods. Specifically, we reformulate an application's dataflow graphs so that they can be compared based on both compute and data movement. We show that this representation can distinguish kernels in the HPCG benchmark and help suggest optimal configurations for SpMV and GEMM hardware accelerators.
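A rough sketch of the underlying idea of comparing kernels by both compute and data movement is shown below. It is not the paper's representation or metric: kernels are summarized here as simple counts of compute ops and data-movement ops, and compared with cosine similarity. The feature names are hypothetical.

```python
# Illustrative kernel-similarity sketch (not the paper's actual method).
from collections import Counter
from math import sqrt

def feature_vector(ops):
    """ops: list of (category, opcode) pairs, e.g. ('compute', 'fma')."""
    return Counter(f"{cat}:{op}" for cat, op in ops)

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical op summaries for two kernels.
spmv = feature_vector([("compute", "mul"), ("compute", "add"),
                       ("data", "gather"), ("data", "stream_load")])
gemm = feature_vector([("compute", "fma"), ("compute", "fma"),
                       ("data", "tile_load"), ("data", "stream_load")])
print(f"SpMV vs GEMM similarity: {cosine_similarity(spmv, gemm):.2f}")
```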
Characterizing a memory reference stream using a reuse distance distribution enables predicting performance on a given architecture. Benchmarks can subject an architecture to a limited set of reuse distance distributions, but they cannot exhaustively test it. In contrast, Apex-Map, a synthetic memory probe with parameterized locality, can provide better coverage of the machine's use scenarios. Unfortunately, considerable expertise is required to relate an application's memory behavior to an Apex-Map parameter set. In this work we present a mathematical formulation that describes the relation between Apex-Map and reuse distance distributions. We also introduce a process through which we can automate the estimation of Apex-Map locality parameters for a given application. This process finds the best parameters for Apex-Map probes that generate a reuse distance distribution similar to that of the original application. We tested this scheme on benchmarks from Scalable Synthetic Compact Applications and Unbalanced Tree Search, and we show that it provides an accurate Apex-Map parameterization with a small percentage of mismatch in reuse distance distributions: about 3% on average and less than 8% in the worst case on the tested applications.
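The fitting idea described above can be sketched as follows. This is a simplified stand-in, not the paper's mathematical formulation: measure the reuse distance distribution of a trace, generate candidate Apex-Map-style traces over a small grid of (alpha, L) values, and keep the parameters whose distribution mismatch is smallest. The trace generator, grid, and mismatch metric (total variation distance) are assumptions made for the example.

```python
# Sketch of fitting Apex-Map-style locality parameters to a reference trace.
import random
from collections import Counter

def reuse_distances(trace):
    """Reuse distance = number of distinct addresses since the last access.
    Naive O(n^2) version, only for small illustrative traces."""
    last_pos, dists = {}, []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            dists.append(len(set(trace[last_pos[addr] + 1:pos])))
        last_pos[addr] = pos
    return Counter(dists)

def histogram_mismatch(h1, h2):
    """Fraction of probability mass that differs between two distributions."""
    n1, n2 = sum(h1.values()), sum(h2.values())
    return 0.5 * sum(abs(h1[k] / n1 - h2[k] / n2) for k in h1.keys() | h2.keys())

def apex_map_trace(alpha, block_len, mem=512, n=1500):
    trace = []
    while len(trace) < n:
        start = int((mem - block_len) * random.random() ** (1.0 / alpha))
        trace.extend(range(start, start + block_len))
    return trace

def fit(app_trace, alphas=(0.05, 0.2, 0.5, 1.0), blocks=(1, 4, 16)):
    target = reuse_distances(app_trace)
    return min(((a, L) for a in alphas for L in blocks),
               key=lambda p: histogram_mismatch(
                   target, reuse_distances(apex_map_trace(*p))))

# Self-test: fit against a trace generated with known parameters.
print("best (alpha, L):", fit(apex_map_trace(0.2, 4)))
```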
In this paper we analyze major recent trends and changes in the High Performance Computing (HPC) market place. The introduction of vector computers started the era of 'Supercomputing'. The initial success of vector computers in the seventies was driven by raw performance. Massively parallel systems (MPPs) became successful in the early nineties due to their better price/performance ratios, which was enabled by the attack of the 'killer micros'. The success of microprocessor-based systems built on the shared memory concept (referred to as symmetric multiprocessors, SMPs), even for the very high-end systems, was the basis for the emerging cluster concepts in the early 2000s. Within the first half of this decade, clusters of PCs and workstations have become the prevalent architecture for many HPC application areas across all ranges of performance. However, the Earth Simulator vector system demonstrated that many scientific applications could benefit greatly from other computer architectures. At the same time there is renewed broad interest in the scientific HPC community in new hardware architectures and new programming paradigms. The IBM BlueGene/L system is one early example of a shifting design focus for large-scale systems. The DARPA HPCS program has the declared goal of building a Petaflops computer system by the end of the decade using novel computer architectures.
Scientific workflows are increasingly transferring large amounts of data between high performance computing (HPC) systems. Even though these HPC systems are connected via high-speed dedicated networks and use dedicated data transfer nodes (DTNs), it is still difficult to predict the data transfer throughput because of variations in data transfer protocols, host configurations, performance of file systems, and overlapping workloads. In order to provide reliable performance prediction for better resource management and job scheduling, we need models for predicting data transfer throughput under real-world conditions. In this paper, we explore different machine learning approaches for building data-driven models that predict large-scale data transfer throughput. In addition to the variables already collected by the network monitoring system, we develop heuristics to derive additional metrics that improve prediction accuracy. We use the prediction results to identify the importance of different network parameters in predicting the throughput for large-scale data transfers. Through extensive tests, we identify key network parameters, discover interesting variations among different HPC sites, and show that we can predict throughput with high accuracy. We also analyze our models and results to provide recommendations for improving the performance of big data transfers.
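The modeling approach described above can be illustrated with a short sketch, which is not the paper's pipeline: train a tree-ensemble regressor on transfer records and inspect feature importances to see which network parameters matter most. The feature names and the synthetic data generator are assumptions made purely so the example runs end to end; no real transfer data is implied.

```python
# Illustrative throughput-prediction sketch with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
feature_names = ["file_size_gb", "rtt_ms", "parallel_streams", "concurrent_load"]
X = np.column_stack([
    rng.uniform(1, 100, n),        # file_size_gb
    rng.uniform(0.1, 50, n),       # rtt_ms
    rng.integers(1, 16, n),        # parallel_streams
    rng.uniform(0, 1, n),          # concurrent_load (a derived heuristic metric)
])
# Synthetic "throughput" target with noise, only so the example is runnable.
y = 8 * X[:, 2] / (1 + 0.05 * X[:, 1]) + 0.02 * X[:, 0] + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out transfers:", round(model.score(X_te, y_te), 3))
for name, imp in zip(feature_names, model.feature_importances_):
    print(f"{name:17s} importance = {imp:.2f}")
```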
High energy colliders are essential to study the inner structure of nuclear and elementary particles. A parallel particle simulation code, BeamBeam3D, has been developed and actively used to model the beam dynamics and to optimize the performance of these colliders. In this paper, we analyze the performance characteristics of BeamBeam3D on several considerably different leading high performance computing architectures that are being employed either as production or research platforms. We examine different performance questions such as: how the workload should be partitioned among the processors to effectively use the computing resources; whether these platforms exhibit similar performance bottlenecks and how to address them; whether some platforms perform substantially better than others; and finally, what are the implications of BeamBeam3D for the design of the next generation supercomputer architectures.
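One of the questions raised above, how the workload should be partitioned among the processors, can be illustrated with a minimal sketch. This is not BeamBeam3D's actual decomposition scheme; it only shows an even split of macro-particles across ranks under the assumption, made here, that per-particle cost is uniform.

```python
# Hedged illustration of even workload partitioning across ranks.
def partition_particles(n_particles: int, n_ranks: int):
    """Return (start, end) index ranges, one per rank, with sizes differing by at most 1."""
    base, extra = divmod(n_particles, n_ranks)
    ranges, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: 10 million macro-particles over 6 ranks.
for rank, (lo, hi) in enumerate(partition_particles(10_000_000, 6)):
    print(f"rank {rank}: particles [{lo}, {hi})  count={hi - lo}")
```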