Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that must be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is vector-length agnostic (VLA). VLA enables the generation of binaries that run regardless of the physical vector register length. In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, which are ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations such as loop unrolling, loop fusion, load trading, and data reuse. Our detailed simulations using vector lengths ranging from 128 to 2048 bits show that these optimizations can yield performance improvements over straightforward vectorized code of up to 1.57×. In addition, we show that certain optimizations can hurt performance due to reduced arithmetic intensity and instruction overheads, and we provide insights useful for compiler optimizers.
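To make the vector-length-agnostic idea concrete, here is a minimal sketch of a VLA 3-point stencil using the Arm C Language Extensions for SVE; the coefficients and array names are illustrative, not taken from the paper.

```c
#include <arm_sve.h>
#include <stdint.h>

/* b[i] = c0*a[i-1] + c1*a[i] + c2*a[i+1], for i in [1, n-1).
 * The loop never hard-codes a vector width: svcntd() reports how many
 * 64-bit lanes the hardware provides, so the same binary runs on any
 * SVE implementation from 128 to 2048 bits. */
void stencil3(double *b, const double *a, int64_t n,
              double c0, double c1, double c2) {
    for (int64_t i = 1; i < n - 1; i += svcntd()) {
        svbool_t pg = svwhilelt_b64_s64(i, n - 1);  /* predicate masks the tail */
        svfloat64_t left   = svld1_f64(pg, &a[i - 1]);
        svfloat64_t center = svld1_f64(pg, &a[i]);
        svfloat64_t right  = svld1_f64(pg, &a[i + 1]);
        svfloat64_t acc = svmul_n_f64_x(pg, left, c0);
        acc = svmla_n_f64_x(pg, acc, center, c1);   /* acc += center * c1 */
        acc = svmla_n_f64_x(pg, acc, right, c2);
        svst1_f64(pg, &b[i], acc);
    }
}
```

Because the predicate handles the loop tail, this single source also illustrates why optimizations such as unrolling deploy easily: no scalar epilogue needs to be rewritten per vector length.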
Dynamic Tolerance Region Computing for Multimedia
Alvarez Martinez, Carlos; Corbal San Adrian, Jesus; Valero Cortes, Mateo
IEEE Transactions on Computers, 05/2012, Volume 61, Issue 5
Journal Article, Peer reviewed
Multimedia computation has clearly become a primary and demanding application segment for new architectures targeted at portable devices. The main challenge for such architectures is to keep pace with the computational requirements of ever-evolving media standards and applications while satisfying the power and energy budgets required by smaller form factors and longer battery lifetimes. One technique aimed at reducing both the energy consumption and the execution time of an application is reuse. This technique memoizes the outcome of an instruction or set of instructions so that it can be reused the next time the same operation is performed with the same inputs. In this paper, we analyze a region reuse scheme focused on multimedia applications. While the technique appears to be, in theory, a promising vehicle for improving both timing and energy in low-end media applications, we show that the extra hardware cost becomes a severe shortcoming: we face the undesirable situation of consuming more energy in order to reduce execution time (hence a poor power-oriented solution). To mitigate the overhead of the reuse hardware, we advocate exploiting a third variable in the power-time trade-off and evaluate tolerant region reuse, a technique that relies on the tolerance in the output precision of media algorithms to improve reuse. With this technique, we can afford less power-hungry hardware structures, which yields benefits in both energy and timing. As a trade-off, tolerant region reuse introduces non-noticeable errors in the output data. The main drawbacks of tolerant region reuse are its strong reliance on application profiling, the need for careful tuning by the application developer, and its inability to adapt to the variability of the media content used as input. To address that inflexibility, we introduce dynamic tolerant region reuse. This novel technique overcomes the drawbacks of tolerant region reuse by allowing the hardware to monitor the precision quality of the region reuse output. Our mechanism allows the programmer to guarantee a minimum signal-to-noise ratio (SNR) threshold while letting the technique adapt to the characteristics of the specific application and workload to minimize time and energy consumption. This leads to greater energy-delay savings while keeping output error below noticeable levels and avoiding the need for profiling. We study our mechanism applied to a set of three different processors, from low to high end. As we will show, our technique leads to consistent performance improvements in all of our benchmark programs while reducing energy consumption. We report savings of up to 30 percent in the energy-delay product for all three processors.
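As a rough illustration of the idea behind tolerant region reuse (not the paper's hardware design), the sketch below memoizes a function of quantized inputs: values within the same tolerance bucket hit the same table entry, trading a bounded output error for more reuse. The table size, hash, and tolerance are illustrative assumptions.

```c
#include <math.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024   /* illustrative; a real design sizes this to an area budget */
#define TOLERANCE  0.01   /* inputs closer than this may share a result */

typedef struct { int32_t key; double result; bool valid; } ReuseEntry;
static ReuseEntry table[TABLE_SIZE];

/* Quantize the input so that nearby values map to the same key:
 * this is what makes the reuse "tolerant" rather than exact. */
static int32_t quantize(double x) { return (int32_t)lround(x / TOLERANCE); }

double tolerant_reuse(double x, double (*expensive_fn)(double)) {
    int32_t key = quantize(x);
    uint32_t slot = (uint32_t)key % TABLE_SIZE;
    if (table[slot].valid && table[slot].key == key)
        return table[slot].result;              /* reuse hit: skip computation */
    double r = expensive_fn(x);                 /* reuse miss: compute and fill */
    table[slot] = (ReuseEntry){ key, r, true };
    return r;
}
```

Widening TOLERANCE raises the hit rate (more time and energy saved) at the cost of more output noise, which is exactly the SNR trade-off the dynamic scheme tunes at runtime.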
To reach its potential, the Internet of Things (IoT) must break down the silos that limit applications' interoperability and hinder their manageability. Doing so leads to the building of ultra-large-scale systems (ULSS) in several areas, including autonomous vehicles, smart cities, and smart grids. The scope of ULSS is both large and complex. Thus, the authors propose Hierarchical Emergent Behaviors (HEB), a paradigm that builds on the concepts of emergent behavior and hierarchical organization. Rather than explicitly programming all possible decisions in the vast space of ULSS scenarios, HEB relies on the emergent behaviors induced by local rules at each level of the hierarchy. The authors discuss the modifications to classical IoT architectures required by HEB, as well as the new challenges. They also illustrate the HEB concepts in reference to autonomous vehicles. This use case paves the way to the discussion of new lines of research.
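The core HEB idea, local rules inducing global behavior, can be sketched with a toy example (our illustration, not from the article): each vehicle applies only a local spacing rule relative to the vehicle directly ahead, and a stable platoon emerges without any global controller. All constants are hypothetical.

```c
#include <stdio.h>

#define N          5
#define TARGET_GAP 10.0  /* desired spacing in meters (illustrative) */
#define K_GAP      0.5   /* spring-like correction toward the target gap */
#define K_VEL      1.0   /* damping toward the predecessor's velocity */
#define DT         0.1   /* integration time step in seconds */

/* Local rule only: each follower reacts to the vehicle directly ahead.
 * No global controller exists, yet uniform platoon spacing emerges. */
static void step(double pos[], double vel[]) {
    for (int i = 1; i < N; i++) {
        double gap = pos[i - 1] - pos[i];
        vel[i] += (K_GAP * (gap - TARGET_GAP) +
                   K_VEL * (vel[i - 1] - vel[i])) * DT;
    }
    for (int i = 0; i < N; i++) pos[i] += vel[i] * DT;
}

int main(void) {
    double pos[N] = { 100, 70, 50, 20, 0 };  /* ragged initial spacing */
    double vel[N] = { 20, 20, 20, 20, 20 };  /* leader cruises at 20 m/s */
    for (int t = 0; t < 5000; t++) step(pos, vel);
    for (int i = 1; i < N; i++)
        printf("gap %d: %.2f m\n", i, pos[i - 1] - pos[i]);  /* -> ~10 m */
    return 0;
}
```

In HEB terms, a platoon formed this way would itself become a unit governed by the next level of rules in the hierarchy.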
Moore's Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is arriving at an impasse. Optimizing the usage of the available transistors within the thermal dissipation capabilities of the packaging is a pending topic. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism, operating on several elements per instruction and thus reducing pressure on the fetch and decode pipeline stages. In this paper, we perform an analysis of different resource optimization strategies for vector architectures. In particular, we expose the need for separate voltage and frequency domains for the LLC, ALUs, and vector ALUs if we aim to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts vector register length at runtime.
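The closing point, adapting vector register length at runtime, can be pictured as a simple control loop. The heuristic below is a hypothetical sketch of ours, not the paper's mechanism: shorten the vectors when vector-ALU utilization is low to save energy, lengthen them when utilization is high.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconfiguration heuristic: pick the vector length for the
 * next interval from the vector-ALU utilization observed in the last one.
 * The thresholds and the set of legal lengths are illustrative assumptions. */
static const uint32_t lengths[] = { 128, 256, 512, 1024, 2048 }; /* bits */
#define N_LENGTHS (sizeof lengths / sizeof lengths[0])

size_t next_length_index(size_t cur_idx, double valu_utilization) {
    if (valu_utilization > 0.85 && cur_idx + 1 < N_LENGTHS)
        return cur_idx + 1;   /* vector units saturated: widen registers */
    if (valu_utilization < 0.30 && cur_idx > 0)
        return cur_idx - 1;   /* vectors mostly idle: shrink to save energy */
    return cur_idx;           /* hysteresis band: keep current configuration */
}
```

The hysteresis band between the two thresholds avoids oscillating between configurations on noisy utilization samples.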
Enabling preemptive multiprogramming on GPUs
Tanasic, Ivan; Gelado, Isaac; Cabezas, Javier; et al.
2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), June 2014
Conference Proceeding, Open access
GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
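The scheduling policy described, distributing GPU cores among concurrent kernels according to priority, can be sketched as a proportional-share partition. This is our simplified illustration, not the paper's exact policy; the SM count and priorities are made up.

```c
#include <stdio.h>

/* Distribute a fixed pool of streaming multiprocessors (SMs) among
 * concurrently running kernels in proportion to their priorities.
 * Every runnable kernel gets at least one SM so it can make progress
 * (a real scheduler would also handle the oversubscribed case). */
void partition_sms(int num_sms, const int priority[], int assigned[], int n) {
    int total = 0, used = 0;
    for (int i = 0; i < n; i++) total += priority[i];
    for (int i = 0; i < n; i++) {
        assigned[i] = (num_sms * priority[i]) / total;
        if (assigned[i] == 0) assigned[i] = 1;   /* avoid starvation */
        used += assigned[i];
    }
    /* Give any leftover SMs (from integer division) to the highest-priority kernel. */
    int best = 0;
    for (int i = 1; i < n; i++) if (priority[i] > priority[best]) best = i;
    if (used < num_sms) assigned[best] += num_sms - used;
}

int main(void) {
    int prio[3] = { 8, 2, 1 }, out[3];
    partition_sms(14, prio, out, 3);   /* 14 SMs, roughly a GK110-class part */
    for (int i = 0; i < 3; i++) printf("kernel %d -> %d SMs\n", i, out[i]);
    return 0;
}
```

Preemption is what makes such repartitioning possible at all: without it, a newly arrived high-priority kernel would wait for running kernels to drain.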
As performance and energy efficiency have become the main challenges for next-generation high-performance computing, asymmetric multi-core architectures can provide solutions to tackle these issues. Parallel programming models need to suit the needs of such systems while continuing to improve applications' portability and efficiency. This paper proposes two task scheduling approaches that target asymmetric systems. These dynamic scheduling policies reduce total execution time either by detecting the longest or critical path of the application's dynamic task dependency graph, or by finding the earliest executor of a task. They use dynamic scheduling and information discoverable during execution, which makes them implementable and functional without the need for off-line profiling. In our evaluation, we compare these scheduling approaches with two existing state-of-the-art heterogeneous schedulers and track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45× on a real 8-core asymmetric system and by up to 2.1× on a simulated 32-core asymmetric chip.
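As a rough sketch of the "earliest executor" idea (our illustration under assumed big/little cost estimates, not the authors' implementation), a task is placed on whichever core finishes it first, accounting for each core's speed and current backlog:

```c
/* Pick the core that would finish the task earliest: a fast core that is
 * busy may still lose to a slow core that is idle. Costs are illustrative. */
typedef struct {
    double ready_time;   /* when the core's queue drains */
    double speed;        /* work units per second; big cores > little cores */
} Core;

int earliest_executor(const Core cores[], int n_cores,
                      double task_work, double now) {
    int best = 0;
    double best_finish = 1e300;
    for (int i = 0; i < n_cores; i++) {
        double start = cores[i].ready_time > now ? cores[i].ready_time : now;
        double finish = start + task_work / cores[i].speed;
        if (finish < best_finish) { best_finish = finish; best = i; }
    }
    return best;   /* index of the core predicted to finish first */
}
```

The critical-path variant differs in what it prioritizes rather than where it places work: tasks on the longest dependency chain are steered to the fast cores first.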
•We simulate complex multi-physics problems on a large supercomputer.
•The selected problems come from the engineering world.
•We show scalability results for implicit and explicit schemes up to 100K cores.
•We analyze convergence properties for the implicit solver.
•The same code, Alya, is used to solve all of them.
Alya is a multi-physics simulation code developed at Barcelona Supercomputing Center (BSC). From its inception, Alya has been designed using advanced High Performance Computing programming techniques to solve coupled problems on supercomputers efficiently. The target domain is engineering, with all its particular features: complex geometries and unstructured meshes, coupled multi-physics with exotic coupling schemes and physical models, ill-posed problems, the flexibility needed to rapidly include new models, etc. Since its beginnings in 2004, Alya has scaled well to an increasing number of processors when solving single-physics problems such as fluid mechanics, solid mechanics, acoustics, etc. Over time, we have made a concerted effort to maintain and even improve scalability for multi-physics problems. This poses challenges on multiple fronts, including numerical models, parallel implementation, physical coupling models, algorithms and solution schemes, and the meshing process. In this paper, we introduce Alya's main features and focus particularly on its solvers. We present Alya's performance up to 100,000 processors on Blue Waters, the NCSA supercomputer, with selected multi-physics tests that are representative of the engineering world. The tests are incompressible flow in a human respiratory system, a low-Mach combustion problem in a kiln furnace, and coupled electro-mechanical contraction of the heart. We show scalability plots for all cases and discuss all aspects of such simulations, including solver convergence.
A Hardware Runtime for Task-Based Programming Models
Tan, Xubin; Bosch, Jaume; Alvarez, Carlos; et al.
IEEE Transactions on Parallel and Distributed Systems, 2019-09-01, Volume 30, Issue 9
Journal Article, Peer reviewed, Open access
Task-based programming models such as OpenMP 5.0 and OmpSs are simple to use and powerful enough to exploit the task parallelism of applications on multicore, manycore, and heterogeneous systems. However, their software-only runtimes introduce significant overhead when targeting fine-grained tasks, resulting in performance losses. To overcome this drawback, we present Picos++, a hardware runtime that accelerates critical runtime functions such as task dependence analysis, nested task support, and heterogeneous task scheduling. As a proof of concept, the Picos++ hardware runtime has been integrated with a compiler infrastructure that supports parallel task-based programming models. An FPGA SoC running Linux has been used to implement the hardware-accelerated part of Picos++, integrated with a heterogeneous system composed of 4 symmetric multiprocessor (SMP) cores and several hardware functional accelerators (HwAccs) for task execution. Results show significant improvements in energy and performance compared to state-of-the-art parallel software-only runtimes. With Picos++, applications can achieve up to 7.6x speedup and save up to 90 percent of energy when using 4 threads and up to 4 HwAccs, and even reach a speedup of 16x over the software alternative when using 12 HwAccs and small tasks.
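To give a feel for what task dependence analysis does (a minimal software sketch of the general technique, not the Picos++ hardware design), the code below makes each new task depend on the last writer of every address it reads or writes:

```c
#define MAX_TRACKED 64   /* illustrative capacity for in-flight addresses */

/* Map each memory address to the id of the task that last wrote it.
 * A task that later reads or writes that address must wait for that writer:
 * this is the read-after-write / write-after-write matching a task runtime
 * (in Picos++'s case, dedicated hardware) performs on task submission. */
typedef struct { void *addr; int last_writer; } DepEntry;
static DepEntry deps[MAX_TRACKED];
static int n_deps = 0;

/* Returns the id of the task this access depends on, or -1 if none.
 * If is_write is set, records the new task as the address's last writer. */
int register_access(void *addr, int task_id, int is_write) {
    for (int i = 0; i < n_deps; i++) {
        if (deps[i].addr == addr) {
            int predecessor = deps[i].last_writer;
            if (is_write) deps[i].last_writer = task_id;
            return predecessor;
        }
    }
    if (is_write && n_deps < MAX_TRACKED)
        deps[n_deps++] = (DepEntry){ addr, task_id };
    return -1;   /* no prior writer: the task is immediately ready */
}
```

In software this lookup sits on the critical path of every task spawn, which is exactly why fine-grained tasks lose performance and why moving it into hardware pays off.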
Cache-coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. The flat memory address space they offer also considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel execution, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may best be constrained in a large, real ccNUMA platform comprising 288 cores through a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol, combined with runtime-managed NUMA-aware scheduling and data allocation techniques that make the most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 3.14× to 9.97× and coherence traffic reductions of up to 99 percent in comparison to NUMA-oblivious scheduling and data allocation.
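NUMA-aware data allocation of the kind such a runtime performs can be sketched with the standard Linux libnuma API (a generic example of the technique; the platform and runtime in the paper are their own):

```c
#include <numa.h>      /* link with -lnuma; call numa_available() first */
#include <stddef.h>

/* Place each worker's slice of the data on the NUMA node where the worker
 * runs, so its accesses stay local and private data generates no
 * cross-node coherence traffic. */
double *alloc_local_slice(size_t n_elems, int node) {
    return numa_alloc_onnode(n_elems * sizeof(double), node);
}

void pin_thread_to_node(int node) {
    struct bitmask *bm = numa_allocate_nodemask();
    numa_bitmask_setbit(bm, node);
    numa_bind(bm);                 /* restrict this thread's CPU and memory to node */
    numa_free_nodemask(bm);
}
```

Pairing allocation with scheduling is the point of the joint approach: local placement only helps if the scheduler keeps the consuming task on the same node.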
Big data processing is becoming a reality in numerous real-world applications. With the emergence of new data-intensive technologies and increasing amounts of data, new computing concepts are needed. The integration of big-data-producing technologies, such as wireless sensor networks, the Internet of Things, and cloud computing, into cyber-physical systems is reducing the time available to find appropriate solutions. This paper presents one possible solution for the coming exascale big data processing: a data flow computing concept. The performance of data flow systems that process big data should not be measured with the measures defined for the prevailing control flow systems. A new benchmarking methodology is proposed, which integrates the performance issues of speed, area, and power needed to execute the task. Computer rankings would look different if the new benchmarking methodologies were used; data flow systems would outperform control flow systems. This statement is backed by recent results from implementations of specialized algorithms and applications in data flow systems, which show considerable speedups, space savings, and power reductions relative to implementations of the same in control flow computers. In our view, the next step in the development of data flow computing should be a move from specialized to more general algorithms and applications.
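One way to read the proposed methodology, a benchmark score that folds speed, area, and power into a single figure of merit, can be sketched as follows; the expression is our hypothetical illustration, as the article does not specify an exact formula.

```c
/* Hypothetical combined figure of merit: throughput per unit of silicon
 * area and power. Higher is better. Under such a metric, a data flow
 * machine with a modest clock but high throughput per watt and per mm^2
 * can outscore a control flow machine that only optimizes raw speed. */
double figure_of_merit(double ops_per_sec, double area_mm2, double power_w) {
    return ops_per_sec / (area_mm2 * power_w);
}
```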