Firedrake: Automating the Finite Element Method by Composing Abstractions Rathgeber, Florian; Ham, David A.; Mitchell, Lawrence ...
ACM Transactions on Mathematical Software,
01/2017, Volume 43, Issue 3
Journal Article
Peer reviewed
Open access
Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centered on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns that eases the incorporation of separate contributions from computer scientists, numerical analysts, and application specialists. These contributions may add functionality or improve performance.
Firedrake benefits from automatically applying new optimizations. This includes factorizing mixed function spaces, transforming and vectorizing inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside of pure variational formulations, such as flux limiters.
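As a concrete illustration of the kind of non-variational, pointwise operation mentioned above, the following is a minimal C sketch of a minmod slope limiter applied to cell averages. It is illustrative only: it is not Firedrake's Python API, and the 1-D array of cell averages is a stand-in for Firedrake's own data structures.

/* Illustrative only: a 1-D minmod slope limiter over cell averages,
 * the kind of non-variational, pointwise/stencil operation referred to
 * in the abstract above. Plain C, not Firedrake's escape-hatch API. */
#include <stdio.h>
#include <math.h>

static double minmod(double a, double b) {
    /* Return the argument of smaller magnitude if signs agree, else 0. */
    if (a * b <= 0.0) return 0.0;
    return (fabs(a) < fabs(b)) ? a : b;
}

int main(void) {
    enum { N = 8 };
    double u[N] = {0, 0, 1, 3, 2, 2, 0, 0};  /* cell averages */
    double slope[N] = {0};

    /* Limited slope per interior cell: minmod of the one-sided differences. */
    for (int i = 1; i < N - 1; ++i)
        slope[i] = minmod(u[i] - u[i - 1], u[i + 1] - u[i]);

    for (int i = 0; i < N; ++i)
        printf("cell %d: average %.2f, limited slope %.2f\n", i, u[i], slope[i]);
    return 0;
}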
Reliable Actors with Retry Orchestration Tardieu, Olivier; Grove, David; Bercea, Gheorghe-Teodor ...
Proceedings of the ACM on Programming Languages,
06/2023, Volume 7, Issue PLDI
Journal Article
Peer reviewed
Open access
Cloud developers have to build applications that are resilient to failures and interruptions. We advocate for a fault-tolerant programming model for the cloud based on actors, retry orchestration, ...and tail calls. This model builds upon persistent data stores and message queues readily available on the cloud. Retry orchestration not only guarantees that (1) failed actor invocations will be retried but also that (2) completed invocations are never repeated and (3) it preserves a strict happen-before relationship across failures within call stacks. Tail calls can break complex tasks into simple steps to minimize re-execution during recovery. We review key application patterns and failure scenarios. We formalize a process calculus to precisely capture the mechanisms of fault tolerance in this model. We briefly describe our implementation. Using an application inspired by a typical enterprise scenario, we validate the functional correctness of our implementation and assess the impact of fault preparedness and recovery on performance.
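The following toy C sketch illustrates only guarantee (2) above, that completed invocations are never repeated: a completion record is consulted before a retried invocation re-executes. It is not the paper's system; the in-memory table stands in for the persistent data store, and all names are hypothetical.

/* Toy sketch: at-most-once completion via a completion record.
 * An in-memory table stands in for the durable store the model assumes. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_INVOCATIONS 16

typedef struct {
    bool completed;
    int  result;
} CompletionRecord;

static CompletionRecord store[MAX_INVOCATIONS];  /* stand-in for a persistent store */

/* The actor method being invoked; idempotency comes from the record, not from here. */
static int charge_account(int amount) { return amount * 2; }

/* Invoke-with-retry: if a completion record exists, replay its result instead of
 * re-executing; otherwise execute, persist the result, then return it. */
static int invoke(int invocation_id, int amount) {
    if (store[invocation_id].completed)
        return store[invocation_id].result;     /* never repeat completed work */
    int r = charge_account(amount);
    store[invocation_id].result = r;            /* record before acknowledging */
    store[invocation_id].completed = true;
    return r;
}

int main(void) {
    printf("first attempt:   %d\n", invoke(3, 21));  /* executes the method      */
    printf("retried attempt: %d\n", invoke(3, 21));  /* replays the stored result */
    return 0;
}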
We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of three-dimensional high-aspect-ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low-compute-intensity operations, which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10–20 layers as long as the underlying mesh is well ordered. We characterize the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretizations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-specific limit.
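A minimal C sketch of the layout idea described above, not Firedrake's actual numbering code: per base-mesh cell, the indirect lookups are done once, and the column of extruded layers is then walked with a constant per-layer offset, amortizing the indirection over the layers. The map, strides, and sizes are made up for illustration.

/* Sketch: gather base-cell dof indices once (unstructured, indirect),
 * then iterate the extruded layers directly with a constant stride. */
#include <stdio.h>

#define BASE_CELLS    4
#define DOFS_PER_CELL 3
#define LAYERS        10

int main(void) {
    /* Indirection map: dof indices of each base cell at layer 0. */
    int cell_to_dof[BASE_CELLS][DOFS_PER_CELL] = {
        {0, 1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
    };
    int dofs_per_layer = 6;                    /* constant stride between layers */
    double data[6 * LAYERS];
    for (int i = 0; i < 6 * LAYERS; ++i) data[i] = 1.0;

    double total = 0.0;
    for (int c = 0; c < BASE_CELLS; ++c) {
        int base[DOFS_PER_CELL];
        for (int d = 0; d < DOFS_PER_CELL; ++d)
            base[d] = cell_to_dof[c][d];       /* indirect access, done once per column */
        for (int l = 0; l < LAYERS; ++l)       /* structured, direct walk up the column */
            for (int d = 0; d < DOFS_PER_CELL; ++d)
                total += data[base[d] + l * dofs_per_layer];
    }
    printf("sum over extruded columns: %.1f\n", total);
    return 0;
}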
We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Optimizing their performance is a challenging issue: even though affine loop nests are generally present, the short trip counts and the complexity of the mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language, introducing domain-aware optimizations for instruction-level parallelism and register locality, and finally produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains are also discussed.
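The C sketch below illustrates one transformation of the kind described above, hoisting subexpressions that are invariant in the inner loops of an element-matrix assembly nest. It is hand-written for illustration, not COFFEE's generated code, and the quadrature weights, Jacobian factor, and tabulated basis values are placeholders.

/* Illustrative local assembly kernel with loop-invariant code motion:
 * w[q]*detJ is hoisted out of the i- and j-loops, and scale*phi[q][i]
 * out of the j-loop, leaving a single multiply-add innermost. */
#include <stdio.h>

#define NQ 4   /* quadrature points */
#define NB 3   /* basis functions   */

int main(void) {
    double w[NQ]       = {0.25, 0.25, 0.25, 0.25};             /* quadrature weights      */
    double detJ        = 2.0;                                  /* element Jacobian factor */
    double phi[NQ][NB] = {{1,2,3},{2,3,4},{3,4,5},{4,5,6}};    /* tabulated basis values  */
    double A[NB][NB]   = {{0}};

    for (int q = 0; q < NQ; ++q) {
        double scale = w[q] * detJ;               /* hoisted: invariant in i and j */
        for (int i = 0; i < NB; ++i) {
            double phi_i = scale * phi[q][i];     /* hoisted: invariant in j */
            for (int j = 0; j < NB; ++j)
                A[i][j] += phi_i * phi[q][j];     /* innermost loop: one multiply-add */
        }
    }
    for (int i = 0; i < NB; ++i)
        printf("A[%d][0] = %.2f\n", i, A[i][0]);
    return 0;
}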
Efficient Fork-Join on GPUs Through Warp Specialization Jacob, Arpith Chacko; Eichenberger, Alexandre E; Sung, Hyojin ...
2017 IEEE 24th International Conference on High Performance Computing (HiPC)
Conference Proceeding
Graphics Processing Units (GPUs) are increasingly used to accelerate portions of general-purpose applications. Higher-level language extensions have been proposed to help non-experts bridge the gap between the host and GPU threading models. Recent updates to the OpenMP standard allow a user to parallelize code on a GPU using the well-known fork-join programming model for CPUs. Mapping this model to the architecturally visible threading model of typical GPUs has been challenging. In this work we propose a novel approach using the technique of Warp Specialization. We show how to specialize one warp (a unit of 32 GPU threads) to handle sequential code on a GPU. When this master warp reaches a user-specified parallel region, it awakens unused GPU warps to collectively execute the parallel code. Based on this method, we have implemented a Clang-based, OpenMP 4.5 compliant, open-source compiler for GPUs. Our work achieves a 3.6x (and up to 32x) performance improvement over a baseline that does not exploit fork-join parallelism on an NVIDIA K40m GPU across a set of 25 kernels. Compared to state-of-the-art compilers (Clang-ykt, GCC-OpenMP, GCC-OpenACC) our work is 2.1x to 7.6x faster. Our proposed technique is simpler to implement, robust, and performant.
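The following is standard OpenMP 4.5 user code of the shape the paper targets: sequential code inside a target region followed by a fork into a parallel loop. It shows only the user-visible fork-join pattern that such a compiler must map to GPU warps; the warp-specialization machinery itself is internal to the compiler and not shown.

/* Fork-join inside a target region: the sequential part runs on a single
 * (master) thread on the device, then the parallel loop forks workers. */
#include <stdio.h>

#define N 1024

int main(void) {
    double a[N], b[N], scale = 0.0;
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 0.0; }

    #pragma omp target map(to: a) map(from: b) map(tofrom: scale)
    {
        scale = 2.0;                     /* sequential part: master-only code */
        #pragma omp parallel for         /* fork: workers execute the loop    */
        for (int i = 0; i < N; ++i)
            b[i] = scale * a[i];
    }                                    /* join at the end of the region     */

    printf("b[10] = %.1f\n", b[10]);
    return 0;
}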
Considering a 2D matrix of positive and negative numbers, how might one draw a rectangle within it whose contents sum higher than those of all other rectangles? This fundamental problem, commonly known as the maximum rectangle problem or subwindow search, spans many computational domains. Yet the problem has not been solved without demanding computational resources at least linearly proportional to the size of the matrix. In this work, we present a new approach to the problem which achieves sublinear time and memory complexities by interpolating between a small number of equidistant sections of the matrix. Applied to natural images, our solution outperforms the state of the art by achieving an 11x increase in speed and memory efficiency at 99% comparative accuracy. In general, our solution outperforms existing solutions when matrices are sufficiently large and a marginal decrease in accuracy is acceptable, such as in many problems involving natural images. As such, it is well suited for real-time application and for a variety of computationally hard instances of the maximum rectangle problem.
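For reference, the classic exact baseline against which sublinear approaches are compared is the 2D extension of Kadane's maximum-subarray scan: fix a pair of rows, collapse the columns between them into a 1-D array, and run the linear scan on it, for O(rows^2 * cols) time. The C sketch below implements that baseline on a small example; it is not the interpolation-based method of the paper.

/* Exact maximum-sum rectangle via row-pair enumeration plus Kadane's scan. */
#include <stdio.h>

#define ROWS 4
#define COLS 5

int main(void) {
    int m[ROWS][COLS] = {
        { 1, -2,  3,  4, -1},
        {-3,  4, -1,  2,  0},
        { 2, -1,  5, -3,  2},
        {-4,  1, -2,  3, -1}
    };
    int best = m[0][0];

    for (int top = 0; top < ROWS; ++top) {
        int col_sum[COLS] = {0};
        for (int bottom = top; bottom < ROWS; ++bottom) {
            /* col_sum[c] holds the sum of column c between rows top..bottom. */
            for (int c = 0; c < COLS; ++c) col_sum[c] += m[bottom][c];
            /* Kadane's scan over col_sum: best contiguous run of columns.    */
            int run = col_sum[0];
            if (run > best) best = run;
            for (int c = 1; c < COLS; ++c) {
                run = (run > 0 ? run : 0) + col_sum[c];
                if (run > best) best = run;
            }
        }
    }
    printf("maximum rectangle sum: %d\n", best);
    return 0;
}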
Specialized Kernels for Optimizing GPU Offload in OpenMP Chakrabarti, Dhruva; Rodgers, Gregory; Bertolli, Carlo ...
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis,
11/2023
Conference Proceeding
Programming models for general-purpose GPU (GPGPU) computing include grid and non-grid languages. Grid languages like CUDA and HIP map directly to the GPU hardware and can extract high performance from applications. However, this low-level programming approach makes them more difficult to program than non-grid languages such as C, C++, and Fortran with OpenMP target offload. Furthermore, grid languages often have more portability issues than non-grid languages. On the other hand, code generated from non-grid languages using automatic compiler and runtime techniques often incurs higher overhead in the GPU kernels produced for target regions.
This paper discusses compiler and runtime techniques to generate specialized, high-performance kernels for OpenMP target regions in certain common situations. We outline the conditions under which specialized kernels are generated for OpenMP target regions, both with and without reduction clauses. Experimental results on AMD GPUs indicate that a large percentage of OpenMP target regions are amenable to specialization and consequent improvement in performance.
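The snippet below is a representative OpenMP target region with a reduction clause, written in standard OpenMP; regions of this common shape are the kind the paper says can be recognized and lowered to a specialized kernel. The specialization itself happens inside the compiler and runtime and is not shown here.

/* A combined target construct with a reduction, one of the "common
 * situations" amenable to specialized kernel generation. */
#include <stdio.h>

#define N 4096

int main(void) {
    double x[N], sum = 0.0;
    for (int i = 0; i < N; ++i) x[i] = 0.5;

    #pragma omp target teams distribute parallel for reduction(+:sum) map(to: x) map(tofrom: sum)
    for (int i = 0; i < N; ++i)
        sum += x[i];

    printf("sum = %.1f\n", sum);   /* expect 2048.0 */
    return 0;
}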
Offloading Support for OpenMP in Clang and LLVM Antao, Samuel F.; Bataev, Alexey; Jacob, Arpith C. ...
2016 Third Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC),
11/2016
Conference Proceeding
OpenMP 4.5 allows performance portability by enabling users to write a single application code and run it on multiple types of accelerators. Our goal is to deliver a high-performance implementation of OpenMP in the Clang/LLVM project. This paper describes our initial work to fully support code generation for OpenMP device offloading constructs. We describe a new driver implementation to handle compilation for multiple host and device types, which generalizes the current Clang CUDA implementation and supports OpenMP. It can also be extended to any offloading-based language, including OpenCL and OpenACC. We describe an implementation of the OpenMP offloading constructs in the runtime library, giving details on two critical aspects: first, how data mapping is implemented; second, how different device code sections in the binaries are handled to enable application execution on different devices without recompilation. We report initial performance on a prototype that extends the current LLVM trunk repositories with all our proposed patches plus future ones, showing near-CUDA performance of our solution.
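For context on the data-mapping aspect mentioned above, the following is standard OpenMP 4.5 user code whose map-clause semantics such a runtime has to implement: copy `in` to the device before the region, copy `out` back afterwards, and keep `tmp` device-only via alloc. The arrays and values are made up for illustration.

/* Standard OpenMP 4.5 map clauses exercising the data-mapping runtime. */
#include <stdio.h>

#define N 256

int main(void) {
    double in[N], out[N], tmp[N];
    for (int i = 0; i < N; ++i) in[i] = i;

    #pragma omp target map(to: in) map(from: out) map(alloc: tmp)
    {
        for (int i = 0; i < N; ++i) tmp[i] = 2.0 * in[i];   /* device-resident scratch */
        for (int i = 0; i < N; ++i) out[i] = tmp[i] + 1.0;  /* copied back on exit     */
    }

    printf("out[3] = %.1f\n", out[3]);   /* 2*3 + 1 = 7.0 */
    return 0;
}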
Deep neural network models are becoming increasingly popular and have been used in various tasks such as computer vision, speech recognition, and natural language processing. Machine learning models are commonly trained in a resource-rich environment and then deployed in a distinct environment such as high-availability machines or edge devices. To assist the portability of models, the open-source community has proposed the Open Neural Network Exchange (ONNX) standard. In this paper, we present a high-level, preliminary report on our onnx-mlir compiler, which generates code for the inference of deep neural network models described in the ONNX format. Onnx-mlir is an open-source compiler implemented using the Multi-Level Intermediate Representation (MLIR) infrastructure recently integrated in the LLVM project. Onnx-mlir relies on the MLIR concept of dialects to implement its functionality. We propose two new dialects: (1) an ONNX-specific dialect that encodes the ONNX standard semantics, and (2) a loop-based dialect that provides a common lowering point for all ONNX dialect operations. Each intermediate representation facilitates its own characteristic set of optimizations, graph-level and loop-based respectively. We illustrate our approach by following several models through the proposed representations, and we include some early optimization work and performance results.