Program generators play a critical role in generating bug-revealing test programs for compiler testing. However, existing program generators have been tamed nowadays (i.e., compilers have been hardened against test programs generated by them), thus calling for new solutions to improve their capability in generating bug-revealing test programs. In this study, we propose a framework named RemGen, aiming to Remanufacture a random program Generator for this purpose. RemGen addresses the challenges of synthesizing diverse code snippets at a low cost and selecting the bug-revealing code snippets for constructing new test programs. More specifically, RemGen first designs a grammar-aided synthesis mechanism to synthesize diverse code snippets. Then, a grammar coverage-guided strategy is used to select the most diverse code snippets that may be bug-revealing. As a case study to demonstrate the effectiveness of the RemGen framework, we have remanufactured an old C program generator, CCG, and named it RemCCG. Our evaluation results show that RemCCG can generate significantly more bug-revealing test programs than the original CCG; notably, RemCCG has found 56 new bugs for two mature compilers (i.e., GCC and LLVM), of which 37 have already been fixed by their developers.
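The grammar coverage-guided selection step can be illustrated with a minimal sketch. The abstract does not give the actual selection algorithm, so the greedy rule below, the function name `select_by_grammar_coverage`, and the snippet-to-rule mapping are all assumptions made for illustration only.

```python
from typing import Dict, List, Set


def select_by_grammar_coverage(
    snippets: Dict[str, Set[str]], budget: int
) -> List[str]:
    """Greedily pick up to `budget` snippets, each adding the most uncovered grammar rules."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(snippets)
    while remaining and len(chosen) < budget:
        # Pick the snippet that contributes the most new rules (ties: lowest name).
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no remaining snippet adds new coverage
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


# Hypothetical snippets mapped to the grammar rules they exercise.
example = {
    "s1": {"if_stmt", "assign"},
    "s2": {"for_stmt", "assign", "call"},
    "s3": {"assign"},
}
picked = select_by_grammar_coverage(example, budget=2)
```

In this toy input, `s3` is never chosen because every rule it exercises is already covered by the other snippets.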
Summary
This paper presents an extended version of our previous work on using compiler technology to automatically convert sequential C++ data abstractions, for example, queues, stacks, maps, and trees, to concurrent lock‐free implementations. A key difference between our work and existing research in software transactional memory (STM) is that our compiler‐based approach automatically selects the best state‐of‐the‐practice nonblocking synchronization method for the underlying sequential implementation of the data structure. The extended material includes a broader collection of the state‐of‐the‐practice lock‐free synchronization techniques, additional formal correctness proofs of the overall integration of the different synchronizations in our system, and a more comprehensive experimental study of the integrated techniques. We evaluate our compiler‐generated nonblocking data structures both by using a collection of micro‐benchmarks, including the Synchrobench suite, and by using a multi‐threaded application, Dedup, from PARSEC. Our automatically synchronized code attains performance competitive with that of concurrent data structures manually written by experts and much better performance than heavier‐weight support by STM.
Recently, embedded systems, such as mobile platforms, have multiple processing units (PUs) that can operate in parallel, such as central processing units (CPUs) and neural processing units (NPUs). We can use deep‐learning compilers to generate machine code optimized for these embedded systems from a deep neural network (DNN). However, the deep‐learning compilers proposed so far generate code that sequentially executes DNN operators on a single processing unit, or parallel code for graphics processing units (GPUs). In this study, we propose PartitionTuner, an operator scheduler for deep‐learning compilers that supports multiple heterogeneous PUs including CPUs and NPUs. PartitionTuner can generate an operator‐scheduling plan that uses all available PUs simultaneously to minimize overall DNN inference time. Operator scheduling is based on the analysis of DNN architecture and the performance profiles of individual and group operators measured on heterogeneous processing units. In experiments on seven DNNs, PartitionTuner generates scheduling plans that perform 5.03% better than a static type‐based operator‐scheduling technique for SqueezeNet. In addition, PartitionTuner outperforms recent profiling‐based operator‐scheduling techniques for ResNet50, ResNet18, and SqueezeNet by 7.18%, 5.36%, and 2.73%, respectively.
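As a rough illustration of profile-based operator scheduling, the sketch below assigns operators to the processing unit on which they would finish earliest. It is not PartitionTuner's algorithm: it assumes the operators are mutually independent, ignores data-transfer costs between PUs, and uses made-up latency figures and a hypothetical function name `greedy_schedule`.

```python
from typing import Dict, Tuple


def greedy_schedule(
    profiles: Dict[str, Dict[str, float]]
) -> Tuple[Dict[str, str], float]:
    """Assign each operator to the PU where it would finish earliest.

    `profiles[op][pu]` is the measured latency of operator `op` on unit `pu`.
    Returns the operator-to-PU plan and the resulting makespan.
    """
    pus = sorted({pu for times in profiles.values() for pu in times})
    busy_until = {pu: 0.0 for pu in pus}
    plan: Dict[str, str] = {}
    # Place operators with the largest best-case latency first.
    order = sorted(profiles, key=lambda op: -min(profiles[op].values()))
    for op in order:
        best_pu = min(profiles[op], key=lambda pu: busy_until[pu] + profiles[op][pu])
        busy_until[best_pu] += profiles[op][best_pu]
        plan[op] = best_pu
    return plan, max(busy_until.values())


# Hypothetical per-PU latencies in milliseconds.
profiles = {
    "conv1": {"cpu": 8.0, "npu": 2.0},
    "conv2": {"cpu": 6.0, "npu": 3.0},
    "pool": {"cpu": 1.0, "npu": 4.0},
}
plan, makespan = greedy_schedule(profiles)
```

With these numbers both convolutions land on the NPU and the pooling operator runs concurrently on the CPU, so the makespan is set by the NPU alone.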
Error correction is often indispensable in a modern digital communication system that transmits data at a very high speed. Recently published IEEE Std 802.3bs requires an astounding throughput of 400 Gb/s while using the Reed-Solomon code (RS-Code) to protect the integrity of the transmitted data. An RS-Codec supporting such a high throughput demands a significant silicon area. Improper decisions on the parameters of the parallel architecture could lead to unnecessarily high costs in the implementation. We have developed a compiler to solve this problem. First, the Codec satisfying IEEE Std 802.3bs using RS(544, 514) is parameterized, in a way that the throughput can be boosted on demand by setting some "configuration." Second, an area- and power-efficient RS-Codec design satisfying a target throughput using a specific process can be inferred by our compiler in just minutes, and thereby easy process migration is supported. Experimental results using 28 and 90-nm CMOS processes are presented to demonstrate their effectiveness.
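The arithmetic behind RS(544, 514) and the parallelism needed for a target throughput can be worked through in a few lines. The code parameters come from the abstract; the 10-bit symbol width and the 625 MHz clock are assumptions introduced for this example, and the function names are hypothetical.

```python
def rs_parameters(n: int, k: int) -> dict:
    """Basic properties of an RS(n, k) code with n total and k data symbols."""
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_symbols": parity // 2,  # t = (n - k) / 2 symbol errors
        "code_rate": k / n,
    }


def symbols_per_cycle(throughput_bps: int, clock_hz: int, symbol_bits: int) -> int:
    """Symbols that must be processed each clock cycle to sustain the throughput."""
    bits_per_cycle_per_symbol = clock_hz * symbol_bits
    # Integer ceiling division avoids floating-point rounding.
    return -(-throughput_bps // bits_per_cycle_per_symbol)


params = rs_parameters(544, 514)

# Assumed values: 10-bit symbols and a 625 MHz clock (neither is from the abstract).
parallelism = symbols_per_cycle(400_000_000_000, 625_000_000, 10)
```

Under these assumptions the codec would need to handle 64 symbols per cycle; halving the clock frequency doubles that figure, which is the kind of area trade-off a parameterized design exposes.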
• This paper presents an automatic and effective method to parallelize applications.
• The KIR models the whole program handling syntactical variations in the source code.
• Our technique builds a global OpenMP parallelization targeting multicore processors.
• The benchmarks include linear algebra routines and applications from SPEC CPU2000.
• The automatic parallelization of GCC, Intel and PLUTO compilers is evaluated.
The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
Simulink compiler testing is important since all cyber-physical system (CPS) models are required to be compiled by the Simulink compiler. Current testing processes use CPS models generated by CPS model generators for testing. Since the effectiveness of CPS model generators heavily relies on suitable generator configurations, existing approaches randomize configurations or infer configurations with historical bug information to generate diverse bug-triggering CPS models. However, these approaches are designed for general-purpose compilers (e.g., GCC) and face two challenges when testing the Simulink compiler, namely, the CPS model representation challenge of representing CPS models for diversity measurement and the configuration learning challenge of learning configurations to generate diverse CPS models. To address these challenges, we propose Reinforcement lEarning-based COnfiguRation Diversification (RECORD), a new configuration diversification approach. RECORD has a feature vectorization component, which addresses the first challenge by representing CPS models as feature vectors to capture the local and global characteristics of CPS models for diversity measurement. RECORD then uses a reinforcement learning component to generate diverse CPS models based on the learned relationship between configuration updates and diversity changes, thus addressing the second challenge. Experiments demonstrate that within three months, RECORD reported 11 confirmed Simulink compiler bugs, significantly outperforming the state-of-the-art configuration diversification approaches. RECORD can also facilitate different testing strategies to find more bugs.
Quantum computers promise to transform our notions of computation by offering a completely new paradigm. To achieve scalable quantum computation, optimizing compilers and a corresponding software design flow will be essential. We present a software architecture for compiling quantum programs from a high-level language program to hardware-specific instructions. We describe the necessary layers of abstraction and their differences and similarities to classical layers of a computer-aided design flow. For each layer of the stack, we discuss the underlying methods for compilation and optimization. Our software methodology facilitates more rapid innovation among quantum algorithm designers, quantum hardware engineers, and experimentalists. It enables scalable compilation of complex quantum algorithms and can be targeted to any specific quantum hardware implementation.
hiCUDA: High-Level GPGPU Programming
Han, Tianyi David; Abdelrahman, Tarek S
IEEE Transactions on Parallel and Distributed Systems, January 2011, Volume 22, Issue 1
Journal Article, Peer reviewed
Graphics Processing Units (GPUs) have become a competitive accelerator for applications outside the graphics domain, mainly driven by the improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner and directly to the sequential code, thus speeding up the porting process. In this paper, we describe the hiCUDA directives as well as the design and implementation of a prototype compiler that translates a hiCUDA program to a CUDA program. Our compiler is able to support real-world applications that span multiple procedures and use dynamically allocated arrays. Experiments using nine CUDA benchmarks show that the simplicity hiCUDA provides comes at no expense to performance.
Abstract
OpenMP supports incremental parallel development and has become a mainstream parallel programming standard for shared-memory systems. A parallel compiler running on a multi-core DSP is designed for OpenMP programs. The main contributions are a translator and a runtime. The translator converts OpenMP directives in source files into calls to runtime functions, where the specific implementation is provided. The core problem of a parallel compiler is how to design a parallel strategy that allocates computing tasks to each core, which corresponds to the concept of the parallel region in the OpenMP standard. The compiler realizes the master-slave core model by transforming parallel directives and supporting the runtime, which has guiding significance for the design of parallel compilers.
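One simple way to allocate computing tasks to each core is a static, near-equal split of loop iterations, similar in spirit to OpenMP's static schedule. The abstract does not say which strategy the compiler's runtime uses, so the sketch below is an assumption; `static_chunks` is a hypothetical name, not a function of the described runtime, and the cores are simulated sequentially.

```python
from typing import List


def static_chunks(num_iterations: int, num_cores: int) -> List[range]:
    """Split iterations 0..num_iterations-1 into one contiguous chunk per core."""
    base, extra = divmod(num_iterations, num_cores)
    chunks: List[range] = []
    start = 0
    for core in range(num_cores):
        # The first `extra` cores take one additional iteration.
        size = base + (1 if core < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Master-slave style reduction, simulated sequentially: each core sums its
# own chunk, then the master combines the partial results.
data = list(range(1, 11))
chunks = static_chunks(len(data), num_cores=4)
partial_sums = [sum(data[i] for i in chunk) for chunk in chunks]
total = sum(partial_sums)
```

Contiguous chunks keep each core's memory accesses local, which is one reason a static split is a common default when iterations cost roughly the same.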
In this paper, we introduce Canis, a high‐level domain‐specific language that enables declarative specifications of data‐driven chart animations. By leveraging data‐enriched SVG charts, its grammar of animations can be applied to the charts created by existing chart construction tools. With Canis, designers can select marks from the charts, partition the selected marks into mark units based on data attributes, and apply animation effects to the mark units, with the control of when the effects start. The Canis compiler automatically synthesizes the Lottie animation JSON files [Aira], which can be rendered natively across multiple platforms. To demonstrate Canis’ expressiveness, we present a wide range of chart animations. We also evaluate its scalability by showing the effectiveness of our compiler in reducing the output specification size and comparing its performance on different platforms against D3.