Cyber-physical systems (CPS) comprise a variety of multicore architectures, including central processing units (CPUs) and graphics processing units (GPUs). In general, programmers assign sequential programs to the CPU, while parallel applications are assigned to the GPU. This article provides a method for mapping an OpenCL application to a heterogeneous multicore architecture using active fuzzy learning to determine the suitability and processing characteristics of the application. During learning, subsamples are created by developing a machine learning-based device suitability classifier that predicts which processor offers the highest computational compatibility for running a given OpenCL program. In addition, this study integrates an entropy-based active learning model with a fuzzification model to find nonoverlapping patterns. To minimize rule generation, a fuzzification-based weighted probabilistic technique is presented. The defuzzification process is optimized by using uncertainty values in conjunction with classification probabilities. In addition, 20 features are proposed for extraction using a newly developed LLVM-based static analyzer. Correlation analysis is used to determine the optimal subset of features. The synthetic minority oversampling technique, with and without feature selection, is used to mitigate the class imbalance problem. Instead of manually tuning the machine learning classifier, a tree-based pipeline construction approach is used to determine the optimal classifier and its hyperparameters. Experiments are then conducted on a set of benchmarks to verify the performance of the designed model. The results show that, by increasing the number of training examples and including an entropy-based uncertainty measure, the proposed model is able to refine and improve decision boundaries. We achieved an F-measure of 0.77 and a ROC AUC of 0.92 by optimizing and reducing the feature subsets.
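The entropy-based sample selection at the heart of the active learning step above can be illustrated with a short sketch. This is our illustration, not the paper's code: the function names and the two-class CPU/GPU setup are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Return the indices of the k samples whose predictions are most
    uncertain (highest entropy) -- the ones an entropy-based active
    learner would send for labeling next."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

# Predicted (CPU, GPU) suitability probabilities for four kernels:
preds = [(0.95, 0.05), (0.50, 0.50), (0.80, 0.20), (0.55, 0.45)]
print(select_most_uncertain(preds, 2))  # → [1, 3]: the two most ambiguous
```

Labeling exactly the examples the classifier is least sure about is what lets the model "support and improve decision boundaries" with comparatively few training examples.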
Written by leaders in the parallel computing and OpenCL communities, this book includes multiple case studies, examples, and source code, and teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures.
In the past decade, the high-performance compute capabilities exhibited by heterogeneous GPGPU platforms have led to the popularity of data-parallel programming languages such as CUDA and OpenCL. Developing high-performance parallel programming solutions using such languages involves a steep learning curve due to the complexity of the underlying heterogeneous compute devices and their impact on performance. This has led to the emergence of several high-performance computing frameworks which provide high-level abstractions for easing the development of data-parallel applications on heterogeneous platforms. However, the scheduling decisions undertaken by such frameworks only exploit coarse-grained concurrency in data-parallel applications. In this paper, we propose PySchedCL, a framework which explores fine-grained concurrency-aware scheduling decisions that harness the power of heterogeneous CPU/GPU architectures efficiently. We showcase the efficacy of such scheduling mechanisms over existing coarse-grained dynamic scheduling schemes by conducting extensive experimental evaluations on a diverse set of popular deep learning benchmarks.
Future experiments in high-energy physics will pose stringent requirements on computing, in particular on real-time data processing. As an example, the CBM experiment at FAIR, Germany, intends to perform online data selection exclusively in software, without using any hardware trigger, at extreme interaction rates of up to 10 MHz. In this article, we describe how heterogeneous computing platforms, graphics processing units (GPUs) and CPUs, can be used to solve the associated computing problems, using the example of the first-level event selection process sensitive to J/ψ decays in muon detectors. We investigate and compare pure parallel computing paradigms (POSIX threads, OpenMP, MPI) and heterogeneous parallel computing paradigms (CUDA, OpenCL) on both CPU and GPU architectures and demonstrate that the problem under consideration can be accommodated with a moderate deployment of hardware resources, provided optimal use is made of their compute power. In addition, we compare OpenCL with the pure parallel computing paradigms on CPUs and show that OpenCL can be considered a single parallel paradigm for all hardware resources.
•Systematic study and development of an event selection algorithm for the CBM-MUCH.•The FLES process suppresses the archival data rate by almost two orders of magnitude.•The process satisfies the CBM requirements for high-rate data taking at 10⁷ events per second.•Almost a million events per second can be processed using a single NVIDIA Tesla GPU.•Comparison performed between OpenCL, Pthreads, OpenMP, and MPI as open-source concurrency paradigms.
•Extensions to the OpenCL API are proposed to support automatic task scheduling.•An example runtime system called MultiCL is designed and optimized.•MultiCL achieves near-ideal task-device mapping with negligible runtime overhead.
The OpenCL specification tightly binds a command queue to a specific device. For best performance, the user has to find the ideal queue-device mapping at command queue creation time, an effort that requires a thorough understanding of the underlying device architectures and the kernels in the program. In this paper, we propose adding scheduling attributes to the OpenCL context and command queue objects that can be leveraged by an intelligent runtime scheduler to automatically perform ideal queue-device mapping. Our proposed extensions enable the average OpenCL programmer to focus on the algorithm design rather than scheduling, and to gain performance automatically without sacrificing programmability. As an example, we design and implement an OpenCL runtime for task-parallel workloads, called MultiCL, which efficiently schedules command queues across devices.
Our case studies include the SNU benchmark suite and a real-world seismology simulation. To benefit from our runtime optimizations, users have to apply our proposed scheduler extensions to only four source lines of code, on average, in existing OpenCL applications. We evaluate both single-node and multinode experiments and also compare with SOCL, our closest related work. We show that MultiCL maps command queues to the optimal device set in most cases with negligible runtime overhead.
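The queue-device mapping problem MultiCL solves can be sketched with a toy greedy scheduler. This is purely illustrative of the concept, not MultiCL's actual algorithm or API; the function name, the cost table, and the greedy policy are our assumptions.

```python
def map_queues(queues, devices, est_time):
    """Greedily assign each command queue to the device with the smallest
    projected finish time. est_time[(queue, device)] is an estimated
    runtime for that queue's kernels on that device."""
    load = {d: 0.0 for d in devices}
    mapping = {}
    # Place the most expensive queues first so large tasks get the best
    # device before the loads build up.
    for q in sorted(queues,
                    key=lambda q: -min(est_time[(q, d)] for d in devices)):
        best = min(devices, key=lambda d: load[d] + est_time[(q, d)])
        mapping[q] = best
        load[best] += est_time[(q, best)]
    return mapping

costs = {("fft", "cpu"): 10.0, ("fft", "gpu"): 2.0,
         ("blur", "cpu"): 3.0, ("blur", "gpu"): 4.0}
print(map_queues(["fft", "blur"], ["cpu", "gpu"], costs))
# → {'blur': 'cpu', 'fft': 'gpu'}
```

Even this crude policy captures the point of the paper's extensions: the mapping is chosen by a runtime from performance estimates rather than fixed by the programmer at queue creation time.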
This brief proposes and evaluates several OpenCL-based implementations of a Secure Hash Algorithm-3 (SHA-3) co-processor. These implementations are developed based on OpenCL optimization techniques, and their impact on throughput and speedup is reported. The experimental results show that the proposed optimization techniques achieve a 310x speedup when compared to an unoptimized baseline implementation. Moreover, the best optimized SHA-3 co-processor achieves a throughput of 22.36 Gbps, which is twice as high as the best previously published SHA-3 implementation developed with a high-level synthesis tool, and higher than the performance of most previously reported implementations developed using hardware description languages (HDLs). As a result, an efficient OpenCL-based SHA-3 co-processor suitable for FPGA platforms is proposed. To our knowledge, the reported SHA-3 co-processor is the first OpenCL implementation targeting an FPGA-based edge computing platform.
Codes written in a naive way seldom exploit the computing resources effectively, while writing optimized codes is usually a complex task that requires a certain level of expertise. This problem is compounded in the presence of heterogeneous devices, which expose more tunable parameters than regular CPUs and are highly sensitive to the optimization decisions taken. Furthermore, portability is an added concern given the wide variety of accelerators available. This paper tackles this problem by adding an automatic optimizer to a library that already provides an easy and portable way to program heterogeneous devices, the Heterogeneous Programming Library (HPL). Our optimizer takes as input a simple version of a code and then tunes it for the device where it is going to be executed by performing the most common set of optimizations applicable to heterogeneous devices. These optimizations are parametrized using a set of optimization parameters that need to be tuned for the device, and the HPL library has been equipped with an autotuner for this purpose. The effectiveness of the autotuner and the optimizer has been tested on several codes and devices. The results show that the combination of the autotuner and the optimizer makes the tested codes, on average, 16 times faster than the original codes written by the programmer.
•This work adds an automatic optimizer for heterogeneous devices to the Heterogeneous Programming Library (HPL).•The optimizer performs the most common set of optimizations applicable to heterogeneous devices.•The parameters that guide these optimizations need to be tuned.•The HPL library has also been equipped with an autotuner for these parameters.•The combination of the autotuner and the optimizer makes the tested codes 16 times faster on average.
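The empirical autotuning loop described above can be reduced to a simple skeleton: run each candidate parameter setting, time it, and keep the fastest. This is our minimal sketch of the general idea; HPL's real autotuner is more elaborate and operates on device-level optimization parameters.

```python
import time

def autotune(run, candidates):
    """Time run(params) for each candidate parameter setting and return
    the fastest one -- the basic loop behind empirical autotuning."""
    best, best_t = None, float("inf")
    for params in candidates:
        t0 = time.perf_counter()
        run(params)           # execute the tunable code with these params
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = params, t
    return best

# Toy example: the workload grows with the parameter, so the autotuner
# should pick the smallest candidate.
print(autotune(lambda n: sum(range(n * 100_000)), [200, 1]))
```

In practice the search space (tile sizes, unroll factors, work-group shapes) is large, which is why the exhaustive loop is usually paired with search-space pruning or heuristics.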
The simulation of heat flow through heterogeneous material is important for the design of structural and electronic components. Classical analytical solutions to the heat equation PDE are not known for many such domains, even those having simple geometries. The finite element method can provide approximations to a weak-form continuum solution, with increasing accuracy as the number of degrees of freedom in the model increases. This comes at a cost of increased memory usage and computation time, even when taking advantage of sparse matrix techniques for the finite element system matrix. We summarize recent approaches to solving problems in structural mechanics and steady-state heat conduction which do not require the explicit assembly of any system matrices, and adapt them to a method for solving the time-dependent flow of heat. These approaches are highly parallelizable and can be performed on graphics processing units (GPUs). Furthermore, they lend themselves to the simulation of heterogeneous material with a minimum of added complexity. We present the mathematical framework of assembly-free FEM approaches, through which we summarize the benefits of GPU computation. We discuss our implementation using the OpenCL computing framework and show how it is further adapted for use on multiple GPUs. We compare the performance of single- and dual-GPU implementations of our method with previous GPU computing strategies from the literature and with a CPU sparse matrix approach. The utility of the novel method is demonstrated through the solution of a real-world coefficient inverse problem that requires thousands of transient heat flow simulations, each of which involves solving a 1-million-degree-of-freedom linear system over hundreds of time steps.
•The system matrix for heat conduction FEM is decomposed for massive parallelism.•The assembly-free methods are well-suited for heat flow through heterogeneous media.•Three implementations are described and compared with a serial sparse matrix method.•Implementations are provided for use on single and dual GPUs with OpenCL.
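The core of the assembly-free idea is that the product of the global conductivity matrix with a vector can be accumulated element by element, so the global sparse matrix never needs to be stored. The following is a toy 1D sketch of that principle, not the paper's 3D implementation; the function name and the two-node element stencil are our choices.

```python
def assembly_free_matvec(x, k):
    """Apply the global conductivity matrix K to x without assembling K.
    Each 1D two-node element with conductivity k[e] contributes the local
    stencil k[e] * [[1, -1], [-1, 1]] to its two nodes. The element loop
    accumulates those contributions directly -- on a GPU, one work-item
    per element does the same job in parallel."""
    n = len(x)
    y = [0.0] * n
    for e in range(n - 1):              # loop over elements
        d = k[e] * (x[e] - x[e + 1])    # local stencil applied to (x_e, x_e+1)
        y[e] += d
        y[e + 1] -= d
    return y

# Heterogeneous material: per-element conductivities k differ.
print(assembly_free_matvec([1.0, 2.0, 4.0], [1.0, 2.0]))  # → [-1.0, -3.0, 4.0]
```

Because iterative solvers such as conjugate gradients only ever need matrix-vector products, this per-element kernel is all that is required to run the transient heat solve, and heterogeneous conductivities cost nothing extra.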
Two improvements are proposed for the OpenCL-based implementation of the social field pedestrian model. On the algorithmic side, a method based on the idea of divide-and-conquer is devised to overcome the problem of global memory depletion when fields are large. This is important for the study of finer pedestrian walking behavior, which usually requires larger fields. On the computational side, the OpenCL heterogeneous framework is thoroughly studied: factors that may affect numerical efficiency are evaluated with regard to the previously proposed social field model, including the use of local memory, deliberate padding of data structures to avoid bank conflicts, and so on. Experiments show that the numerical efficiency is raised to an even higher level: compared with the CPU model and the previous GPU model, the present GPU model is up to 71.56 and 13.3 times faster, respectively.
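The padding trick mentioned above works because local memory is divided into banks that serve concurrent work-items; when many work-items in a wavefront read the same column of a row-major tile whose width is a multiple of the bank count, every access lands in the same bank and is serialized. The bank arithmetic can be checked with a short sketch (assuming 32 banks of 4-byte words, a common GPU configuration; the exact bank count and the right padding are device-specific):

```python
BANKS = 32  # assumed bank count; check your device's documentation

def banks_hit(width, column):
    """Distinct banks touched when 32 concurrent work-items each read
    row-major tile[row][column] from local memory (4-byte elements)."""
    return {(row * width + column) % BANKS for row in range(32)}

# A 32-wide tile: every row of a column maps to one bank -> 32-way conflict.
print(len(banks_hit(32, 0)))   # → 1
# Padding each row by one element spreads the column across all 32 banks.
print(len(banks_hit(33, 0)))   # → 32
```

This is why "deliberate padding of data structures" (declaring the tile one element wider than it logically needs to be) removes the serialization at the cost of a small amount of wasted local memory.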