Processors with hundreds of threads of execution are among the state of the art in high-end computing systems. This transition to many-core computing has required the community to develop new algorithms to overcome significant latency bottlenecks through massive concurrency. However, implementing efficient parallel runtimes that can scale to high concurrency levels with extremely fine-grained tasks remains a challenge. Existing techniques do not scale to large numbers of threads because of the high cost of synchronization in concurrent data structures. We present a thorough analysis of the synchronization mechanisms typically used to build concurrent data structures in task-parallel runtime systems, including mutexes, semaphores, spinlocks, and atomic fetch-and-add. To overcome these limitations, in recent work we proposed XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards scalability up to hundreds of concurrent threads. In this work, we extend XQueue and present X-OpenMP, a library for enabling extremely fine-grained parallelism on modern many-core systems with hundreds of cores. Work stealing is a popular choice for load balancing in task-based runtime systems because it efficiently distributes load across worker threads; however, traditional approaches rely on synchronization primitives, so work stealing can incur significant overhead. Here we implement a lock-less work-stealing algorithm for total-store-order (TSO) memory architectures and evaluate its performance using micro- and macro-benchmarks. We compare the performance of X-OpenMP with the native LLVM OpenMP, GNU OpenMP, OpenCilk, and oneTBB implementations using task-based linear algebra routines from the PLASMA numerical library, Strassen’s matrix multiplication from the BOTS benchmark suite, and the Unbalanced Tree Search benchmark.
Applications parallelized using OpenMP can run without modification by simply linking against the X-OpenMP library. X-OpenMP achieves up to 40X speedup compared to GNU OpenMP, up to 2X speedup compared to the native LLVM OpenMP, up to 6X speedup compared to OpenCilk and up to 5X speedup compared to oneTBB implementations. The tasking overheads in X-OpenMP are reduced by 50% compared to the native LLVM OpenMP.
• The overhead of synchronization methods is quantified on modern hardware with hundreds of cores.
• A novel approach to building a low-overhead parallel runtime system using lock-less concurrent queuing and lock-less work stealing is proposed.
• The low-overhead lock-less runtime is integrated with OpenMP; applications can run unmodified simply by linking against the proposed library.
• Benchmark results and lessons learned from several HPC applications are reported.
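The lock-less queuing idea behind XQueue can be illustrated with a much simpler structure. The sketch below is not XQueue itself (whose design is multi-producer with relaxed ordering semantics); it is a minimal single-producer/single-consumer ring buffer in C11 atomics, showing how an acquire/release pair can replace a mutex entirely. All names (`spsc_*`, `CAP`) are invented for this illustration.

```c
/* Minimal lock-free SPSC ring buffer in C11 atomics. NOT the XQueue
 * implementation; it only illustrates how acquire/release ordering
 * lets a concurrent queue avoid mutexes. Names are invented. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 256 /* power of two, so masking replaces modulo */

typedef struct {
    _Atomic size_t head; /* consumer position */
    _Atomic size_t tail; /* producer position */
    int slots[CAP];
} spsc_queue;

static bool spsc_push(spsc_queue *q, int v) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == CAP) return false; /* full */
    q->slots[t & (CAP - 1)] = v;
    /* release: the slot write becomes visible before the new tail */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_queue *q, int *out) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return false; /* empty */
    *out = q->slots[h & (CAP - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

/* single-threaded self-test: FIFO order must be preserved */
static int spsc_selftest(void) {
    static spsc_queue q;
    for (int i = 0; i < 100; i++)
        if (!spsc_push(&q, i)) return 0;
    for (int i = 0; i < 100; i++) {
        int v;
        if (!spsc_pop(&q, &v) || v != i) return 0;
    }
    return 1;
}
```

On TSO hardware the release store compiles to a plain store, which is one reason relaxed-ordering queues scale well there.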
Ionization chambers were designed and constructed to determine the kerma rates in various materials within several centimeters of a Training, Research, Isotopes, General Atomics (TRIGA) reactor core operating at 1 MW. The primary aim of this article was to compare kerma measurements with advanced Monte Carlo code calculations of nuclear heating. Wall thickness, collection gap, and fill-gas pressure were chosen to satisfy the Bragg-Gray criteria, so that the measured ionization current was related to the kerma rate in the wall material. Chamber wall materials composed of low-mass-number elements, including hydrogen-rich C552 air-equivalent plastic and beryllium, were selected to measure the kerma due to fast-neutron elastic scattering. By operating these neutron-sensitive chambers coincidentally with relatively neutron-insensitive chambers composed of aluminum and Zircaloy-4, we were able to measure the total heating due to fast neutrons and gamma rays in a material and to differentiate these heating components. A chamber composed of borated stainless steel was used in a similar fashion to measure the thermal neutron flux. The total kerma rate was also measured in various materials typically found in a reactor core. Chamber collection volumes were initially determined using ambient-air fill gas and National Institute of Standards and Technology (NIST)-traceable air-kerma rates from a 60Co source. All chambers were sealed with argon gas to provide thermal and compositional stability. Chamber properties, including stability, saturation, and the gas-phase mass subject to charge collection, were determined using the 60Co source. Chambers were operated for approximately 30 min adjacent to the reactor core, and the integrity of the gas seals was subsequently verified by repeating the measurement with the 60Co source.
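The link between the measured current and the kerma rate follows the textbook Bragg-Gray cavity relation; the authors' exact formulation may differ, so this is only the standard form of the argument. The dose rate to the fill gas follows from the collected current, and the small-cavity condition transfers it to the wall:

```latex
\dot{D}_{\mathrm{gas}} = \frac{I}{m_{\mathrm{gas}}}\left(\frac{\bar{W}}{e}\right),
\qquad
\dot{K}_{\mathrm{wall}} \approx \dot{D}_{\mathrm{wall}}
  = \dot{D}_{\mathrm{gas}}\left(\frac{\bar{S}}{\rho}\right)_{\mathrm{gas}}^{\mathrm{wall}}
```

Here \(I\) is the ionization current, \(m_{\mathrm{gas}}\) the gas mass subject to charge collection, \(\bar{W}/e\) the mean energy expended per unit charge, and the last factor the wall-to-gas mass stopping power ratio; the approximation of wall kerma by wall dose assumes charged-particle equilibrium.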
► The status of current full-scale commercial SCWO facilities is reviewed.
► There are currently 6 commercial firms active in SCWO.
► There are 2 full-scale commercial SCWO plants (and 1 near-critical hydrolysis plant) operating in the world today.
► There are 9 full-scale SCWO plants in the planning stages, with 7 expected to begin operation within the next 1–2 years.
After more than three decades since its potential was first recognized, supercritical water oxidation (SCWO) remains an innovative and viable treatment technology for the destruction of aqueous-based organic wastes. An extensive database of destruction efficiencies, corrosion data, and salt phase behavior has been developed over the years through the combined efforts of many investigators at both the fundamental-research and commercial levels. As a result, SCWO technology has been increasingly utilized in a variety of full-scale designs and applications, handling feeds as diverse as polychlorinated biphenyls (PCBs), sewage sludge, spent catalysts, and chemical weapons. This paper reviews the status of current full-scale commercial SCWO facilities around the world, focusing on the unique challenges and design strategies employed by different companies for corrosion and salt precipitation control in each application. A summary of past commercial SCWO activity as well as future plans among the currently active SCWO companies is also included.
Extensive polling in shared-memory manycore systems can lead to contention, decreased throughput, and poor energy efficiency. Both lock implementations and the general-purpose atomic operation load-reserved/store-conditional (LRSC) cause polling due to serialization and retries. To alleviate this overhead, we propose LRwait and SCwait, a synchronization pair that eliminates polling by allowing contending cores to sleep while waiting for previous cores to finish their atomic access. As a scalable implementation of LRwait, we present Colibri, a distributed and scalable approach to managing LRwait reservations. Through extensive benchmarking on an open-source RISC-V platform with 256 cores, we demonstrate that Colibri outperforms current synchronization approaches for various concurrent algorithms under both high and low contention in terms of throughput, fairness, and energy efficiency. With an area overhead of only 6%, Colibri outperforms LRSC-based implementations by a factor of 6.5× in throughput and 7.1× in energy efficiency.
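The polling that LRwait/SCwait removes is easiest to see in how an LR/SC pair is typically emulated in portable C: a compare-and-swap retry loop, where every failed store-conditional forces another poll of the location. The sketch below shows that pattern under stated assumptions; `lrsc_style_add` and the self-test are invented names, and this is an illustration of the retry loop, not of the proposed hardware.

```c
/* How an LR/SC atomic update is commonly emulated in C11: a CAS retry
 * loop. Under contention each failed "SC" forces another poll -- the
 * overhead LRwait/SCwait removes by letting losers sleep instead of
 * retrying. Names are invented for this sketch. */
#include <stdatomic.h>

static long lrsc_style_add(_Atomic long *p, long delta) {
    long old = atomic_load_explicit(p, memory_order_relaxed); /* "LR" */
    /* "SC": succeeds only if no intervening write occurred; otherwise
     * the failed CAS reloads *p into old and we poll again */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* retry */
    }
    return old + delta;
}

/* deterministic single-threaded self-test */
static long lrsc_selftest(void) {
    static _Atomic long counter = 0;
    long last = 0;
    for (int i = 0; i < 1000; i++)
        last = lrsc_style_add(&counter, 1);
    return last; /* 1000 */
}
```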
Summary
We present a technique and automated toolbox for randomized testing of C compilers. Unlike prior compiler-testing approaches, we generate concurrent test cases in which threads communicate using fine-grained atomic operations, and we study actual compiler implementations rather than mappings. Our approach is (1) to generate test cases with precise oracles directly from an axiomatization of the C concurrency model; (2) to apply metamorphic fuzzing to each test case, aiming to amplify the coverage they are likely to achieve on compiler codebases; and (3) to execute each fuzzed test case extensively on a range of real machines. Our tool, C4, benefits compiler developers in two ways. First, test cases generated by C4 can achieve line coverage of parts of the LLVM C compiler that are reached by neither the LLVM test suite nor an existing (sequential) C fuzzer. This information can be used to guide further development of the LLVM test suite and can also shed light on where and how concurrency-related compiler optimizations are implemented. Second, C4 can be used to gain confidence that a compiler implements concurrency correctly. As evidence of this, we show that C4 achieves high strong mutation coverage with respect to a set of concurrency-related mutants derived from a recent version of LLVM and that it can find historic concurrency-related bugs in GCC. As a by-product of concurrency-focused testing, C4 also revealed two previously unknown sequential compiler bugs in recent versions of GCC and the IBM XL compiler.
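To make the kind of test case concrete: the C concurrency model forbids certain outcomes of small "litmus" programs, and a miscompilation of an atomic operation can make a forbidden outcome observable. The sketch below is a classic message-passing litmus test in C11 atomics, hedged as an illustration of the genre rather than an actual C4-generated case; all names are invented.

```c
/* A message-passing litmus test of the kind concurrency-focused
 * compiler testing exercises: a release store publishes `data`, and
 * once the acquire load observes flag == 1, the C model guarantees
 * the reader sees data == 42. Names are invented for this sketch. */
#include <stdatomic.h>
#include <pthread.h>

static _Atomic int flag = 0;
static int data = 0;

static void *writer(void *arg) {
    (void)arg;
    data = 42;                                             /* plain write */
    atomic_store_explicit(&flag, 1, memory_order_release); /* publish */
    return NULL;
}

static void *reader(void *arg) {
    /* spin until the release store becomes visible */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    *(int *)arg = data; /* must observe 42 under release/acquire */
    return NULL;
}

static int mp_litmus(void) {
    int observed = 0;
    pthread_t w, r;
    pthread_create(&r, NULL, reader, &observed);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return observed;
}
```

A compiler that reorders the plain write past the release store, or weakens the acquire load, would let `mp_litmus` return a stale value; repeated execution on real machines is what surfaces such bugs.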
Truth pluralists say that truth-bearers in different “discourses”, “domains”, “domains of discourse”, or “domains of inquiry” are apt to be true in different ways – for instance, that mathematical discourse or ethical discourse is apt to be true in a different way to ordinary descriptive or scientific discourse. Moreover, the notion of a “domain” is often explicitly employed in formulating pluralist theories of truth. Consequently, the notion of a “domain” is attracting increasing attention, both critical and constructive. I argue that this is a red herring. First, I identify the theoretical role for which pluralists appeal to domains, which is to answer what I call the “Individuation Problem”: saying what determines the way in which a particular truth-bearer is apt to be true. Second, I argue that pluralists need not appeal to domains for this purpose. I thus conclude that, despite the usual way of glossing the view, there is no role for the notion of a “domain” to play in the pluralist’s theory of truth. I argue that this defuses the “Problem of Mixed Atomics” and allows the pluralist to sidestep potentially intractable disputes about the nature of domains.
Rolling bearings are key rotating parts in machinery, and it is crucial to extract their latent fault features during condition monitoring to avoid sudden accidents. Unfortunately, latent fault features are hard to extract with traditional signal processing methods such as envelope demodulation, whose effectiveness is strongly influenced by the level of background noise. Sparse decomposition, a promising method capable of capturing the latent fault feature components buried in the vibration signal, has attracted considerable attention, especially predefined dictionary-based sparse decomposition methods. However, the feature extraction performance of predefined dictionary-based sparse decomposition depends on whether sufficient prior knowledge of the analyzed signal is available. To overcome these problems, this article proposes a feature extraction method for latent fault components of rolling bearings based on self-learned sparse atomics and frequency band entropy. First, the self-learned sparse atomics method is applied to the early weak vibration signal of the rolling bearing, and several self-learned atomics are obtained. Then, the self-learned atomics with larger kurtosis values are selected and used to reconstruct the vibration signal, removing the other interfering components. Subsequently, the frequency band entropy method is used to analyze the reconstructed vibration signal, from which the optimal band-pass filter parameters are calculated. Finally, the reconstructed vibration signal is filtered with the optimal band-pass filter, envelope demodulation is applied to the filtered signal, and a clearer fault feature is extracted. The feasibility and effectiveness of the proposed method are verified using vibration data from an accelerated fatigue life test of a rolling bearing.
In addition, analysis results for the same vibration data obtained with the Autogram and spectral kurtosis methods are presented to highlight the superiority of the proposed method.
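The selection criterion in the pipeline above is kurtosis: impulsive, fault-carrying components have large fourth moments relative to their variance. The sketch below is a generic kurtosis estimator of the kind that could rank learned atoms, not the authors' code; the self-test values are invented examples.

```c
/* Generic kurtosis estimator, as used to rank learned atoms:
 * impulsive (fault-carrying) components score high, smooth ones low.
 * This is an illustration, not the article's implementation. */
#include <math.h>
#include <stddef.h>

static double kurtosis(const double *x, size_t n) {
    double mean = 0.0, m2 = 0.0, m4 = 0.0;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= (double)n;
    m4 /= (double)n;
    return m4 / (m2 * m2); /* 3 for Gaussian data; much larger for impulses */
}

/* self-test on invented signals: a square wave has kurtosis exactly 1,
 * while a single-impulse signal scores well above the Gaussian value 3 */
static int kurtosis_selftest(void) {
    double square[4]  = {1.0, -1.0, 1.0, -1.0};
    double impulse[8] = {8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    return kurtosis(square, 4) == 1.0 && kurtosis(impulse, 8) > 3.0;
}
```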
In recent years, due to their wide availability and ease of programming, GPUs have emerged as the accelerator of choice for a wide variety of applications, including graph analytics and machine learning training. These applications use atomics to update shared global variables. However, since GPUs do not support atomics efficiently, this limits scalability. We propose hardware-software co-design to address this bottleneck and improve scalability. At the software level, we leverage recently proposed extensions to the GPU memory consistency model to identify atomic updates whose ordering can be relaxed; in these algorithms, for example, the updates are commutative. At the hardware level, we propose a buffering mechanism that extends the reconfigurable local SRAM per SM. By buffering partial updates of these atomics locally, our design increases reuse, reduces atomic serialization cost, and minimizes overhead. Thus, our mechanism alleviates the impact of global atomic updates, improving performance by 28% and reducing energy consumption and network traffic by 19% each on average, and it outperforms hLRC and PHI.
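Why commutativity makes the buffering legal can be shown in plain code: if every update is an addition, partial sums can be accumulated privately and flushed with a single atomic, and the final value is identical. The sketch below is a software analogue of the idea (the proposal buffers in per-SM SRAM, in hardware); all names are invented.

```c
/* Software analogue of buffering commutative atomic updates: a private
 * accumulator flushed with one atomic add yields the same total as one
 * contended atomic per element. The hardware proposal does this in
 * per-SM SRAM; names here are invented for illustration. */
#include <stdatomic.h>
#include <stddef.h>

static _Atomic long global_sum;

/* naive: one global atomic per element */
static void add_per_element(const int *v, size_t n) {
    for (size_t i = 0; i < n; i++)
        atomic_fetch_add_explicit(&global_sum, v[i], memory_order_relaxed);
}

/* buffered: coalesce locally, flush once */
static void add_buffered(const int *v, size_t n) {
    long local = 0;
    for (size_t i = 0; i < n; i++) local += v[i]; /* no contention */
    atomic_fetch_add_explicit(&global_sum, local, memory_order_relaxed);
}

/* self-test: both strategies must produce the same total */
static long buffer_selftest(void) {
    int v[100];
    for (int i = 0; i < 100; i++) v[i] = i + 1;
    atomic_store(&global_sum, 0);
    add_per_element(v, 100);
    long a = atomic_load(&global_sum);
    atomic_store(&global_sum, 0);
    add_buffered(v, 100);
    long b = atomic_load(&global_sum);
    return (a == b) ? a : -1; /* 1 + 2 + ... + 100 = 5050 */
}
```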
In general, race conditions can be resolved by introducing synchronisation or by breaking data dependencies. Atomic operations and graph coloring are the two typical approaches to avoiding race conditions. Graph coloring algorithms have generally been considered the winning approach in the literature due to their lock-free implementations. In this paper, we present GPU-accelerated algorithms for the unstructured cell-centered finite volume Computational Fluid Dynamics (CFD) software framework PHengLEI, which was originally developed for aerodynamics applications with arbitrary hybrid meshes. Overall, the newly developed GPU framework demonstrates up to a 4.8× speedup compared with 18 MPI tasks run on the latest Intel CPU node. Furthermore, substantial effort has been invested in optimizing data dependencies that could lead to race conditions due to the indirect addressing of unstructured meshes and the related reduction operations. Through careful comparison between our optimised graph coloring and atomic operations using a series of numerical tests with different mesh sizes, the results show that atomic operations are more efficient than our optimised graph coloring in all of the test cases on the Nvidia Tesla V100 GPU. Specifically, for the summation operation, using atomicAdd is twice as fast as graph coloring. For the maximum operation, a speedup of 1.5–2× is found for atomicMax versus graph coloring.
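The race in question arises when face (or edge) loops scatter contributions to shared cells: two faces adjacent to the same cell may update it concurrently. On the GPU the paper resolves this with atomicAdd; the sketch below shows the same pattern with C11 atomics on an invented 1-D mesh, so the cell topology and names are illustrative only.

```c
/* The scatter pattern that causes races in cell-centered FV codes:
 * each face's flux updates its two adjacent cells, so faces sharing a
 * cell may collide. Atomic adds make the scatter safe (the GPU code
 * uses atomicAdd). The tiny 1-D mesh here is invented. */
#include <stdatomic.h>

#define NCELLS 4
#define NFACES 3

static _Atomic int residual[NCELLS];

/* face f sits between cells f and f+1; flux leaves left, enters right */
static void scatter_face(int f, int flux) {
    atomic_fetch_add_explicit(&residual[f],     -flux, memory_order_relaxed);
    atomic_fetch_add_explicit(&residual[f + 1],  flux, memory_order_relaxed);
}

/* self-test: residuals must be conservative (sum to zero) */
static int scatter_selftest(void) {
    int flux[NFACES] = {5, 3, 2};
    for (int c = 0; c < NCELLS; c++) atomic_store(&residual[c], 0);
    for (int f = 0; f < NFACES; f++) scatter_face(f, flux[f]);
    int sum = 0;
    for (int c = 0; c < NCELLS; c++) sum += atomic_load(&residual[c]);
    /* expected residual = {-5, 2, 1, 2} */
    return (sum == 0 && atomic_load(&residual[0]) == -5) ? 1 : 0;
}
```

Graph coloring avoids the atomics by partitioning faces so that no two faces of the same color share a cell, at the cost of multiple kernel launches; the paper's measurements favor the atomic version on the V100.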