Unprecedented climate change and anthropogenic activities have induced increasing ecohydrological problems, motivating the development of large‐scale hydrologic modeling for solutions. Water age/quality is as important as water quantity for understanding the terrestrial water cycle. However, scientific progress in tracking water parcels at large scales with high spatiotemporal resolution lags far behind that in simulating water balance/quantity, owing to the lack of powerful modeling tools. EcoSLIM is a particle tracking model that works with ParFlow‐CLM, which couples integrated surface‐subsurface hydrology with land surface processes. Here, we demonstrate a parallel framework on distributed, multi‐Graphics Processing Unit platforms with Compute Unified Device Architecture‐Aware Message Passing Interface for accelerating EcoSLIM to continental scale. In tests from catchment to regional and then to continental scale using 25 million to 1.6 billion particles, EcoSLIM shows significant speedup and excellent parallel performance. The parallel framework is portable to atmospheric and oceanic particle tracking models, where parallelization is inadequate and a standard parallel framework is also absent. The parallelized EcoSLIM is a promising tool to accelerate our understanding of the terrestrial water cycle and the upscaling of subsurface hydrology to Earth System Models.
Plain Language Summary
Studies of water ages at multiple spatiotemporal scales are urgently needed to better understand the connections between different hydrologic compartments, and climate change and anthropogenic activities make this need more pressing. Lagrangian particle tracking is a powerful tool for simulating water ages. However, it is computationally demanding, which hampers its wide application. In this study, we provide a Lagrangian particle tracking model, EcoSLIM, with a novel parallel framework that enables it to handle large‐scale water age simulations with high spatiotemporal resolution. This work combines the efforts of engineers and scientists from multiple disciplines and could not have been achieved with the knowledge of a single discipline. To the best of our knowledge, such a modeling tool has been absent from the hydrology and Earth Surface Processes communities. In tests from catchment to regional and then to continental scale using 25 million to 1.6 billion particles, EcoSLIM shows significant speedup and excellent parallel performance. Although we take EcoSLIM as an example here, the parallel framework is portable to other particle tracking models in Earth System Science, such as those in the atmospheric and oceanic disciplines. The parallelized EcoSLIM is a promising tool for the hydrologic community and Earth System Model developers in scientific exploration.
Key Points
Numerical models for large‐scale water age/quality simulations are absent from the hydrology and Earth Surface Processes communities
A parallel framework for accelerating Lagrangian particle tracking to continental‐scale on distributed, multi‐Graphics Processing Unit platforms is established
The parallelized particle tracking model, EcoSLIM, is a promising tool to accelerate our understanding of the terrestrial water cycle
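The Lagrangian advection-and-aging loop at the core of a water-age particle tracker can be illustrated with a minimal sketch. The velocity field, time step, and particle setup below are hypothetical stand-ins for illustration only; they are not EcoSLIM's actual interface, which is driven by ParFlow‐CLM fluxes.

```python
import numpy as np

def advect_particles(pos, age, velocity, dt, n_steps):
    """Move particles through a steady 2-D velocity field and age them.

    pos      : (N, 2) array of particle positions [m]
    age      : (N,) array of particle ages [days]
    velocity : function mapping (N, 2) positions to (N, 2) velocities [m/day]
    """
    for _ in range(n_steps):
        pos = pos + dt * velocity(pos)  # explicit Euler advection step
        age = age + dt                  # every particle ages by dt
    return pos, age

# Illustrative setup: 25 particles in a 100 m box under uniform flow.
rng = np.random.default_rng(0)
start = rng.uniform(0.0, 100.0, size=(25, 2))
age0 = np.zeros(25)
uniform_flow = lambda p: np.tile([1.0, 0.5], (len(p), 1))  # m/day
pos, age = advect_particles(start, age0, uniform_flow, dt=0.5, n_steps=10)
```

In a distributed multi-GPU setting, each rank would own the particles inside its subdomain and exchange particles that cross subdomain boundaries after each step, which is where CUDA-aware MPI avoids staging data through host memory.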
Real-time, accurate recommendation in large-scale recommender systems is a challenging task. Matrix factorization (MF), as one of the most accurate and scalable techniques for predicting missing ratings, has become popular in the collaborative filtering (CF) community. Currently, stochastic gradient descent (SGD) is one of the best-known approaches for MF. However, it is non-trivial to parallelize SGD for large-scale CF MF problems because of the dependence on the user and item pair, which can cause over-writing during parallelization. To remove the dependence on the user and item pair, we propose a multi-stream SGD (MSGD) approach whose update process is theoretically convergent. On that basis, we propose a Compute Unified Device Architecture (CUDA) parallelization of MSGD (CUMSGD). CUMSGD achieves high parallelism and scalability on Graphics Processing Units (GPUs). On Tesla K20m and K40c GPUs, the experimental results show that CUMSGD outperforms prior works that accelerated MF on shared-memory systems, e.g., DSGD, FPSGD, Hogwild!, and CCD++. For large-scale CF problems, we propose a multiple-GPU (multi-GPU) CUMSGD (MCUMSGD). The experimental results show that MCUMSGD can further improve MSGD performance. With a K20m GPU card, CUMSGD can be 5-10 times faster than the state-of-the-art approaches on shared-memory platforms.
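The serial SGD baseline that MSGD/CUMSGD parallelize can be sketched as follows. The rating data, rank, and hyperparameters are illustrative, not taken from the paper; the comment marks the coupled update that causes over-writing under naive parallelization.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=8, lr=0.05, reg=0.02, epochs=300):
    """Factor a sparse rating matrix R ~ P @ Q.T via plain serial SGD."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, rank))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, rank))  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - P[u] @ Q[i]  # prediction error on observed rating
            # The coupled update of (P[u], Q[i]) is the user-item-pair
            # dependence: two workers touching the same row over-write
            # each other unless updates are decoupled (as MSGD does).
            P[u] += lr * (e * Q[i] - reg * P[u])
            Q[i] += lr * (e * P[u] - reg * Q[i])
    return P, Q

# Tiny illustrative rating triples (user, item, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
P, Q = sgd_mf(ratings, n_users=2, n_items=3)
```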
Ti6Al4V is one of the most widely used ternary alloys in additive manufacturing. Its mechanical properties are largely determined by the prior β phase. However, simulation of the ternary alloy's solidified microstructure has been limited by the lack of available methods. In this study, we carried out two-dimensional (2D) and three-dimensional (3D) simulations based on a new multi-component (MC) phase-field (PF) model that extends Karma's binary PF theory. The correctness and accuracy of the MC model are validated by the Gibbs-Thomson relation at the dendrite tip. Equipped with multiple GPUs, a speedup ratio of more than 150 is obtained compared with a serial program running on one of the top CPUs, and simulations with a billion nodes can be completed within an acceptable time. Using this model as a benchmark, previous Ti6Al4V dendrite evolution simulations using a pseudo-binary PF model are investigated and found to artificially magnify the driving force and growth. 3D large-scale directional solidification simulations also shed light on dendrite merging, which leads to coarse primary dendrites. We believe that the GPU-accelerated MC PF framework provides accurate and efficient insight into the solidification evolution of Ti6Al4V.
Image moments are used to capture image features. Moments are successfully used in object description, recognition, and other applications. However, image representation and recognition using quaternion moments are compute-intensive processes. This obstacle makes this type of moment unsuitable for real-time or large-scale applications, despite their high accuracy. In this work, we expose the challenges of parallelizing quaternion moment computations and transform the sequential algorithm into a parallel-friendly counterpart. We propose a set of parallel implementations of quaternion moment image representations on different parallel architectures. The proposed implementations target multicore CPU-only, GPU-only, and hybrid CPU–GPU (multicore CPU with multi-GPU) environments. A loop mitigation technique is proposed to boost the level of parallelism in massively parallel environments, balance the parallel workload, and reduce both the space complexity and the synchronization overhead of the proposed implementations. The loop mitigation technique could be applicable to other applications with similar loop imbalances. Finally, experiments are performed to evaluate the proposed implementations. Applying a moment order of 60 to a color image of 1024×1024 pixels, the proposed implementation achieved, on four P100 GPUs and a CPU with 16 cores, speedups of 257× and 277× over the baseline performance on a single Intel Xeon E5-2609 CPU core for image reconstruction and quaternion moment computation, respectively. In addition, a 180× speedup is achieved for a color image watermarking application.
•We proposed an accurate parallel implementation of the QPCET moment.
•We proposed an optimized prediction of the load balancing of heterogeneous PUs.
•We proposed a loop mitigation technique that boosts the level of parallelism.
•We proposed multicore CPU, GPU, and CPU–GPU implementations of the QPCET moment.
A* search is a best-first search algorithm that is widely used in pathfinding and graph traversal. To meet the ever-increasing demand for performance, various high-performance architectures (e.g., multi-core CPU and GPU) have been explored to accelerate A* search. However, current GPU-based A* search approaches are designed only for single-GPU architectures. Nowadays, the amount of data grows at an exponential rate, making it inefficient or even infeasible for current A* implementations to process entire data sets on a single GPU.
In this paper, we propose DA*, a parallel A* search algorithm based on the multi-GPU architecture. DA* enables efficient acceleration of the A* algorithm using multiple GPUs with effective graph partitioning and data communication strategies. To make the most of the parallelism of the multi-GPU architecture, in the state extension phase we adopt multiple priority queues for the open list, which allows multiple states to be calculated in parallel. In addition, we use parallel hashing of replacement and a frontier search mechanism to address node duplication detection and memory bottlenecks, respectively. The evaluation shows that DA* is effective and efficient in accelerating A*-based computational tasks on multi-GPU systems. Compared to the state-of-the-art A* search algorithm based on a single GPU, our algorithm achieves up to 3× speedup with four GPUs.
•The current A* algorithms based on single-GPU architecture fail to coordinate multiple GPUs for efficient acceleration.
•We present DA*, a parallel A* search algorithm based on the multi-GPU architecture.
•DA* employs different graph partitioning and data communication strategies depending on the graph type.
•We adapt the execution flow of A* to the multi-GPU architecture through a set of novel techniques, including multiple priority queues, parallel hashing of replacement, and a frontier search mechanism.
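The multiple-priority-queue idea behind the state extension phase can be sketched serially: states are spread across k open lists so that, on a GPU, the best state of each list could be expanded in parallel each round. The grid, heuristic, and hash-based queue assignment below are illustrative assumptions, not DA*'s actual implementation (which also includes parallel hashing of replacement and frontier search).

```python
import heapq

def multi_queue_astar(grid, start, goal, k=4):
    """A* on a unit-cost grid with k open lists popped round-robin."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    queues = [[] for _ in range(k)]
    heapq.heappush(queues[0], (h(start), 0, start))
    best_g = {start: 0}  # cheapest known cost per node (duplicate detection)
    while any(queues):
        # One "parallel" round: pop the best state from every non-empty queue.
        for f, g, node in [heapq.heappop(q) for q in queues if q]:
            if node == goal:
                return g
            x, y = node
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt in grid and g + 1 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g + 1
                    # Hash-distribute successors over the k queues.
                    qi = hash(nxt) % k
                    heapq.heappush(queues[qi], (g + 1 + h(nxt), g + 1, nxt))
    return None  # goal unreachable

grid = {(x, y) for x in range(5) for y in range(5)}  # open 5x5 grid
dist = multi_queue_astar(grid, (0, 0), (4, 4))
```

Note the trade-off this sketch exposes: popping k states per round gives parallelism but weakens strict best-first ordering, which is why duplicate detection against `best_g` matters.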
Based on the multi-GPU lattice Boltzmann method with the half-way bounce-back scheme, fully developed turbulent duct flows at friction Reynolds numbers Reτ of 300, 600, 1,200, 1,500, 1,800, and 2,000 were simulated. The parallel performance of the multi-GPU lattice Boltzmann simulations reaches 300.162 GLUPS using 1.57 billion grid points with 384 GPUs. The simulated friction factor f was consistent with other DNS and experimental results, as well as the Karman–Prandtl theoretical friction law, which verified that a grid resolution of Δ+≤3.3 is sufficient and that the LBGK model is stable for Δ+≤5 at high Reynolds numbers. The secondary flows were successfully captured, and turbulence statistics of root-mean-square (r.m.s.) velocity and Reynolds stress were analyzed. The two-point velocity correlation functions and turbulent energy spectra at different positions showed that secondary flows in the near-corner region changed the spatial turbulence distribution. Multi-GPU lattice Boltzmann simulations at large grid scales can handle turbulent square duct flows at high Reynolds numbers and show promise for high-fidelity, scale-resolving fluid dynamics.
•Turbulent duct flows are simulated with the multi-GPU lattice Boltzmann method.
•The simulated friction factor f is close to the value of the Prandtl–Karman friction law.
•The parallel performance with 1.57 billion grid points on 384 GPUs is 300.162 GLUPS.
•Turbulence and secondary flow in square duct flow are analyzed up to Reτ = 2,000.
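The GLUPS figure quoted above is the standard lattice Boltzmann throughput metric: lattice-node updates per second, in billions. The node count below matches the abstract, while the step count and wall time are illustrative values chosen only to land near the quoted rate.

```python
def glups(n_nodes, n_steps, wall_seconds):
    """Giga lattice updates per second: nodes * steps / (time * 1e9)."""
    return n_nodes * n_steps / (wall_seconds * 1e9)

# Illustrative: 1.57 billion nodes advanced 10,000 time steps in ~52.3 s
# of wall time would correspond to roughly 300 GLUPS.
rate = glups(1.57e9, 10_000, 52.3)
```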
As progress in electronic structure theoretical methods is made, ab initio molecular dynamics (MD) based on orbital-free density functional theory (OF-DFT) is becoming increasingly successful at substituting for the traditional, very accurate but computationally costly Kohn-Sham (KS) approach in simulations of matter at the challenging warm dense matter (WDM) regime. However, despite the significant cost alleviation of eliminating the dependence on the KS orbitals, OF-DFT MD runs require ∼10² to 10³ CPU cores running for days, or even weeks, for simulations of systems composed of 10² to 10³ atoms, depending on thermodynamic conditions. We present Dragon, a multi-GPU OF-DFT MD code for fast and efficient simulations of WDM. With a relatively small allocation of resources (4 to 8 GPU devices), it can provide an order-of-magnitude speedup for simulations containing O(10⁴) atoms and can target systems composed of O(10⁵) atoms at conditions within the WDM regime, which is currently outside the capabilities of CPU codes.
We present an efficient computational approach for simulating component transport within single-phase free flow in the boundary layer over porous media. A numerical model based on this approach is validated using experimental data generated in a climate-controlled wind tunnel coupled with a soil test bed. The developed modeling approach is based on a combination of the lattice Boltzmann method (LBM) for simulating the fluid flow and the mixed-hybrid finite element method (MHFEM) for solving constituent transport. Both methods, individually as well as coupled, are implemented entirely on a GPU accelerator in order to utilize its computational power and avoid the hardware limitations caused by slow communication between the GPU and CPU over the PCI-E bus. In order to utilize the vast computational resources available on modern supercomputers, the implementation is extended to distributed multi-GPU computations based on domain decomposition and the Message Passing Interface (MPI). We describe the mathematical details behind the computational method, focusing primarily on the coupling mechanisms. The performance of the solver is demonstrated on a modern high-performance computing system. Flow and transport simulation results are validated and compared herein with experimental velocity and relative humidity measurements made above a flat, partially saturated soil layer exposed to steady air flow. Model robustness and flexibility are demonstrated by introducing cuboidal bluff bodies to the flow in several different experimental scenarios. The experimentally measured values are available in a public dataset that can serve as a benchmark for future studies. Finally, we discuss potential improvements to the model as well as future experimental efforts.