The ability to perform thermal simulations of entire additive manufacturing builds is a key computational problem facing the additive manufacturing community; however, complex numerical models considering multiple physical phenomena currently lack the capacity for simulations at this scale. To this end, conduction-only analytic models offer a viable approach due to their massive drop in computational expense. Here, we extend an existing implementation whose governing equation can be evaluated at any point in space and time. This implementation already utilizes OpenMP with a spatial decomposition scheme stemming from a melt pool tracking algorithm. We then combine this with a parallel-in-time (PinT) approach to make the problem highly parallelizable. The new scheme, which uses MPI for internode communication and OpenMP for intranode communication, is shown to scale very well across multiple computational nodes. As a result, the 3D solidification conditions for entire layers of additively manufactured parts can be simulated in minutes, making part-scale thermal simulations far more practical.
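The key property exploited above is that an analytic conduction solution can be evaluated independently at every space–time query point, so the time axis can be split across ranks with no serial stepping. The sketch below illustrates this with a standard instantaneous point-source Green's function; the kernel, the parameter values, and the helper names are illustrative stand-ins, not the paper's actual model or code, and the MPI/OpenMP machinery is reduced to a plain partitioning function:

```python
import numpy as np

def temperature(points, t, source_path, dt, T0=300.0,
                alpha=8e-6, rho_c=4.5e6, q=150.0):
    """Conduction-only analytic temperature at query `points` and time `t`:
    superposition of instantaneous point heat sources deposited along the
    scan path. All material/laser parameters are hypothetical (SI units)."""
    T = np.full(len(points), T0)
    for i, src in enumerate(source_path):
        tau = t - i * dt              # elapsed time since source i fired
        if tau <= 0:
            break                     # later sources have not fired yet
        r2 = np.sum((points - src) ** 2, axis=1)
        T += (q * dt / (rho_c * (4 * np.pi * alpha * tau) ** 1.5)
              * np.exp(-r2 / (4 * alpha * tau)))
    return T

def pint_partition(n_times, n_ranks):
    """Parallel-in-time decomposition: since the analytic solution is
    evaluated independently at each output time, ranks simply take
    contiguous blocks of time indices (no serial time-stepping)."""
    counts = [n_times // n_ranks + (r < n_times % n_ranks)
              for r in range(n_ranks)]
    starts = np.cumsum([0] + counts[:-1])
    return [range(s, s + c) for s, c in zip(starts, counts)]
```

In an MPI setting each rank would call `temperature` only for its own block from `pint_partition`, with OpenMP threads splitting the spatial points within the block.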
Many applications in high-performance computing are designed around underlying performance and execution models. While these models could be employed successfully in the past for balancing load within and between compute nodes, modern software and hardware increasingly make performance prediction difficult, if not impossible. Consequently, balancing computational load becomes much harder. Aiming to tackle these challenges in search of a general solution, we present a novel library for fine-granular, task-based reactive load balancing in distributed memory, based on MPI and OpenMP. With our approach, individual migratable tasks can be executed on any MPI rank; the actual executing rank is determined at run time based on online performance data. We evaluate our approach under an enforced power cap and under enforced clock-frequency changes for a synthetic benchmark, and show its robustness to work-induced imbalances for a realistic application. Our experiments demonstrate speedups of up to 1.31×.
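The reactive idea (decide at run time, from measured performance, which rank executes a task) can be illustrated with a toy makespan-driven rule. The greedy policy below and its inputs (per-rank task counts and online speed estimates) are our own sketch for intuition, not the library's actual migration algorithm:

```python
def plan_migrations(load, speed):
    """Reactive balancing sketch: given each rank's current task count and an
    online speed estimate (tasks/s), repeatedly move one task from the rank
    with the highest predicted completion time to the rank with the lowest,
    while that strictly lowers the predicted makespan."""
    load = list(load)
    moves = []
    while True:
        t = [l / s for l, s in zip(load, speed)]  # predicted completion times
        src = max(range(len(t)), key=t.__getitem__)
        dst = min(range(len(t)), key=t.__getitem__)
        if load[src] == 0:
            break
        after = max((load[src] - 1) / speed[src],
                    (load[dst] + 1) / speed[dst])
        if after >= t[src]:           # migration would not help any more
            break
        load[src] -= 1
        load[dst] += 1
        moves.append((src, dst))
    return load, moves
```

For example, `plan_migrations([8, 2, 2], [1.0, 1.0, 1.0])` evens the load out to `[4, 4, 4]`, while a rank running 3× faster (e.g. not under a power cap) ends up with proportionally more tasks.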
• Increasing dynamic variability observable in modern hardware and software.
• Performance prediction and load balancing difficult or even impossible.
• Novel library for fine-granular task-based load balancing in hybrid MPI+OpenMP codes.
• Reactive task migration concept based on online performance data.
• Evaluation shows performance improvements for hardware and work-induced imbalances.
► We study the hybrid MPI+OpenMP approach to programming multi-core parallel systems.
► The hybrid approach is compared with pure MPI using benchmarks and full applications.
► Case studies show advantages and issues of the approach on modern parallel systems.
► We propose new extensions to OpenMP to better handle data locality on NUMA systems.
The rapidly increasing number of cores in modern microprocessors is pushing current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems – distributed memory across nodes and shared memory with non-uniform memory access within each node – poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems – a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core-based systems, including an SGI Altix 4700, an IBM p575+, and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.
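The two-level structure the hybrid model implies (MPI across nodes, OpenMP threads within a node) can be sketched as plain index arithmetic. `hybrid_partition` is a hypothetical helper, not from the paper; it only shows how a global index range maps onto ranks and then threads:

```python
def hybrid_partition(n, n_ranks, n_threads):
    """Two-level decomposition mirroring the hybrid model: MPI splits the
    global index range across ranks (distributed memory), then each rank's
    block is subdivided among its OpenMP threads (shared memory). Under
    first-touch NUMA placement, each thread should also be the one that
    initializes the data in its sub-block."""
    def split(lo, hi, parts):
        size = hi - lo
        return [(lo + p * size // parts, lo + (p + 1) * size // parts)
                for p in range(parts)]
    return [split(lo, hi, n_threads) for lo, hi in split(0, n, n_ranks)]
```

For instance, `hybrid_partition(100, 2, 2)` yields `[[(0, 25), (25, 50)], [(50, 75), (75, 100)]]`: two rank blocks, each cut into two thread sub-blocks.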
Magnetic Resonance Imaging (MRI) is a highly demanded medical imaging modality due to its high resolution, large volumetric coverage, and ability to capture the dynamic and functional information of body organs; e.g., cardiac MRI is employed to assess cardiac structure and evaluate blood-flow dynamics through the cardiac valves. Long scan time is the main drawback of MRI, which makes it difficult for patients to remain still during the scanning process.
By collecting fewer measurements, MRI scan time can be shortened, but this undersampling causes aliasing artifacts in the reconstructed images. Advanced image reconstruction algorithms have been used in the literature to overcome these undersampling artifacts. These algorithms are computationally expensive and require a long reconstruction time, which makes them infeasible for real-time clinical applications, e.g., cardiac MRI. However, exploiting the inherent parallelism in these algorithms can help reduce their computation time.
The low-rank plus sparse (L+S) matrix decomposition model is a technique used in the literature to reconstruct highly undersampled dynamic MRI (dMRI) data, at the expense of long reconstruction times. In this paper, a Compressed Singular Value Decomposition (cSVD) is used in the L+S decomposition model (instead of the conventional SVD) to reduce the reconstruction time; the results show improved quality of the reconstructed images. Furthermore, it has been observed that cSVD and other parts of the L+S model consist of highly parallel operations; therefore, a customized GPU-based parallel architecture of the modified L+S model is presented to further reduce the reconstruction time.
Four cardiac MRI datasets (three cardiac perfusion datasets acquired from different patients and one cardiac cine dataset), each undersampled at acceleration factors of 2, 6, and 8, are used for the experiments in this paper. Experimental results demonstrate that the proposed parallel architecture for the reconstruction of cardiac perfusion data provides a speed-up factor of up to 19.15× (with memory latency) and 70.55× (without memory latency) in comparison to conventional CPU reconstruction, with no compromise on image quality.
The proposed method is well-suited for real-time clinical applications, offering a substantial reduction in reconstruction time.
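As a rough illustration of the ingredients named above, the toy sketch below combines a randomized ("compressed") SVD with an ISTA-style L+S split on a plain matrix. It is a minimal sketch under strong assumptions: the real reconstruction also applies the undersampled Fourier encoding operator and a temporal sparsifying transform, and `rand_svd`, `l_plus_s`, and all parameters here are hypothetical, not the paper's cSVD or its GPU kernels:

```python
import numpy as np

def rand_svd(M, k, p=8, rng=None):
    """Randomized truncated SVD: project onto a random (k+p)-dimensional
    subspace, then take the exact SVD of the small projected matrix.
    The O(mnk) cost replaces the O(mn^2) full SVD inside each iteration."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(M @ rng.standard_normal((M.shape[1], k + p)))
    U, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    return (Q @ U)[:, :k], s[:k], Vt[:k]

def l_plus_s(M, k=2, lam=0.05, n_iter=50):
    """Toy L+S iteration: alternately refit the low-rank part L by a
    truncated (randomized) SVD of M - S, and the sparse part S by
    soft-thresholding the residual M - L."""
    S = np.zeros_like(M)
    for _ in range(n_iter):
        U, s, Vt = rand_svd(M - S, k, rng=0)
        L = (U * s) @ Vt                          # rank-k background
        R = M - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)  # sparse dynamics
    return L, S
```

In the dMRI setting, L captures the slowly varying background across frames while S captures the sparse dynamic component; both the randomized projection and the elementwise thresholding are embarrassingly parallel, which is what the GPU architecture exploits.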
An upgraded version of the Particle and Heavy Ion Transport code System, PHITS2.52, was developed and released to the public. The new version has been greatly improved over the previously released version, PHITS2.24, in terms of not only the code itself but also the contents of its package, such as the attached data libraries. In the new version, higher simulation accuracy was achieved by implementing several of the latest nuclear reaction models. The reliability of the simulation was improved by modifying both the algorithms for the electron-, positron-, and photon-transport simulations and the procedure for calculating the statistical uncertainties of the tally results. Estimation of the time evolution of radioactivity became feasible by incorporating the activation calculation program DCHAIN-SP into the new package. The efficiency of the simulation was also improved as a result of the implementation of shared-memory parallelization and the optimization of several time-consuming algorithms. Furthermore, a number of new user-support tools and functions that help users perform PHITS simulations intuitively and effectively were developed and incorporated. Owing to these improvements, PHITS is now a more powerful tool for particle transport simulation, applicable to various research and development fields such as nuclear technology, accelerator design, medical physics, and cosmic-ray research.
This work presents our efforts to implement an MPI/OpenMP hybrid parallelization of the LIGGGHTS open-source software package for Discrete Element Methods (DEM). We outline the problems encountered and the solutions implemented to achieve scalable performance using both parallelization models. Three case studies, including two real-world applications with up to 1.5 million particles, were evaluated and demonstrate the practicality of this approach. In these examples, better load balancing and reduced MPI communication led to speed increases of up to 44% compared to MPI-only simulations.
• An MPI/OpenMP hybrid parallelization was implemented in the LIGGGHTS DEM code.
• We present three case studies to illustrate strengths and weaknesses of the approach.
• The hybrid parallelization simplifies load balancing and reduces MPI communication.
• MPI-only simulations are matched or outperformed by up to 44% in typical test cases.
This work describes OpenMP, MPI, and hybrid OpenMP/MPI parallelization strategies for an implicit three-dimensional (3D) direct discontinuous Galerkin (DDG) solver for the Navier–Stokes equations. Significantly, an efficient local-matrix-based MPI parallelization strategy for the implicit DG solver is proposed; implementations of the OpenMP and hybrid OpenMP/MPI strategies are given for comparison. Because the storage structure is based on local matrices, the MPI parallelization can use local numbering for access and assignment, the storage is compact and compatible with the Block Sparse Row (BSR) format, and the program is easier to modularize. Several numerical tests on the 3D Navier–Stokes equations demonstrate the performance of the parallelization strategies. For a problem with more than 200 million degrees of freedom, the designed pure MPI strategy for the 3rd-order DDG solver with 2nd-order polynomials (DDG(P2)) achieves a parallel efficiency of almost 90% on nearly ten thousand cores of the Tianhe-2 supercomputer. In particular, the pure MPI parallelization based on local matrices reaches a higher parallel efficiency than the hybrid OpenMP/MPI parallelization.
• We give OpenMP, MPI, and hybrid OpenMP/MPI parallel strategies for 3D DDG solvers.
• We propose a local-matrix-based MPI strategy for an implicit DDG solver.
• The strategy offers compactness, high efficiency, convenience, and modularity.
• The pure MPI strategy achieves the highest parallel efficiency in the comparison.
• An MPI DDG(P2) solver achieves a parallel efficiency of 89.5% at 9216 cores on Tianhe-2.
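The Block Sparse Row layout mentioned above can be illustrated with a minimal block matrix-vector product. `BSRMatrix` is our own sketch, not the solver's data structure; it only shows why the format suits a DG discretization, where each dense b×b block couples one element with itself or a face neighbour, and each MPI rank stores just the block rows of its local elements under local numbering:

```python
import numpy as np

class BSRMatrix:
    """Minimal Block Sparse Row container (hypothetical): one list of
    (block_column, dense b*b block) pairs per block row."""
    def __init__(self, n_block_rows, b):
        self.b = b
        self.rows = [[] for _ in range(n_block_rows)]

    def add_block(self, i, j, block):
        # In a DG solver this would be the local element matrix coupling
        # element i with itself (i == j) or with a face neighbour j.
        self.rows[i].append((j, np.asarray(block, dtype=float)))

    def matvec(self, x):
        """y = A x, one dense b*b block at a time: the block is both the
        natural assembly unit and a cache-friendly compute kernel."""
        b = self.b
        y = np.zeros(len(self.rows) * b)
        for i, row in enumerate(self.rows):
            for j, blk in row:
                y[i*b:(i+1)*b] += blk @ x[j*b:(j+1)*b]
        return y
```

With the rows owned by a rank stored contiguously in local numbering, the matvec above needs no global index translation; only the halo entries of `x` belonging to neighbour ranks must be exchanged via MPI before each product.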