The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
• We developed a performance portable programming model (PM) for manycore devices.
• Unifying parallel dispatch and data layout is mandatory for performance portability.
• The Kokkos C++ library implements this PM with pthreads, OpenMP, and CUDA back-ends.
• Demonstrate Xeon Phi and NVIDIA GPU performance portability with mini-applications.
• Recommend a strategy for legacy application codes to migrate to manycore.
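The back-end abstraction in the highlights above can be sketched in plain C++. This is a minimal illustration, not the Kokkos API: `Serial`, `HostThreads`, and `axpy` are hypothetical names introduced here, standing in for Kokkos' real dispatch (`Kokkos::parallel_for` over pthreads, OpenMP, or CUDA back-ends). The point is that the kernel is written once and the execution back-end is a compile-time policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in "device" back-ends. A real Kokkos build selects pthreads,
// OpenMP, or CUDA the same way: as a compile-time policy, without
// touching the kernel body.
struct Serial {
  template <class F>
  static void parallel_for(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
  }
};

struct HostThreads {
  template <class F>
  static void parallel_for(std::size_t n, F f) {
    const unsigned nt = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nt; ++t)
      pool.emplace_back([=] {
        // Strided partition of the iteration range across threads.
        for (std::size_t i = t; i < n; i += nt) f(i);
      });
    for (auto& th : pool) th.join();
  }
};

// A data-parallel kernel written once: y = a*x + y (axpy).
// The Device template parameter picks the back-end at compile time.
template <class Device>
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  Device::parallel_for(y.size(),
                       [&x, &y, a](std::size_t i) { y[i] += a * x[i]; });
}
```

Because both back-ends expose the same `parallel_for` signature, `axpy<Serial>` and `axpy<HostThreads>` compile from a single kernel definition; this is the "unified parallel dispatch" half of the model, with data layout handled separately.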
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
This paper presents the concept of using a computational framework for enabling rapid development of parallel adaptive multiphysics application codes. A computational framework supplies a software architecture along with a toolbox of advanced capabilities for the many mechanics-independent pieces of the software. These pieces include high-level concepts to support parallel communications, parallel transfer operators that support code coupling, and parallel mesh refinement and unrefinement services with dynamic load rebalancing. We describe these capabilities in the context of the SIERRA framework developed at Sandia National Laboratories. Numerical examples are given, demonstrating the use of framework services for developing a parallel coupled application and a parallel single-physics h-adaptive application.
• Scalability modifications for a radiation transport simulation problem on 16,384 GPUs.
• Kokkos portability modifications to efficiently execute small code loops on GPUs.
• Running a single portable radiation transport codebase on CPUs, GPUs, and Xeon Phis.
High performance computing frameworks utilizing CPUs, Nvidia GPUs, and/or Intel Xeon Phis necessitate portable and scalable solutions for application developers. Nvidia GPUs in particular present numerous portability challenges with a different programming model, additional memory hierarchies, and partitioned execution units among streaming multiprocessors. This work presents modifications to the Uintah asynchronous many-task runtime and the Kokkos portability library to enable a single codebase for complex multiphysics applications to run across different architectures. Scalability and performance results are shown on multiple architectures for a globally coupled radiation heat transfer simulation, ranging from a single node to 16,384 Titan compute nodes.
The Hawaii Undersea Military Munitions Assessment
Edwards, Margo H.; Shjegstad, Sonia M.; Wilkens, Roy ...
Deep-Sea Research Part II: Topical Studies in Oceanography, Volume 128, June 2016. Journal article, peer-reviewed.
The Hawaii Undersea Military Munitions Assessment (HUMMA) is the most comprehensive deep-water investigation undertaken by the United States to look at sea-disposed chemical and conventional munitions. HUMMA's primary scientific objective is to bound, characterize and assess a historic deep-water munitions sea-disposal site to determine the potential impact of the ocean environment on sea-disposed munitions and of sea-disposed munitions on the ocean environment and those that use it. Between 2007 and 2012 the HUMMA team conducted four field programs, collecting hundreds of square kilometers of acoustic data for high-resolution seafloor maps, tens of thousands of digital images, hundreds of hours of video of individual munitions, hundreds of physical samples acquired within two meters of munitions casings, and a suite of environmental data to characterize the ocean surrounding munitions in the study area. Using these data we examined six factors in the study area: (1) the spatial extent and distribution of munitions; (2) the integrity of munitions casings; (3) whether munitions constituents could be detected in sediment, seawater or animals near munitions; (4) whether constituent levels at munitions sites differed significantly from levels at reference control sites; (5) whether statistically significant differences in ecological population metrics could be detected between the two types of sites; and (6) whether munitions constituents or their derivatives potentially pose an unacceptable risk to human health. Herein we provide a general overview of HUMMA including overarching goals, methodologies, physical characteristics of the study area, data collected and general results. Detailed results, conclusions and recommendations for future research are discussed in the accompanying papers included in this volume.
A new generation of scientific and engineering applications are being developed to support multiple coupled physics, adaptive meshes, and scaling in massively parallel environments. The capabilities required to support multiphysics, adaptivity, and massively parallel execution are individually complex and are especially challenging to integrate within a single application. Sandia National Laboratories has managed this challenge by consolidating these complex physics-independent capabilities into the Sierra Framework, which is shared among a diverse set of application codes. The success of the Sierra Framework has been predicated on managing the integration of complex capabilities through a conceptual model based upon formal mathematical abstractions. Set theory is used to express and analyze the data structures, operations, and interactions of these complex capabilities. This mathematically based, conceptual modeling approach to managing complexity is not specific to the Sierra Framework; it is generally applicable to any scientific and engineering application framework.
Recent work using a heteroduplex tracking assay (HTA) to identify resident viral sequences has suggested that patients with infectious mononucleosis (IM) who are undergoing primary Epstein-Barr virus (EBV) infection frequently harbor different EBV strains. Here, we examine samples from patients with IM by use of a new Epstein-Barr nuclear antigen 2 HTA alongside the established latent membrane protein 1 HTA. Coresident allelic sequences were detected in ex vivo blood and throat wash samples from 13 of 14 patients with IM; most patients carried 2 or more type 1 strains, 1 patient carried 2 type 2 strains, and 1 patient carried both virus types. In contrast, coresident strains were detected in only 2 of 14 patients by in vitro B cell transformation, despite screening >20 isolates/patient. We infer that coacquisition of multiple strains is common in patients with IM, although only 1 strain tends to be rescued in vitro; whether nonrescued strains are present in low abundance or are transformation defective remains to be determined.
Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data parallel kernels, and (3) multidimensional arrays. Kernel execution performance, especially on NVIDIA® devices, is extremely dependent on data access patterns. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos (http://trilinos.sandia.gov/, August 2011).
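The separation of data access patterns from kernels can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the Kokkos API: `View2D`, `LayoutRight`, and `LayoutLeft` are hypothetical stand-ins for Kokkos' multidimensional array types, where the index-to-memory mapping is a compile-time template parameter and kernels written against `operator()` never see it.

```cpp
#include <cstddef>
#include <vector>

// Layout policies: map a 2D logical index (i,j) to a linear offset.
struct LayoutRight {  // row-major: unit stride in j, favors CPU caches
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
  }
};
struct LayoutLeft {   // column-major: unit stride in i, favors GPU coalescing
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;
  }
};

// A tiny View-like 2D array whose memory layout is chosen at compile time.
template <class Layout>
class View2D {
  std::vector<double> data_;
  std::size_t n0_, n1_;
 public:
  View2D(std::size_t n0, std::size_t n1)
      : data_(n0 * n1, 0.0), n0_(n0), n1_(n1) {}
  double& operator()(std::size_t i, std::size_t j) {
    return data_[Layout::offset(i, j, n0_, n1_)];
  }
  std::size_t extent0() const { return n0_; }
  std::size_t extent1() const { return n1_; }
};

// A kernel written once against the array API, independent of layout.
template <class Layout>
double sum_all(View2D<Layout>& a) {
  double total = 0.0;
  for (std::size_t i = 0; i < a.extent0(); ++i)
    for (std::size_t j = 0; j < a.extent1(); ++j)
      total += a(i, j);
  return total;
}
```

Swapping `LayoutRight` for `LayoutLeft` changes the physical data placement without touching `sum_all`; this compile-time polymorphism is how a device-specific access mapping can be introduced when the kernel is compiled.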
Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Trilinos-Kokkos array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) there exists one or more manycore compute devices, each with its own memory space, (2) data parallel kernels are executed via parallel_for and parallel_reduce operations, and (3) kernels operate on multidimensional arrays. Kernel execution performance, especially on NVIDIA® GPGPU devices, is extremely dependent on data access patterns. An optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Trilinos-Kokkos programming model supports performance-portable kernels by separating data access patterns from computational kernels through a multidimensional array API. Through this API, device-specific mappings of multi-indices to device memory are introduced into a computational kernel through compile-time polymorphism, i.e., without modification of the kernel.