The kth smallest dissimilarity of a query point with respect to a given set is the dissimilarity that ranks kth when the dissimilarity values between the query point and the points in the set are sorted in increasing order. A multiple kth smallest dissimilarity query determines the kth smallest dissimilarity for several query points simultaneously. Although solving multiple kth smallest dissimilarity queries is an important primitive operation in many areas, such as spatial data analysis, facility location, text classification and content-based image retrieval, it has not previously been addressed explicitly in the literature. In this paper we present three parallel strategies, to be run on a Graphics Processing Unit, for computing multiple kth smallest dissimilarity queries when non-metric dissimilarities, which do not satisfy the triangle inequality, are used. The strategies are analyzed theoretically and experimentally, and compared with one another and with an efficient sequential strategy for the same problem.
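The definition above can be made concrete with a small sequential sketch. It is not one of the three GPU strategies from the paper; the function names and the choice of squared Euclidean distance (a dissimilarity that violates the triangle inequality) are assumptions made purely for illustration.

```python
import heapq


def squared_euclidean(a, b):
    # Squared Euclidean distance: a non-metric dissimilarity,
    # e.g. d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2 on the real line.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kth_smallest_dissimilarity(query, points, k, dissim):
    """Return the k-th smallest dissim(query, p) over p in points (k is 1-based)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # nsmallest returns the k smallest values in increasing order,
    # so the last of them is the value of rank k.
    return heapq.nsmallest(k, (dissim(query, p) for p in points))[-1]


def multiple_kth_smallest(queries, points, k, dissim):
    """Answer the same k-th smallest query for every query point."""
    return [kth_smallest_dissimilarity(q, points, k, dissim) for q in queries]


points = [(0, 0), (1, 0), (3, 0), (6, 0)]
queries = [(0, 0), (5, 0)]
result = multiple_kth_smallest(queries, points, 2, squared_euclidean)
```

Each query is independent of the others, which is what makes the multiple-query version a natural candidate for parallel execution.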
•Multi-GPU parallelisation of scalable CG and multigrid solvers for anisotropic PDEs.
•Efficient matrix-free CUDA implementation minimises global memory access on GPUs.
•Excellent weak scaling on up to 16,384 nVidia K20X cards (44 million cores, Titan, OLCF).
•PDEs with 5 × 10^11 unknowns can be solved in 1 s to an accuracy of 10^-5.
•Achieved performance of 0.78 PFLOPs and memory bandwidth utilisation of more than 40%.
Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions (O(10^12)) of unknowns the code has to make efficient use of several million individual processor cores on large GPU clusters.
We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. We achieve a performance of up to 0.78 PFLOPs when solving an equation with 0.55 · 10^12 unknowns on 16,384 GPUs; this corresponds to about 3% of the theoretical peak performance of the machine, and we use more than 40% of the peak memory bandwidth with a Conjugate Gradient (CG) solver. Although the other solver, a geometric multigrid algorithm, has a slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.
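As a rough illustration of what "matrix-free" means for a CG solver, the sketch below applies a stencil directly instead of storing a matrix. It is a toy under stated assumptions, not the authors' CUDA code: the operator (identity plus `alpha` times a 1D negative Laplacian with zero boundary values), the parameter `alpha`, and all function names are hypothetical stand-ins for the anisotropic operator discussed in the abstract.

```python
def apply_operator(u, alpha=1.0):
    """Matrix-free y = A u with A = I + alpha * (1D negative Laplacian).

    The stencil is evaluated on the fly; no matrix entries are stored.
    Values outside the domain are treated as zero (assumed boundary condition).
    """
    n = len(u)
    y = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        y[i] = u[i] + alpha * (2.0 * u[i] - left - right)
    return y


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite operator apply_a."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    if rs_old == 0.0:
        return x, 0          # b = 0, so x = 0 is already the solution
    b_norm = rs_old ** 0.5
    for iteration in range(1, max_iter + 1):
        ap = apply_a(p)
        step = rs_old / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 <= tol * b_norm:   # relative residual reduction
            return x, iteration
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


x_true = [1.0] * 20
rhs = apply_operator(x_true)
x_cg, iterations = conjugate_gradient(apply_operator, rhs)
max_error = max(abs(a - b) for a, b in zip(x_cg, x_true))
```

Because only the stencil application touches the solution vector, the memory traffic per iteration is a handful of vector reads and writes, which is the property a memory-bound GPU implementation tries to exploit.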
In this paper, a hybrid full-wave analysis of surface acoustic wave (SAW) devices is proposed to achieve accurate and fast simulation. Partial differential equation (PDE) models of the physical system in question and graphics processing unit (GPU)-assisted hierarchical cascading technology (HCT) are used to calculate the acoustic-electric characteristics of a SAW filter. A practical solid model of the radio frequency (RF) filter package is constructed in High Frequency Structure Simulator (HFSS) software, and the parasitic electromagnetics of the entire package are considered in the design process. The PDE-based models of the two-dimensional finite element method (2D-FEM) are derived in detail and solved by the PDE module embedded in COMSOL Multiphysics. PDE-based 2D-FEM is universal and efficient, and it can handle arbitrary materials and crystal cuts, electrode shapes, and multi-layered substrates. COMSOL Multiphysics, with its user-friendly interface and flexible modeling and mesh generation, greatly simplifies the process of modeling and defining physical properties. Based on the hybrid full-wave analysis, we present an example application of this approach to a TC-SAW ladder filter with a 5° YX-cut LiNbO3 substrate. Numerical results were compared with measurements, and the accuracy and efficiency of the proposed method were verified.
Optical Burst Switching (OBS) is a promising technology for the next generation of Transparent Optical Networks (TON). However, many scientific challenges remain to be overcome, such as the problem of Burst Routing and Wavelength Assignment (BRWA) with several conflicting objectives and constraints. In this paper, we first formulate the BRWA as a Multi Objective Integer Linear Programming (MO-ILP) optimization problem.
In the formulated problem, the proposed BRWA policy will satisfy several constraints in order to guarantee a high-speed management of processes, required by the transparent optical traffic. Then, since the obtained ILP problem contains a large number of optical constraints and conflicting objectives, we propose to use an exact parallel Neural Hierarchical (epNH) MO-ILP solution with Graphics Processing Unit (GPU) parallel implementation using Compute Unified Device Architecture (CUDA). This also allows doing a concurrent search for multiple solutions, reducing processing cost, making hybrid interfaces to other search techniques, and achieving better overall effectiveness.
In addition, our architecture based on Artificial Neural Networks (ANN) allows flexibility and scalability. The processing time remains fixed regardless of the input size. Our BRWA GPU-based epNH MO-ILP solver is based on the joint use of advanced MO-ILP optimization methods, ANN large-scale inherent parallelism and CUDA-GPU High-Performance Computing (HPC) architecture.
This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton-Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512-2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers.
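To give a feel for the active-list idea behind the fast iterative method, here is a minimal sequential sketch on a regular 2D grid, i.e. the setting of the earlier regular-grid work rather than the triangular-mesh scheme proposed in this paper. The first-order upwind local update, the constant `speed`, the tolerance `eps`, and all function names are assumptions chosen for illustration.

```python
import math

INF = math.inf


def neighbors(i, j, rows, cols):
    """Yield the 4-connected grid neighbours of node (i, j)."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < rows and 0 <= nj < cols:
            yield ni, nj


def local_update(u, i, j, h, speed):
    """First-order upwind solve of |grad u| = 1 / speed at node (i, j)."""
    rows, cols = len(u), len(u[0])
    a = min(u[i - 1][j] if i > 0 else INF, u[i + 1][j] if i < rows - 1 else INF)
    b = min(u[i][j - 1] if j > 0 else INF, u[i][j + 1] if j < cols - 1 else INF)
    lo, hi = min(a, b), max(a, b)
    if lo == INF:
        return INF                       # no neighbour carries information yet
    f = h / speed
    if hi - lo >= f:
        return lo + f                    # only the smaller neighbour contributes
    return 0.5 * (a + b + math.sqrt(2.0 * f * f - (a - b) ** 2))


def fast_iterative_method(rows, cols, sources, h=1.0, speed=1.0, eps=1e-9):
    """Sequential active-list iteration; sources is a list of (i, j) nodes."""
    u = [[INF] * cols for _ in range(rows)]
    for i, j in sources:
        u[i][j] = 0.0
    active = {nb for s in sources for nb in neighbors(*s, rows, cols)} - set(sources)
    while active:
        for i, j in list(active):        # snapshot: the set changes in the loop
            old = u[i][j]
            new = min(old, local_update(u, i, j, h, speed))
            u[i][j] = new
            if abs(old - new) < eps:     # node converged: wake up its neighbours
                for ni, nj in neighbors(i, j, rows, cols):
                    if (ni, nj) not in active:
                        candidate = local_update(u, ni, nj, h, speed)
                        if candidate < u[ni][nj]:
                            u[ni][nj] = candidate
                            active.add((ni, nj))
                active.discard((i, j))
    return u


line = fast_iterative_method(1, 5, [(0, 0)])
grid = fast_iterative_method(4, 4, [(0, 0)])
```

In a parallel setting every node in the active list would be updated concurrently; the loop over the snapshot above stands in for that step.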
In this paper, we describe our experiment developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer system and the largest GPU-accelerated system attempted to date. An adaptive optimization framework is presented to balance the workload distribution across the GPUs and CPUs with negligible runtime overhead, resulting in better performance than the static or the training partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack we optimized using the adaptive optimization framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result using the vendor's library. On the full configuration of TianHe-1 our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
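One simple way to picture adaptive CPU/GPU load balancing is to re-estimate each device's throughput from the previous step and split the next step's work proportionally. The sketch below does only that; it is an assumed toy model with hypothetical names and made-up rates, not the adaptive framework described in the abstract.

```python
def update_gpu_fraction(gpu_work, gpu_time, cpu_work, cpu_time):
    """Return the share of work to give the GPU in the next step.

    Throughput is estimated as work / time from the last step, and the split
    is chosen so that both devices would finish at the same moment.
    Assumes both devices received some work and took a positive time.
    """
    gpu_rate = gpu_work / gpu_time
    cpu_rate = cpu_work / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(total_work, true_gpu_rate, true_cpu_rate, steps=5, fraction=0.5):
    """Simulate repeated steps on devices with fixed (hypothetical) rates."""
    history = [fraction]
    for _ in range(steps):
        gpu_work = total_work * fraction
        cpu_work = total_work - gpu_work
        gpu_time = gpu_work / true_gpu_rate   # stands in for a measured time
        cpu_time = cpu_work / true_cpu_rate
        fraction = update_gpu_fraction(gpu_work, gpu_time, cpu_work, cpu_time)
        history.append(fraction)
    return history


# Hypothetical device three times faster than the host: the split settles at 0.75.
history = simulate(total_work=1000.0, true_gpu_rate=3.0, true_cpu_rate=1.0)
```

With real hardware the measured rates vary with problem size and step, which is why an adaptive scheme keeps re-measuring instead of fixing the split once.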
We embed special function units (SFUs) in homogeneous stream processors (SPs) within a graphics processing unit (GPU), to improve its performance in running modern programmable shaders, which make poor use of a single-instruction multiple-data (SIMD) architecture. We also compact instructions, so as to reduce the size of the instruction memory, and reduce area requirements by using a partial SFU in SPs, and a lookup table which is shared between multiple SFUs. The result is an increase of 88% in utilization and a reduction in the normalized area-delay product of 27%, compared to a baseline SIMD architecture. We verified our architecture on a field-programmable gate-array evaluation platform with an ARM9 host processor and a full 3-D graphics pipeline.
Medical ultrasound imaging stands out from other modalities in providing real-time diagnostic capability at an affordable price while being physically portable. This article explores the suitability of using GPUs as the primary signal and image processors for future medical ultrasound imaging systems. A case study on synthetic aperture (SA) imaging illustrates the promise of using high-performance GPUs in such systems.
Characteristic mode analysis (CMA) is used in the design and analysis of a wide range of electromagnetic devices such as antennas and nanostructures. The implementation of CMA involves the evaluation of a large method of moments (MoM) complex impedance matrix at every frequency. In this work, we use different open-source software for the GPU acceleration of the CMA. This open-source software comprises a wide range of computer science numerical and machine learning libraries not typically used for electromagnetic applications. Specifically, this paper shows how these different Python-based libraries can optimize the computational time of the matrix operations that compose the CMA algorithm. Based on our computational experiments and optimizations, we propose an approach using a GPU platform that is able to achieve up to 16× and 26× speedup for the CMA processing of a single 15k × 15k MoM matrix of a perfect electric conductor scatterer and a single 30k × 30k MoM matrix of a dielectric scatterer, respectively. In addition to improving the processing speed of CMA, our approach provided the same accuracy as independent CMA simulations. The speedup, efficiency, and accuracy of our CMA implementation will enable the analysis of electromagnetic systems much larger than what was previously possible at a fraction of the computational time.
•Two GPU-based Runge-Kutta methods are used to evaluate the burn-up in a PWR reactor.
•Speed improvement exceeding 100 times over the sequential version.
•Error with respect to the benchmark lower than 0.0001%.
Fast and accurate simulations of isotopic inventory are of fundamental importance to the conceptual design, refueling, and disposal of fuel from nuclear power plants (NPP). Determining the fuel's isotopic composition requires a high computational effort arising from the complexity of solving the large system of coupled ordinary differential equations (ODEs). The system of ODEs is related to the physical mesh, fuel and coolant temperatures, time history of power, previous isotopic concentrations, and radioactive decay chains under analysis. This study surveys two methods used to simulate fuel burn-up in pressurized water reactors (PWR), implemented on a Graphics Processing Unit (GPU). The accuracy of the methods was also studied by comparing the inventory simulation of one burn-up cycle against the benchmark obtained from the Chebyshev Rational Approximation Method (CRAM); the Square-Root-Mean-Error (SRME) is less than 0.0001%. A performance comparison of the sequential version and the GPU methods exhibits a speed improvement exceeding one hundred times.
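The kind of coupled ODE system involved can be illustrated with a two-nuclide decay chain integrated by a Runge-Kutta scheme. The abstract does not say which Runge-Kutta variants were used, so the classical fourth-order method below, the decay constants, and all function names are assumptions; real burn-up systems couple many more nuclides and include reaction terms.

```python
import math


def decay_chain_rhs(n, lam1, lam2):
    """dN1/dt = -lam1*N1 ; dN2/dt = lam1*N1 - lam2*N2 (parent decays to daughter)."""
    n1, n2 = n
    return [-lam1 * n1, lam1 * n1 - lam2 * n2]


def rk4_step(f, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(y)."""
    k1 = f(y)
    k2 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
    k3 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
    k4 = f([yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f, y0, t_end, steps):
    """Advance y' = f(y) from t = 0 to t_end with a fixed step size."""
    h = t_end / steps
    y = list(y0)
    for _ in range(steps):
        y = rk4_step(f, y, h)
    return y


# Hypothetical decay constants and initial inventory (arbitrary units).
lam1, lam2, n0, t_end = 1.0, 0.5, 1.0, 2.0
numeric = integrate(lambda n: decay_chain_rhs(n, lam1, lam2), [n0, 0.0], t_end, 200)

# Closed-form (Bateman) solution of the same two-member chain, for comparison.
exact_n1 = n0 * math.exp(-lam1 * t_end)
exact_n2 = n0 * lam1 / (lam2 - lam1) * (math.exp(-lam1 * t_end) - math.exp(-lam2 * t_end))
```

Each mesh cell carries its own independent system of this form, which is what makes the problem amenable to one-thread-per-cell style GPU parallelism.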