The data volume and computational load of MIMO radar are huge, so very high-speed computation is necessary for real-time processing. In this paper, we mainly study the time-division MIMO radar signal processing flow and propose an improved MIMO radar signal processing algorithm that raises processing speed relative to previous algorithms; on this basis, a parallel simulation system for MIMO radar based on a CPU/GPU architecture is proposed. The outer layer of the framework uses coarse-grained OpenMP parallelism for acceleration on the CPU, while the inner layer of fine-grained data processing is accelerated on the GPU. Its performance is significantly faster than serial computation, and satisfactory acceleration is achieved in the CPU/GPU architecture simulation. The experimental results show that the MIMO radar parallel simulation system with the CPU/GPU architecture greatly improves on the computing power of the CPU-based method. Compared with the serial CPU method, the GPU simulation achieves a speedup of 130 times. In addition, the MIMO radar signal processing parallel simulation system based on the CPU/GPU architecture delivers a 13% performance improvement over the GPU-only method.
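As a rough illustration of this two-level design, the following sketch pairs a coarse-grained OpenMP loop over radar channels on the CPU with one CUDA stream per channel on the GPU, so transfers and kernels can overlap. All names (NUM_CHANNELS, processSamples, the toy kernel body) are assumptions for illustration, not the paper's code.

// Hypothetical sketch: coarse-grained OpenMP over radar channels on the CPU,
// fine-grained per-sample work on the GPU. Names are illustrative only.
#include <cuda_runtime.h>
#include <omp.h>
#include <vector>

#define NUM_CHANNELS 8
#define SAMPLES (1 << 20)

// Placeholder for the fine-grained per-sample processing stage.
__global__ void processSamples(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // stand-in for pulse compression / filtering
}

int main() {
    std::vector<float*> dev(NUM_CHANNELS);
    std::vector<cudaStream_t> streams(NUM_CHANNELS);
    std::vector<std::vector<float>> host(NUM_CHANNELS,
                                         std::vector<float>(SAMPLES, 1.0f));
    for (int c = 0; c < NUM_CHANNELS; ++c) {
        cudaMalloc(&dev[c], SAMPLES * sizeof(float));
        cudaStreamCreate(&streams[c]);
    }

    // Outer coarse-grained parallelism: one OpenMP thread per channel,
    // each driving its own CUDA stream.
    #pragma omp parallel for
    for (int c = 0; c < NUM_CHANNELS; ++c) {
        cudaMemcpyAsync(dev[c], host[c].data(), SAMPLES * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        processSamples<<<(SAMPLES + 255) / 256, 256, 0, streams[c]>>>(dev[c], SAMPLES);
        cudaMemcpyAsync(host[c].data(), dev[c], SAMPLES * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < NUM_CHANNELS; ++c) {
        cudaStreamDestroy(streams[c]);
        cudaFree(dev[c]);
    }
    return 0;
}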
In this paper, discontinuous Galerkin based high-order gas-kinetic schemes (DG-HGKS) are developed for the three-dimensional Euler and Navier–Stokes equations. Different from traditional discontinuous Galerkin (DG) methods with Riemann solvers, the current method adopts a kinetic evolution process, which is provided by the integral solution of the Bhatnagar–Gross–Krook (BGK) model. In the weak formulation of the DG method, a time-dependent evolution function is provided, and both inviscid and viscous fluxes can be calculated uniformly. Temporal accuracy is achieved by the two-stage fourth-order discretization, and the second-order gas-kinetic solver is adopted for the fluxes over the cell interfaces and inside the cells. Numerical examples, including accuracy tests and the Taylor–Green vortex problem, are presented to validate the efficiency and accuracy of DG-HGKS. Both optimal convergence and super-convergence are achieved by the current scheme. A comparison between DG-HGKS and the high-order gas-kinetic scheme with weighted essentially non-oscillatory reconstruction (WENO-HGKS) is also given, and the numerical performances are comparable for approximately equal numbers of degrees of freedom. To accelerate the computation, DG-HGKS is implemented on graphics processing units (GPUs) using the compute unified device architecture (CUDA). The obtained results are compared with those calculated by the central processing unit (CPU) code in terms of computational efficiency. The speedup of the GPU code suggests the potential of high-order gas-kinetic schemes for large-scale computation.
• DG-HGKS are developed for the three-dimensional Euler and Navier–Stokes equations.
• Numerical examples are presented to validate the efficiency and accuracy of DG-HGKS.
• To accelerate the computation, DG-HGKS is implemented on GPU using CUDA.
• The speedup of the GPU code suggests the potential of HGKS for large-scale computation.
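For reference, the kinetic evolution cited in the abstract above rests on the integral solution of the BGK model; in its standard textbook form (recalled here, not quoted from the paper) it reads:

% Integral solution of the BGK model along a particle trajectory,
% with collision time \tau, equilibrium state g, and initial state f_0.
f(x, t, u) = \frac{1}{\tau}\int_0^t g(x', t', u)\, e^{-(t - t')/\tau}\, \mathrm{d}t'
           + e^{-t/\tau} f_0(x - u t, u),
\qquad x' = x - u\,(t - t').

The first term accounts for accumulated collisions toward equilibrium; the second is the free transport of the initial state, which is why inviscid and viscous fluxes can be evaluated in one uniform expression.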
In the latest High Efficiency Video Coding (HEVC) development, i.e., the HEVC screen content coding extensions (HEVC-SCC), a hash-based inter-motion search/block matching scheme is adopted in the reference test model, which brings significant coding gains for screen content. However, the hash table generation itself may take up to half the encoding time and is thus too complex for practical usage. In this paper, we propose a hierarchical hash design and a corresponding block matching scheme that significantly reduce the complexity of hash-based block matching. The hierarchical structure in the proposed scheme allows large-block hash calculation to reuse the results of small blocks. Thus, we avoid redundant computation among blocks of different sizes, which greatly reduces complexity without compromising coding efficiency. The experimental results show that, compared with the hash-based block matching scheme in the HEVC-SCC test model (SCM)-6.0, the proposed scheme reduces hash processing time by about 77%, which leads to 12% and 16% encoding time savings in the random access (RA) and low-delay B coding structures, respectively. The proposed scheme has been adopted into the latest SCM. A parallel implementation of the proposed hash table generation on a graphics processing unit (GPU) is also presented to show the high parallelism of the proposed scheme; it achieves more than 30 frames/s for 1080p sequences and 60 frames/s for 720p sequences. With the fast hash-based block matching integrated into x265 and the hash table generated on the GPU, the encoder achieves 11.8% and 14.0% coding gains on average for the RA and low-delay P coding structures, respectively, in real-time encoding.
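The bottom-up reuse idea can be sketched as follows: a kernel derives each larger block's hash from the stored hashes of its four sub-blocks, so per-pixel work happens only at the lowest level. The combine function and layout below are assumptions for illustration; the SCM uses CRC-based hashes with its own definition.

// Hypothetical sketch of bottom-up hash combination: the hash of each
// 16x16 block is built from the stored hashes of its four 8x8 sub-blocks.
#include <cuda_runtime.h>
#include <stdint.h>

__device__ uint32_t combine(uint32_t a, uint32_t b) {
    return a * 0x9E3779B1u ^ (b + (a << 6) + (a >> 2));  // mixing step (assumed)
}

// One thread per 16x16 block position; child hashes laid out on an 8x8 grid
// of width w8 and height rows8 (in 8-pixel units).
__global__ void build16x16(const uint32_t* h8, uint32_t* h16,
                           int w8, int rows8) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x + 1 >= w8 || y + 1 >= rows8) return;
    uint32_t h = combine(h8[y * w8 + x], h8[y * w8 + x + 1]);
    h = combine(h, h8[(y + 1) * w8 + x]);
    h = combine(h, h8[(y + 1) * w8 + x + 1]);
    h16[y * w8 + x] = h;  // no pixel is re-read at this level
}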
Frequent itemset mining is widely used as a fundamental data mining technique. However, as data sizes increase, the relatively slow performance of existing methods hinders its applicability. Although many sequential frequent itemset mining methods have been proposed, there is a clear limit to the performance achievable with a single thread. To overcome this limitation, various parallel methods using multi-core CPUs, multiple machines, or many-core graphics processing units (GPUs) have been proposed. However, these methods still have drawbacks, including relatively slow performance, data size limitations, and poor scalability due to workload skewness. In this paper, we propose a fast GPU-based frequent itemset mining method called GMiner for large-scale data. GMiner achieves very fast performance by fully exploiting the computational power of GPUs and is suitable for large-scale data. The method performs mining tasks in a counterintuitive way: it mines the patterns from the first level of the enumeration tree rather than storing and utilizing the patterns at the intermediate levels of the tree. This approach is quite effective in terms of both performance and memory use on the GPU architecture. In addition, GMiner solves the workload skewness problem from which existing parallel methods suffer; as a result, its performance increases almost linearly with the number of GPUs. Through extensive experiments, we demonstrate that GMiner significantly outperforms other representative sequential and parallel methods in most cases, by orders of magnitude on the tested benchmarks.
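A common GPU pattern for this kind of mining, shown in the sketch below, represents each item as a vertical bitmap (one bit per transaction), so the support of an itemset is the popcount of the AND of its item bitmaps. The layout and names are assumptions for illustration, not GMiner's actual data structures.

// Hypothetical sketch of GPU support counting with vertical bitmaps:
// support({a, b}) = popcount(bitmapA & bitmapB) over all transactions.
#include <cuda_runtime.h>
#include <stdint.h>

// words = number of 32-bit words covering all transactions.
__global__ void supportAB(const uint32_t* bitmapA, const uint32_t* bitmapB,
                          int words, unsigned int* support) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < words) {
        int c = __popc(bitmapA[i] & bitmapB[i]);  // transactions containing both items
        atomicAdd(support, (unsigned int)c);      // accumulate global support
    }
}

Because every word is processed independently, the work divides evenly across threads and GPUs, which is the property that counters workload skewness.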
Graphics processing units (GPUs), originally developed for rendering real-time effects in computer games, now provide unprecedented computational power for scientific applications. In this paper, we develop a general-purpose molecular dynamics code that runs entirely on a single GPU. We show that our GPU implementation provides performance equivalent to that of a fast 30-processor-core distributed-memory cluster. Our results show that GPUs already provide an inexpensive alternative to such clusters, and we discuss the implications for the future.
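The data-parallel core of such a code can be sketched as a naive all-pairs Lennard-Jones force kernel, one thread per particle. This is a minimal sketch of the parallel pattern only; production codes use cell or neighbor lists, and all names here are illustrative.

// Hypothetical sketch: naive O(N^2) Lennard-Jones forces on the GPU.
// F(r) = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6)/r, directed along r.
#include <cuda_runtime.h>

__global__ void ljForces(const float3* pos, float3* force, int n,
                         float eps, float sigma2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 f = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float s2 = sigma2 / r2;                                // (sigma/r)^2
        float s6 = s2 * s2 * s2;                               // (sigma/r)^6
        float fmag = 24.f * eps * s6 * (2.f * s6 - 1.f) / r2;  // |F|/r
        f.x += fmag * dx; f.y += fmag * dy; f.z += fmag * dz;
    }
    force[i] = f;  // per-particle force, no write conflicts
}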
An accelerated shooting and bouncing ray (SBR) method is presented in this letter to improve the efficiency of radar cross section prediction for electrically large and complex objects. The SBR method is inefficient because it needs to trace a large number of ray tubes; its efficiency can therefore be improved by reducing the number of ray tubes and increasing the speed of ray tracing. In this letter, we use rays instead of ray tubes to reduce the number of rays that need to be traced, and construct a virtual ray tube at the ray exit position for field integration. Moreover, the ray tracing and integration operations are executed on the graphics processing unit to further improve the efficiency of the SBR method. Several examples are designed to prove the effectiveness of the proposed method.
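The per-ray trace loop might look like the sketch below: each GPU thread traces one ray (not a ray tube) through a triangle soup, reflecting specularly until it exits, and records the exit point and direction that would feed the virtual-ray-tube field integration. The geometry handling is deliberately naive (linear scan, no acceleration structure) and all names are illustrative, not the letter's code.

// Hypothetical sketch of the per-ray SBR bounce loop on the GPU.
#include <cuda_runtime.h>

struct Tri { float3 v0, e1, e2; };  // vertex plus two edge vectors

__device__ float3 sub(float3 a, float3 b){ return make_float3(a.x-b.x, a.y-b.y, a.z-b.z); }
__device__ float3 cross3(float3 a, float3 b){
    return make_float3(a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x);
}
__device__ float dot3(float3 a, float3 b){ return a.x*b.x + a.y*b.y + a.z*b.z; }

// Moller-Trumbore ray/triangle test; returns hit distance or -1.
__device__ float hit(const Tri& t, float3 o, float3 d) {
    float3 p = cross3(d, t.e2);
    float det = dot3(t.e1, p);
    if (fabsf(det) < 1e-8f) return -1.f;
    float inv = 1.f / det;
    float3 s = sub(o, t.v0);
    float u = dot3(s, p) * inv;
    if (u < 0.f || u > 1.f) return -1.f;
    float3 q = cross3(s, t.e1);
    float v = dot3(d, q) * inv;
    if (v < 0.f || u + v > 1.f) return -1.f;
    float dist = dot3(t.e2, q) * inv;
    return dist > 1e-4f ? dist : -1.f;
}

__global__ void traceRays(const Tri* tris, int nTri, const float3* orig,
                          const float3* dir, float3* exitPos, float3* exitDir,
                          int nRays, int maxBounce) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRays) return;
    float3 o = orig[r], d = dir[r];
    for (int b = 0; b < maxBounce; ++b) {
        float best = 1e30f; int bi = -1;
        for (int i = 0; i < nTri; ++i) {        // nearest intersection
            float t = hit(tris[i], o, d);
            if (t > 0.f && t < best) { best = t; bi = i; }
        }
        if (bi < 0) break;                      // ray leaves the scene
        float3 n = cross3(tris[bi].e1, tris[bi].e2);
        float nl = rsqrtf(dot3(n, n));
        n = make_float3(n.x*nl, n.y*nl, n.z*nl);
        o = make_float3(o.x + best*d.x, o.y + best*d.y, o.z + best*d.z);
        float k = 2.f * dot3(d, n);             // specular reflection
        d = make_float3(d.x - k*n.x, d.y - k*n.y, d.z - k*n.z);
    }
    exitPos[r] = o;                             // inputs to field integration
    exitDir[r] = d;
}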
This paper concerns the evolutionary induction of decision trees (DTs) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously, and thus in many situations improves the prediction quality and size of the resulting classifiers. However, this population-based, iterative approach can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines the knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and compute resources. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to the GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that the data size boundaries for evolutionary DT mining are fading.
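The GPU side of this CPU/GPU split can be sketched as a fitness kernel that evaluates one candidate tree over all instances in parallel. The flattened tree encoding below is an assumption for illustration, not the paper's exact layout.

// Hypothetical sketch: evaluate a candidate decision tree on the GPU,
// one thread per training instance, counting mispredictions as fitness input.
#include <cuda_runtime.h>

struct Node {
    int feature;      // attribute tested at this node, -1 for a leaf
    float thresh;     // split threshold
    int left, right;  // child indices in the flattened array
    int label;        // predicted class when feature == -1
};

__global__ void evalTree(const Node* tree, const float* X, const int* y,
                         int nInst, int nFeat, unsigned int* errors) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nInst) return;
    int node = 0;
    while (tree[node].feature >= 0) {  // route the instance down the tree
        float v = X[i * nFeat + tree[node].feature];
        node = (v <= tree[node].thresh) ? tree[node].left : tree[node].right;
    }
    if (tree[node].label != y[i]) atomicAdd(errors, 1u);
}

With data resident on the GPUs and partitioned across them (the data-parallel decomposition), only the small flattened tree and the error count cross the PCIe bus per fitness evaluation.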
Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. Although a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility, and improved performance per watt (by 17.1 percent) over state-of-the-art GPU NoCs, which are highly customized and non-scalable.
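The round-robin policy over converged ports can be illustrated with a toy software model (not RTL, and not the paper's simulator): each local crossbar cycles its outgoing converged port so that SM-to-LLC traffic spreads evenly across the diverse paths.

// Hypothetical sketch of round-robin converged-port selection
// inside one local crossbar of a CD-Xbar-style NoC model.
struct LocalXbar {
    int numConverged;  // converged ports leaving this local crossbar
    int next = 0;      // round-robin pointer

    // Pick the converged port for the next packet from any input SM port;
    // cycling the pointer balances load across the path-diverse ports.
    int route() {
        int port = next;
        next = (next + 1) % numConverged;
        return port;
    }
};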
• We implemented a particle filter and an auxiliary particle filter on a GeForce GTX TITAN GPU.
• Both estimators are able to meet real-time operating constraints in a remelting process.
• The fully adapted auxiliary particle filter is more efficient in this peaked-likelihood case.
• The auxiliary particle filter is 40 times faster than the particle filter for the same accuracy.
Particle filters are nonlinear estimators that can be used to detect anomalies in manufacturing processes. Although promising, their high computational cost often prevents their implementation in real-time applications. Recently, the introduction of graphics processing units (GPUs) has enabled the acceleration of computationally intensive processes with their massive parallel capabilities. This article presents the acceleration of the particle filter and the auxiliary particle filter, two of the most important particle methods, on a GPU using NVIDIA CUDA technology. This is illustrated via simulation for a remelting process, where the accelerated algorithms return accurate estimates while remaining two orders of magnitude faster than the physical process, even for calculations involving millions of particles.
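The data-parallel core that makes particle filters GPU-friendly is the per-particle propagate-and-weigh step, sketched below for a scalar state with Gaussian noise. The model and names are illustrative assumptions; weight normalization and resampling (reduction/scan steps) are omitted.

// Hypothetical sketch: one thread per particle propagates the state and
// computes an unnormalized Gaussian likelihood weight for the measurement.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void propagateAndWeigh(float* x, float* w, int n,
                                  float obs, float procStd, float obsStd,
                                  unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState rng;
    curand_init(seed, i, 0, &rng);              // independent stream per particle
    x[i] = x[i] + procStd * curand_normal(&rng);  // assumed random-walk transition
    float d = (obs - x[i]) / obsStd;
    w[i] = expf(-0.5f * d * d);                 // unnormalized likelihood weight
}

Because every particle is independent at this step, millions of particles map directly onto GPU threads, which is what yields the two-orders-of-magnitude margin over the physical process.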
Ray tracing has long been regarded as the next generation of mainstream image rendering technology because of the authenticity of its rendering effect, and it is a hot research topic in the field of computer graphics. In recent years, academia and industry have extensively researched real-time ray tracing. To promote this research, this paper reviews, analyses, and summarizes the related literature. First, the concept, algorithms, and classification of acceleration structures are introduced. Three commercial graphics processing units (GPUs) supporting ray tracing are introduced and the differences between them are compared. The paper then summarizes the optimization of ray tracing from six aspects, ray packets, stackless traversal, ray reordering, wide BVHs, denoising techniques, and real-time ray tracing combined with artificial neural networks, and expounds on the advantages and disadvantages of the relevant specific methods. Based on the acceleration of the algorithms, the