Developing efficient graph algorithm implementations is an important problem in modern computer science, since graphs are frequently used in various real-world applications. Graph algorithms typically belong to the data-intensive class, and thus architectures with high-bandwidth memory can potentially solve many graph problems significantly faster than modern multicore CPUs. Among other supercomputer architectures, SX-Aurora TSUBASA vector engines are equipped with high-bandwidth memory and thus can potentially accelerate various graph applications. However, very few existing graph processing frameworks can efficiently utilize SX-Aurora hardware capabilities. The GraphBLAS standard proposes a convenient way to develop graph algorithms in terms of linear algebra and frees its users from having to deeply understand vector architecture hardware features. This paper describes a world-first attempt to implement an optimized prototype of a GraphBLAS backend for SX-Aurora TSUBASA. Our backend prototype achieves performance comparable to existing Vector Graph Library (VGL) based implementations for SX-Aurora TSUBASA, and also outperforms existing GraphBLAS backends for NVIDIA GPUs. We also discuss the roadmap and challenges for creating a full GraphBLAS implementation for SX-Aurora TSUBASA vector engines.
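The core GraphBLAS idea mentioned above, expressing graph traversals as linear-algebra operations, can be illustrated with a minimal sketch. The following Python/NumPy code is not the paper's backend or the GraphBLAS API; it is a dense toy version of breadth-first search written as repeated Boolean matrix-vector products (real GraphBLAS implementations use sparse matrices, semirings, and masks).

```python
import numpy as np

def bfs_levels(adj: np.ndarray, source: int) -> np.ndarray:
    """BFS expressed as repeated Boolean vector-matrix products,
    the core idiom of GraphBLAS-style graph algorithms.
    adj[i, j] == True means there is an edge from vertex i to j."""
    n = adj.shape[0]
    levels = np.full(n, -1)              # -1 marks unvisited vertices
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    level = 0
    while frontier.any():
        levels[frontier] = level
        # One BFS step: OR-reduce the adjacency rows of the current
        # frontier (a Boolean-semiring vector-matrix product), then
        # mask out vertices that already have a level assigned.
        reachable = adj[frontier].any(axis=0)
        frontier = reachable & (levels == -1)
        level += 1
    return levels
```

The mask `levels == -1` plays the role of a GraphBLAS output mask; on a vector engine, both the reduction and the masking map naturally onto long vector operations over high-bandwidth memory.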
Molecular docking is a key method in computer-aided drug design, where the rapid identification of drug candidates is crucial for combating diseases. AutoDock is a widely used molecular docking program with an irregular structure characterized by divergent control flow and compute-intensive calculations. This work investigates porting AutoDock to the SX-Aurora TSUBASA vector engine and evaluates the achievable performance on a number of real-world input compounds. In particular, we discuss the platform-specific coding styles required to handle the high degree of irregularity in both local-search methods employed by AutoDock, Solis-Wets and ADADELTA, which take up a large part of the total computation time. Based on our experiments, we achieved runtimes on the SX-Aurora TSUBASA VE 20B that are on average 3x faster than on modern dual-socket 64-core CPU nodes. Our solution is competitive with V100 GPUs, even though these already use a newer chip fabrication technology (12 nm vs. 16 nm on the VE 20B).
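Of the two local-search methods named above, ADADELTA is a gradient-based update rule (Zeiler, 2012) whose shape can be sketched compactly. The code below is an illustrative, minimal version of the ADADELTA update applied to a toy quadratic objective, not AutoDock's actual scoring-function search; the function names and the toy objective are assumptions for illustration.

```python
import numpy as np

def adadelta_minimize(grad, x0, rho=0.95, eps=1e-6, steps=2000):
    """Minimal sketch of the ADADELTA update rule: step sizes are
    derived from running averages of squared gradients and squared
    past updates, so no global learning rate is needed."""
    x = np.asarray(x0, dtype=float)
    eg2 = np.zeros_like(x)   # running average of squared gradients
    edx2 = np.zeros_like(x)  # running average of squared updates
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x = x + dx
    return x
```

In a docking run, many such local searches proceed for many ligand poses at once, and their divergent control flow is what makes an efficient vector-engine mapping nontrivial.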
The recent success of vector computers such as the Cray-1 and array processors such as those manufactured by Floating Point Systems has increased interest in making vector operations available to the FORTRAN programmer. The FORTRAN standards committee is currently considering a successor to FORTRAN 77, usually called FORTRAN 8x, that will permit the programmer to explicitly specify vector and array operations.
Although FORTRAN 8x will make it convenient to specify explicit vector operations in new programs, it does little for existing code. In order to benefit from the power of vector hardware, existing programs will need to be rewritten in some language (presumably FORTRAN 8x) that permits the explicit specification of vector operations. One way to avoid a massive manual recoding effort is to provide a translator that discovers the parallelism implicit in a FORTRAN program and automatically rewrites that program in FORTRAN 8x.
Such a translation from FORTRAN to FORTRAN 8x is not straightforward because FORTRAN DO loops are not always semantically equivalent to the corresponding FORTRAN 8x parallel operation. The semantic difference between these two constructs is precisely captured by the concept of dependence. A translation from FORTRAN to FORTRAN 8x preserves the semantics of the original program if it preserves the dependences in that program.
The theoretical background is developed here for employing data dependence to convert FORTRAN programs to parallel form. Dependence is defined and characterized in terms of the conditions that give rise to it; accurate tests to determine dependence are presented; and transformations that use dependence to uncover additional parallelism are discussed.
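The distinction drawn above, between a DO loop that is equivalent to a vector operation and one that is not, can be made concrete with a small example. The sketch below uses Python/NumPy rather than FORTRAN purely for illustration: the first loop carries no dependence and can be rewritten as a whole-array operation, while the second contains a loop-carried dependence (iteration i reads the value iteration i-1 just wrote), so the same rewrite would change its meaning.

```python
import numpy as np

a = np.arange(6, dtype=float)   # [0, 1, 2, 3, 4, 5]
b = np.full(6, 2.0)

# No loop-carried dependence: every iteration reads only values that
# no other iteration writes, so the loop equals one vector operation.
out = a.copy()
for i in range(1, 6):
    out[i] = a[i - 1] + b[i]
vectorized = a[:-1] + b[1:]     # the FORTRAN 8x-style array form
assert np.allclose(out[1:], vectorized)

# Loop-carried dependence: iteration i reads out[i-1], written by the
# previous iteration.  This is a recurrence, and rewriting it as a
# whole-array operation (which reads the OLD values) is NOT equivalent.
out = a.copy()
for i in range(1, 6):
    out[i] = out[i - 1] + b[i]
wrong = a[:-1] + b[1:]          # same array form, different semantics
assert not np.allclose(out[1:], wrong)
```

An automatic translator must therefore prove the absence of such dependences (or transform them away) before emitting the parallel form.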
The last session of the HPPC 2009 workshop was dedicated to a panel discussion between the invited speakers and three additional, selected panelists. The theme of the panel was originally suggested by Uzi Vishkin and developed with the moderator. A preamble was given in advance to the five panelists and provoked an intensive and determined discussion. The panelists were given the chance to briefly summarize their viewpoints and standpoints after the panel.
The solution of linear systems continues to play an important role in scientific computing. The problems to be solved are often of very large size, so that solving them requires large computer resources. To solve these problems, at least supercomputers with large shared memory or massively parallel computer systems with distributed memory are needed.
This paper gives a survey of research on parallel implementation of various direct methods to solve dense linear systems. In particular, the following are considered: Gaussian elimination, Gauss-Jordan elimination, a variant due to Huard (1979), and an algorithm due to Enright (1978), designed in relation to solving (stiff) ODEs, such that the stepsize and other method parameters can easily be varied.
Some theoretical results are mentioned, including a new result on error analysis of Huard's algorithm. Moreover, practical considerations and results of experiments on supercomputers and on a distributed-memory computer system are presented.
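For reference, one of the direct methods named in the survey, Gauss-Jordan elimination, can be sketched in a few lines. This is a generic textbook version with partial pivoting in Python/NumPy, not any of the paper's parallel implementations; unlike plain Gaussian elimination, the pivot row eliminates entries both above and below the pivot, so no back-substitution pass is needed.

```python
import numpy as np

def gauss_jordan_solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial
    pivoting, reducing the augmented matrix [A | b] to [I | x]."""
    M = np.hstack([np.asarray(A, dtype=float),
                   np.asarray(b, dtype=float).reshape(-1, 1)])
    n = M.shape[0]
    for k in range(n):
        p = k + np.argmax(np.abs(M[k:, k]))   # partial pivoting
        M[[k, p]] = M[[p, k]]                 # swap pivot row up
        M[k] /= M[k, k]                       # normalize pivot row
        for i in range(n):
            if i != k:
                M[i] -= M[i, k] * M[k]        # eliminate above AND below
    return M[:, -1]
```

The row eliminations within each step k are independent of one another, which is precisely the parallelism that such surveys exploit on shared- and distributed-memory machines.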
We report on a source-code modification of the density-functional program suite VASP which greatly benefits from the use of graphics-processing units (GPUs). The blocked Davidson iteration scheme (EDDAV) has been optimized for GPUs and gains speed-ups of up to 3.39 on S1070 devices and of 6.97 on a C2050 device. Using the Fermi card, the code reaches an impressive 61.7% efficiency without suffering any accuracy losses. The algorithmic bottleneck lies in the multiplication of rectangular matrices. We also give some initial thoughts about introducing a different level of parallelism in order to harness the computational power of multi-GPU installations.
We discuss and evaluate three optimizations for reducing memory management overhead and data copying costs in SISAL 1.2 programs that build arrays. The first, called framework preconstruction, eliminates superfluous allocate-deallocate sequences in cyclic computations. The second, called aggregate storage subsumption, reduces the management overhead for compound array components. The third, called predictive storage preallocation, eliminates superfluous data copying in filtered array constructions and simplifies their parallelization. We have added all three optimizations to the Optimizing SISAL Compiler with rewarding improvements in SISAL program performance on vector-parallel machines such as those built by Cray Computer Corporation, Convex, and Cray Research.
Interleaved parallel schemes. Seznec, A.; Lenfant, J. IEEE Transactions on Parallel and Distributed Systems, 12/1994, Volume 5, Issue 12. Journal article, peer reviewed.
On vector supercomputers, vector register processors share a global highly interleaved memory. In order to optimize memory throughput, a single-instruction, multiple-data (SIMD) synchronization mode may be used on vector sections. We present an interleaved parallel scheme (IPS). Using IPS ensures an equitable distribution of elements on a highly interleaved memory for a wide range of vector strides. Access to memory may be organized in such a way that conflicts are avoided on memory and on the interconnection network.
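The stride problem that motivates such schemes is easy to quantify. In a conventional interleaved memory with B banks, element i of a stride-s vector access lands in bank (i*s) mod B, so the access touches only B / gcd(s, B) distinct banks; strides sharing a factor with the bank count concentrate traffic and serialize accesses. The sketch below is a generic illustration of this standard property, not the paper's IPS mapping itself.

```python
from math import gcd

def banks_touched(stride: int, n_banks: int) -> int:
    """Number of distinct banks hit by a stride-`stride` vector
    access on a conventional memory interleaved across `n_banks`
    banks: element i goes to bank (i * stride) % n_banks, which
    cycles through n_banks // gcd(stride, n_banks) banks."""
    return n_banks // gcd(stride, n_banks)

# With 16 banks: stride 1 spreads the access over all 16 banks,
# while stride 8 hits only 2 of them, creating bank conflicts.
```

An interleaved parallel scheme aims to keep the distribution close to the stride-1 case for a much wider range of strides.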
Aiming at realizing a high performance computing (HPC) cloud with vector supercomputers, this paper presents the world's first prototype of a wide-area vector meta-computing infrastructure, named a vector computing cloud, which virtualizes remote vector computing resources as an HPC service over the Internet. The prototype system consists of two remote NEC SX-9 nodes connected through a long fat-pipe network (LFN); the nodes are located at Tohoku University and Osaka University, 800 km apart. The vector computing cloud also provides a single sign-on environment, and jobs are automatically assigned to appropriate sites. Wide-area cooperation of distributed vector supercomputers is realized by adopting the NAREGI Grid Middleware, and a virtual machine for the NEC SX vector supercomputer series, job scheduling algorithms, and an MPI operating environment are newly developed to enhance the job and resource management capabilities of the NAREGI Grid Middleware. In addition, to achieve fair and efficient job scheduling on the vector computing cloud, this paper presents a history-based job scheduling mechanism for a queue system. Based on estimated job execution times derived from this history, the job scheduling mechanism automatically allocates each job to an appropriate site that can execute it earlier. The operation tests and experimental results indicate that the prototype system realizes single sign-on for multiple vector resources and has enough potential for transparently operating jobs between the two SX-9 systems. This paper also evaluates and discusses the performance of the proposed job scheduling mechanism and MPI operation between both SX-9 systems using the High Performance Linpack (HPL) benchmark.
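The history-based scheduling idea described above, estimate each site's completion time from past runtimes and submit the job where it would finish earliest, can be sketched schematically. The dictionary layout, field names, and the per-site slowdown factor below are illustrative assumptions, not the paper's middleware API.

```python
def choose_site(sites, job_seconds):
    """Hedged sketch of history-based site selection: each site
    tracks an estimated queue backlog (in seconds) built from past
    job runtimes and a relative slowdown factor.  The new job goes
    to the site with the earliest estimated completion time, and
    that site's backlog estimate is updated accordingly."""
    best = min(sites,
               key=lambda s: s["backlog"] + job_seconds * s["slowdown"])
    best["backlog"] += job_seconds * best["slowdown"]
    return best["name"]
```

Repeated calls naturally balance load: a lightly loaded site keeps receiving jobs until its estimated backlog catches up with the other site's, mirroring the transparent two-site operation evaluated in the paper.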