This paper describes the Automatically Tuned Linear Algebra Software (ATLAS) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high-performance library production and maintenance, which we term automated empirical optimization of software (AEOS); this style of library management was created to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the Basic Linear Algebra Subprograms (BLAS), a widely used, performance-critical, linear algebra kernel library.
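The AEOS approach can be illustrated in miniature: generate parameterized variants of a kernel, time each one empirically on the target machine, and keep the fastest. The Python sketch below is our own illustration, not ATLAS's actual generator (which emits and times tuned C kernels); it tunes the blocking factor of a blocked matrix multiply this way.

    import time
    import numpy as np

    def blocked_matmul(A, B, nb):
        # One kernel variant: blocked matrix multiply with blocking factor nb.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, nb):
            for j in range(0, n, nb):
                for k in range(0, n, nb):
                    C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
        return C

    def autotune(n=256, candidates=(16, 32, 64, 128)):
        # AEOS in miniature: time each variant on this machine, keep the fastest.
        A, B = np.random.rand(n, n), np.random.rand(n, n)
        timings = {}
        for nb in candidates:
            start = time.perf_counter()
            blocked_matmul(A, B, nb)
            timings[nb] = time.perf_counter() - start
        return min(timings, key=timings.get)

    print("best blocking factor on this machine:", autotune())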
This article presents various data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are induced on the operands, and the values that are optimal with respect to performance are architecture dependent. The techniques presented in this paper redistribute data "on the fly," so that the user's data distribution blocking factor becomes independent of the architecture-dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis along with experimental results shows that the alignment restrictions can then be removed and that high performance can be maintained across platforms, independently of the user's data distribution blocking factor.
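The block-cyclic mapping underlying this work is compact enough to state in code. The sketch below (0-based indices, a one-dimensional process grid; the helper name is ours) maps a global index to its owning process and local offset; redistributing "on the fly" amounts to moving data so that it satisfies this map for a new blocking factor.

    def owner_and_local(g, nb, p):
        # Block-cyclic distribution: global index g, blocking factor nb,
        # p processes. Blocks are dealt out to the processes round-robin.
        block = g // nb                      # global block number
        proc = block % p                     # owning process
        local = (block // p) * nb + g % nb   # offset in the owner's local storage
        return proc, local

    # With nb=2 and p=2, indices 0,1,4,5 land on process 0 and 2,3,6,7 on process 1.
    for g in range(8):
        print(g, owner_and_local(g, nb=2, p=2))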
One of the main obstacles to the efficient solution of scientific problems is tuning software, both to the available architecture and to the user's problem at hand. We describe approaches for obtaining tuned high-performance kernels and for automatically choosing suitable algorithms. Specifically, we describe the generation of dense and sparse Basic Linear Algebra Subprograms (BLAS) kernels and the selection of linear solver algorithms. However, the ideas presented here extend beyond these areas, which can be considered a proof of concept.
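A minimal form of such algorithm selection is a dispatch on structural properties of the matrix. The sketch below is a hypothetical illustration of the idea, not the selection mechanism of the paper: symmetric sparse systems are attempted with conjugate gradients, symmetric dense ones with Cholesky, and LU serves as the general fallback.

    import numpy as np
    from scipy.sparse import issparse
    from scipy.sparse.linalg import cg, spsolve
    from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

    def solve(A, b):
        # Illustrative solver selection: dispatch on matrix structure.
        if issparse(A):
            if (A != A.T).nnz == 0:          # symmetric sparse: try conjugate gradients
                x, info = cg(A, b)
                if info == 0:
                    return x
            return spsolve(A, b)             # general sparse direct solve
        if np.allclose(A, A.T):
            try:
                return cho_solve(cho_factor(A), b)   # symmetric positive definite
            except np.linalg.LinAlgError:
                pass                         # not positive definite; fall through
        return lu_solve(lu_factor(A), b)     # general dense LU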
This article discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky factorizations. They are implemented using a block-cyclic data distribution and are built from de facto standard kernels for matrix and vector operations (the BLAS and its parallel counterpart, the PBLAS) and for message-passing communication (the BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK code using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, the Intel Touchstone Delta, and the Intel Paragon system.
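The building-block structure is easiest to see in a serial sketch of the right-looking blocked LU that these routines parallelize (pivoting omitted for brevity; our own illustration). Each step maps onto a standard kernel: an unblocked panel factorization, a triangular solve of the block row (TRSM in BLAS terms), and a rank-nb update of the trailing matrix (GEMM); in ScaLAPACK these become distributed PBLAS operations with BLACS communication.

    import numpy as np

    def blocked_lu(A, nb=2):
        # Right-looking blocked LU without pivoting; L (unit lower) and U
        # end up packed in one array, as in LAPACK's storage convention.
        A = A.astype(float).copy()
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # 1. Unblocked LU of the current panel (columns k..e-1).
            for j in range(k, e):
                A[j+1:, j] /= A[j, j]
                A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
            if e < n:
                # 2. Triangular solve for the block row (TRSM).
                L = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
                A[k:e, e:] = np.linalg.solve(L, A[k:e, e:])
                # 3. Rank-nb update of the trailing matrix (GEMM).
                A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return A

    A0 = np.random.rand(6, 6) + 6 * np.eye(6)   # diagonally dominant: safe without pivoting
    LU = blocked_lu(A0)
    L, U = np.tril(LU, -1) + np.eye(6), np.triu(LU)
    assert np.allclose(L @ U, A0)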
The authors study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms according to the performance of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic that turns out to be very satisfactory; its practical usefulness is demonstrated by MPI experiments conducted on a heterogeneous network of workstations.
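For the unidimensional case mentioned above, a simple greedy allocation already balances the load: give each successive block to the processor whose completion time would grow the least. The sketch below is our own minimal illustration of that idea (speeds in blocks per second), not the paper's 2D heuristic.

    def allocate_1d(num_blocks, speeds):
        # Greedy 1D allocation over processors with the given relative speeds.
        counts = [0] * len(speeds)
        for _ in range(num_blocks):
            i = min(range(len(speeds)), key=lambda j: (counts[j] + 1) / speeds[j])
            counts[i] += 1
        return counts

    # Three processors with relative speeds 3:2:1 sharing 10 blocks.
    print(allocate_1d(10, [3.0, 2.0, 1.0]))   # -> [6, 3, 1], makespan 2.0 (optimal here)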
In this paper we present several new fast and reliable algorithms for solving rank-deficient linear least squares problems by means of the complete orthogonal decomposition. Experimental comparison of our algorithms with the existing implementations in LAPACK on a wide range of platforms shows considerable performance improvements. Moreover, for full-rank matrices the performance of the new algorithm is very close to that of the fast method based on QR factorization, thus providing a valuable general tool for full-rank matrices, rank-deficient matrices, and matrices of unknown rank.
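The decomposition at the heart of these methods can be sketched with standard library calls (an illustration of the technique, not the paper's implementation): QR with column pivoting reveals the numerical rank r, and a second QR factorization of the leading r rows, transposed, completes the decomposition and yields the minimum-norm solution.

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    def cod_lstsq(A, b, tol=1e-10):
        # Minimum-norm least squares via a complete orthogonal decomposition.
        Q, R, piv = qr(A, mode='economic', pivoting=True)    # A[:, piv] = Q @ R
        r = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))
        Z, S = qr(R[:r, :].T, mode='economic')               # R[:r, :] = S.T @ Z.T
        y = solve_triangular(S.T, Q[:, :r].T @ b, lower=True)
        x = np.zeros(A.shape[1])
        x[piv] = Z @ y                                       # undo the column pivoting
        return x

    rng = np.random.default_rng(0)
    A = rng.random((8, 2))
    A = np.hstack([A, A.sum(axis=1, keepdims=True)])         # rank-deficient: col3 = col1 + col2
    b = rng.random(8)
    assert np.allclose(cod_lstsq(A, b), np.linalg.lstsq(A, b, rcond=None)[0])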
Heuristics are one of the most important tools for guiding search when solving combinatorial problems. They are often designed for a single specific problem and require both expertise and implementation work. Generic frameworks such as SAT or CSP have developed heuristics that obey general principles, such as first-fail, or that learn and adapt from the exploration of the search tree, such as Dom/wDeg. In SAT, the classic VSIDS heuristic falls into both categories. The question of whether it is possible to learn from solving existing problems has long been addressed by portfolio solvers, where the best heuristic is chosen by machine learning from hand-crafted features, and more recently with deep learning, by embedding this knowledge into a graph neural network (GNN). In this paper, we build upon the latter category by proposing a new heuristic based on deep reinforcement learning using two GNNs with adversarial rewards. We show that our method reduces the number of fails before reaching the first solution by more than 50% compared to MiniSat. This work shows the advantage of such techniques for extracting structural and contextual knowledge from past solving experience.
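For context, the classical Dom/wDeg rule mentioned above fits in a few lines; the sketch below is a generic illustration of that baseline (all names are ours), unrelated to the paper's learned GNN heuristic. A constraint's weight is incremented each time it causes a failure, so the rule blends first-fail (prefer small domains) with conflict information gathered during search.

    def domwdeg_select(variables, domains, weights, constraints_of):
        # Pick the unassigned variable minimizing |domain| / summed weight
        # of the constraints it appears in (higher weight = more past failures).
        def score(v):
            w = sum(weights[c] for c in constraints_of[v]) or 1
            return len(domains[v]) / w
        return min(variables, key=score)

    domains = {'x': {1, 2, 3}, 'y': {1, 2}, 'z': {1, 2, 3, 4}}
    weights = {'c1': 10, 'c2': 1}                 # c1 has failed often
    constraints_of = {'x': ['c1'], 'y': ['c2'], 'z': ['c1', 'c2']}
    # First-fail alone would pick 'y' (smallest domain); Dom/wDeg picks 'x'.
    print(domwdeg_select(['x', 'y', 'z'], domains, weights, constraints_of))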