A new approach for the parallel computation of the singular value decomposition (SVD) of a matrix A ∈ C^{m×n} is proposed. Contrary to the known algorithms that use a static cyclic ordering of subproblems simultaneously solved in one iteration step, the proposed implementation of the two-sided block-Jacobi method uses a dynamic ordering of subproblems. The dynamic ordering takes into account the actual status of matrix
A. In each iteration step, a set of the off-diagonal blocks is determined that reduces the Frobenius norm of the off-diagonal elements of
A as much as possible and, at the same time, can be annihilated concurrently. This task is equivalent to the maximum-weight perfect matching problem, and a greedy algorithm for its efficient solution is presented. Computational experiments with both types of ordering, incorporated into the two-sided block-Jacobi method, were performed on an SGI–Cray Origin 2000 parallel computer using the Message Passing Interface (MPI). The results confirm that, with regard to the amount of work required to compute the SVD to a given accuracy, the dynamic ordering is much more efficient than the static cyclic ordering.
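The greedy strategy sketched in the abstract can be illustrated as follows. This is a minimal sketch of the idea, not the paper's code: the function name and the weight matrix W (Frobenius norms of the off-diagonal blocks) are our own illustrative choices.

```python
import numpy as np

def greedy_pairing(W):
    """Greedy approximation of the maximum-weight perfect matching:
    repeatedly select the heaviest off-diagonal block pair (i, j) whose
    block-row/column indices are both still unmatched. W is a symmetric
    matrix whose entry W[i, j] is the Frobenius norm of block (i, j)."""
    ell = W.shape[0]
    pairs = [(W[i, j], i, j) for i in range(ell) for j in range(i + 1, ell)]
    pairs.sort(reverse=True)  # heaviest pairs first
    used, matching = set(), []
    for w, i, j in pairs:
        if i not in used and j not in used:
            matching.append((i, j))
            used.update((i, j))
    return matching

# Example: blocks (0,1) and (2,3) carry the most off-diagonal weight,
# so the greedy pass schedules them for concurrent annihilation.
W = np.array([[0, 5, 1, 1],
              [5, 0, 1, 1],
              [1, 1, 0, 4],
              [1, 1, 4, 0]], dtype=float)
```

The greedy pass is not guaranteed to find the optimal matching, but it runs in O(ℓ² log ℓ) time and, as the abstract notes, suffices to pick a set of concurrently annihilable blocks with large total weight.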
Four new triangular systolic arrays for the QZ matrix decomposition, based on block Givens and Householder transformations, are described. Their performance is compared with the traditional point rotator and point reflector for the QR matrix decomposition. The comparison is based on a detailed discussion of the length of the global time step Δt that synchronizes the array, the number of required processors, and the area defined by the vertical and horizontal links used in the array for data pipelining. It turns out that the most efficient array for the QZ matrix decomposition is the one that implements the block Givens rotation with an encoder and decoder in the diagonal and nondiagonal processors, respectively. This array is almost perfectly balanced, as opposed to the well-known triangular array for the QR matrix decomposition designed by W. M. Gentleman and H. T. Kung.
We analyse the fine-grained parallelism of the two-sided block-Jacobi algorithm for the singular value decomposition (SVD) of a matrix A ∈ R^{m×n}, m ≥ n. The algorithm involves the class CO of parallel orderings on the two-dimensional toroidal mesh with p processors. The mathematical background is based on the QR decomposition (QRD) of local data matrices and on the triangular Kogbetliantz algorithm (TKA) for local SVDs in the diagonal mesh processors. Subsequent updates of local matrices in the diagonal as well as nondiagonal mesh processors are required. We show that all updates can be realized by orthogonal modified Givens rotations. These rotations can be efficiently pipelined in parallel in the horizontal and vertical rings of √p processors through the toroidal mesh. For one mesh processor, our solution requires O((m+n)²/p) systolic processing elements (PEs), O(m²/p) local memory registers and O((m+n)²/p) additional delay elements. The time complexity of our solution is O(m + n^{3/2}/p^{3/4}) Δ time steps per one global iteration, where Δ is the length of the global synchronization time step, given by the evaluation and application of two modified Givens rotations in the TKA.
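The building block used throughout the update step above is the Givens rotation. The following is a sketch of the standard (square-root-based) rotation; the paper's "modified" variant is a square-root-reduced reformulation of the same operation, and the helper name is ours.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T,
    i.e. a plane rotation that zeroes the second component."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

# Example: zero out A[1, 0] by applying the rotation from the left.
A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, s],
              [-s, c]])
A = G @ A  # A[1, 0] is now (numerically) zero; A[0, 0] = hypot(3, 4)
```

Because each rotation touches only two rows (or columns), independent rotations can be applied concurrently, which is what makes the pipelined execution through the rings of √p processors possible.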
The loose part monitoring system (LPMS) in the primary circuit of the WWER-440/WWER-1000, as in any other PWR or BWR type of NPP, is one of the fundamental diagnostic tools essential to safety. The purpose of the system is to detect, localize and analyze detached or loose metal pieces. When a metal impact occurs, it causes mechanical vibration of the structures, and the sensors detect a burst signal. The aim of this work is to show the potential of modern signal processing methods for enhancing LPMS performance by solving the tasks of noise cancellation, time-of-arrival detection, discrimination between real and faulty alarms, and loose metal piece mass estimation.
One way to speed up the computation of the singular value decomposition of a given matrix A ∈ C^{m×n}, m ⩾ n, by the parallel two-sided block-Jacobi method consists of applying some pre-processing steps that would concentrate the Frobenius norm near the diagonal. Such a concentration should hopefully lead to fewer outer parallel iteration steps needed for the convergence of the entire algorithm. It is shown experimentally that the QR factorization with complete column pivoting, optionally followed by the LQ factorization of the
R-factor, can lead to a substantial decrease of the number of outer parallel iteration steps, whereby the details depend on the condition number and on the distribution of singular values, including their multiplicity. A subset of ill-conditioned matrices has been identified for which the dynamic ordering becomes inefficient. The best results in numerical experiments performed on a cluster of personal computers were achieved for well-conditioned matrices with a multiple minimal singular value, where the number of parallel iteration steps was reduced by two orders of magnitude. However, the gain in speed, as measured by the total parallel execution time, depends decisively on the implementation of the distributed QR and LQ factorizations on a given parallel architecture. In general, a reduction of the total parallel execution time of up to one order of magnitude has been achieved.
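The pre-processing chain described above can be sketched in a few lines of NumPy/SciPy (sequentially, not in the distributed form the paper studies). This sketch assumes SciPy is available for the pivoted QR; the function name `preprocess` is ours.

```python
import numpy as np
from scipy.linalg import qr  # pivoted QR; assumes SciPy is available

def preprocess(A):
    """Sketch of the pre-processing step: QR factorization with column
    pivoting, followed by an LQ factorization of the R-factor. The
    result L is lower triangular, has exactly the same singular values
    as A, and tends to concentrate the Frobenius norm near the
    diagonal, which is the point of the pre-processing."""
    Q1, R, piv = qr(A, mode='economic', pivoting=True)
    # LQ factorization of R, obtained via QR of its transpose:
    # R.T = Q2 @ R2  =>  R = R2.T @ Q2.T, so L = R2.T is lower triangular.
    Q2, R2 = np.linalg.qr(R.T)
    L = R2.T
    return L
```

Since only orthogonal transformations and a column permutation are applied, the block-Jacobi iteration can then be run on L instead of A, and the singular vectors of A are recovered by accumulating Q1, Q2 and the permutation.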
The parallel two-sided block-Jacobi singular value decomposition (SVD) algorithm with dynamic ordering, originally proposed in Parallel Comput. 28 (2002) 243–262, has been extended with respect to the blocking factor ℓ. Unlike the unique blocking factor ℓ = 2p in the original algorithm running on p processors, the current blocking factor is a variable parameter that covers the values in two different regions, namely ℓ = p/k and ℓ = 2kp for some integer k. Two new parallel two-sided block-Jacobi SVD algorithms with dynamic ordering are described in detail. They arise in these two regions and differ in the logical data arrangement and the communication complexity of the reordering step. For the case ℓ = 2kp, it is proved that the designed point-to-point communication algorithm is optimal with respect to the amount of communication required per processor as well as to the amount of overall communication. Using the message passing programming model for distributed memory machines, the new parallel block-Jacobi SVD algorithms were implemented on an SGI–Cray Origin 2000 parallel computer. Numerical experiments were performed on p = 12 and 24 processors using a set of six matrices of order 4000 and blocking factors ℓ, 2 ⩽ ℓ ⩽ 192. To achieve the minimal total parallel execution time, the use of a blocking factor ℓ ∈ {2, p, 2p} can be recommended for matrices with distinct singular values. However, for matrices with a multiple minimal singular value, the total parallel execution time may monotonically increase with ℓ. In this case, the recommended Jacobi method with ℓ = 2 is just the ScaLAPACK routine with some additional matrix multiplications, and it computes the SVD in one parallel iteration step.
Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.
More recently, parallel strategies have been identified and utilised, resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, increasing the computational work in each processing node, can reduce memory accesses and hence congestion in a shared-memory multiprocessor system. Similarly, when network message passing is minimised in a distributed-memory processor system, dramatic improvements in the performance of the algorithm ensue.
A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplitz linear systems. In the Levinson algorithm for solving such linear equations, the Toeplitz matrix property is utilised to advantage in the elimination process of each element. However, it can be shown that in the Parallel Implicit Elimination (PIE) method, where more than one element is eliminated simultaneously, the Toeplitz structure can again be utilised to advantage. This relatively simple strategy yields a reduction in accesses to shared memory or in network message passing, resulting in a significant improvement in the performance of the algorithm [2].
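For reference, the classic sequential Levinson recursion that the PIE method parallelises solves Tx = b for a symmetric positive definite Toeplitz T in O(n²) operations instead of O(n³). The following is a textbook sketch (following the formulation in Golub and Van Loan), not the paper's parallel variant; all names are ours.

```python
import numpy as np

def levinson(t, b):
    """Solve T x = b where T is the symmetric positive definite Toeplitz
    matrix with first row t, using the O(n^2) Levinson recursion. The
    auxiliary vector y solves the growing Yule-Walker subsystems."""
    t = np.asarray(t, dtype=float)
    b = np.asarray(b, dtype=float) / t[0]
    r = t[1:] / t[0]                       # normalised off-diagonal entries
    n = len(b)
    x = np.array([b[0]])
    y = np.array([-r[0]]) if n > 1 else np.array([])
    alpha, beta = (-r[0] if n > 1 else 0.0), 1.0
    for k in range(1, n):
        beta = (1.0 - alpha * alpha) * beta
        mu = (b[k] - r[:k] @ x[::-1]) / beta
        x = np.append(x + mu * y[::-1], mu)      # extend solution to order k+1
        if k < n - 1:
            alpha = -(r[k] + r[:k] @ y[::-1]) / beta
            y = np.append(y + alpha * y[::-1], alpha)
    return x
```

Each step extends the order-k solution to order k+1 using only inner products with reversed vectors, which is exactly the structure that the Toeplitz property makes possible and that the PIE method reorganises to eliminate several elements at once.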