Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high "statistical leverage" and, thus, in a very precise statistical sense, exert a disproportionately large "influence" on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.
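As a rough illustration of the sampling step, the sketch below (Python/NumPy; the function names and normalization are ours, not the paper's exact COLUMNSELECT procedure) computes column leverage scores from the top-k right singular vectors and draws columns proportionally.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized column leverage scores from the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0) / k  # pi_j; sums to 1 over columns

def sample_columns(A, k, c, seed=0):
    """Draw c distinct columns of A with probability proportional to leverage."""
    rng = np.random.default_rng(seed)
    p = leverage_scores(A, k)
    idx = rng.choice(A.shape[1], size=c, replace=False, p=p)
    return A[:, idx], idx
```

Columns with high leverage are exactly those that exert a disproportionate influence on the best rank-k fit, which is why sampling by these scores yields the relative-error guarantees described above.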
Identifying variants associated with complex traits is a challenging task in genetic association studies due to linkage disequilibrium (LD) between genetic variants and population stratification unrelated to the disease risk. Existing methods for population structure correction use principal component analysis or linear mixed models with a random effect when modeling associations between a trait of interest and genetic markers. However, due to stringent significance thresholds and latent interactions between the markers, these methods often fail to detect genuinely associated variants. To overcome this, we propose CluStrat, which corrects for complex, arbitrarily structured populations while leveraging the LD-induced distances between genetic markers. It performs agglomerative hierarchical clustering using the Mahalanobis distance covariance matrix of the markers. In simulation studies, we show that our method outperforms existing methods in detecting true causal variants. Applying CluStrat to the WTCCC2 and UK Biobank cohorts, we found biologically relevant associations in schizophrenia and myocardial infarction. CluStrat was also able to correct for population structure in the polygenic adaptation of height in Europeans. CluStrat highlights the advantages of biologically relevant distance metrics, such as the Mahalanobis distance, which captures cryptic interactions within populations in the presence of LD better than the Euclidean distance.
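A minimal sketch of the clustering idea, assuming a NumPy genotype matrix with one row per individual; CluStrat's actual pipeline (covariance regularization, choice of linkage, association testing within clusters) is richer than this.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_individuals(G, n_clusters):
    """Agglomerative clustering of genotype rows under the Mahalanobis metric.

    G: (samples x markers) genotype matrix. The inverse marker covariance VI
    lets the distance account for LD-induced correlations between markers.
    """
    VI = np.linalg.pinv(np.cov(G, rowvar=False))  # inverse marker covariance
    D = pdist(G, metric="mahalanobis", VI=VI)     # condensed pairwise distances
    Z = linkage(D, method="average")              # agglomerative hierarchy
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```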
We study the problem of approximating the eigenspectrum of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with bounded entries (i.e., $\|A\|_\infty \le 1$). We present a simple sublinear time algorithm that approximates all eigenvalues of $A$ up to additive error $\pm \epsilon n$ using those of a randomly sampled $\tilde{O}\left(\frac{\log^3 n}{\epsilon^3}\right) \times \tilde{O}\left(\frac{\log^3 n}{\epsilon^3}\right)$ principal submatrix. Our result can be viewed as a concentration bound on the complete eigenspectrum of a random submatrix, significantly extending known bounds on just the singular values (the magnitudes of the eigenvalues). We give improved error bounds of $\pm \epsilon \sqrt{\mathrm{nnz}(A)}$ and $\pm \epsilon \|A\|_F$ when the rows of $A$ can be sampled with probabilities proportional to their sparsities or their squared $\ell_2$ norms, respectively. Here $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$ and $\|A\|_F$ is its Frobenius norm. Even for the strictly easier problems of approximating the singular values or testing the existence of large negative eigenvalues (Bakshi, Chepurko, and Jayaram, FOCS '20), our results are the first that take advantage of non-uniform sampling to give improved error bounds. From a technical perspective, our results require several new eigenvalue concentration and perturbation bounds for matrices with bounded entries. Our non-uniform sampling bounds require a new algorithmic approach, which judiciously zeroes out entries of a randomly sampled submatrix to reduce variance before computing the eigenvalues of that submatrix as estimates for those of $A$. We complement our theoretical results with numerical simulations, which demonstrate the effectiveness of our algorithms in practice.
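For intuition, here is a minimal sketch of the uniform-sampling variant, assuming the matrix fits in memory; the paper's algorithm adds the machinery (and, for the non-uniform bounds, the entry-zeroing step) needed for the stated guarantees.

```python
import numpy as np

def estimate_eigenvalues(A, s, seed=0):
    """Estimate eigenvalues of symmetric A (|A_ij| <= 1) from a random
    s x s principal submatrix, rescaled by n/s (uniform-sampling sketch).

    Returns s estimates; the remaining n - s eigenvalues are implicitly
    estimated as 0, consistent with the +/- epsilon*n additive error regime.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    S = rng.choice(n, size=s, replace=False)      # random index set
    sub = A[np.ix_(S, S)]                         # principal submatrix
    return (n / s) * np.linalg.eigvalsh(sub)
```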
In many applications, the data consist of (or may be naturally formulated as) an $m \times n$ matrix $A$. It is often of interest to find a low-rank approximation to $A$, i.e., an approximation $D$ to the matrix $A$ of rank not greater than a specified rank $k$, where $k$ is much smaller than $m$ and $n$. Methods such as the singular value decomposition (SVD) may be used to find an approximation to $A$ which is the best in a well-defined sense. These methods require memory and time which are superlinear in $m$ and $n$; for many applications in which the data sets are very large, this is prohibitive. Two simple and intuitive algorithms are presented which, when given an $m \times n$ matrix $A$, compute a description of a low-rank approximation $D^{*}$ to $A$, and which are qualitatively faster than the SVD. Both algorithms have provable bounds for the error matrix $A-D^{*}$. For any matrix $X$, let $\|{X}\|_F$ and $\|{X}\|_2$ denote its Frobenius norm and its spectral norm, respectively. In the first algorithm, $c$ columns of $A$ are randomly chosen. If the $m \times c$ matrix $C$ consists of those $c$ columns of $A$ (after appropriate rescaling), then it is shown that approximations to the top singular values and corresponding singular vectors may be computed from $C^TC$. From the computed singular vectors, a description $D^{*}$ of the matrix $A$ may be computed such that $\mathrm{rank}(D^{*}) \le k$ and such that $$ \left\|A-D^{*}\right\|_{\xi}^{2} \le \min_{D:\,\mathrm{rank}(D)\le k} \left\|A-D\right\|_{\xi}^{2} + \mathrm{poly}(k,1/c) \left\|{A}\right\|^2_F $$ holds with high probability for both $\xi = 2, F$. This algorithm may be implemented without storing the matrix $A$ in random access memory (RAM), provided it can make two passes over the matrix stored in external memory and use $O(cm+c^2)$ additional RAM. The second algorithm is similar, except that it further approximates the matrix $C$ by randomly sampling $r$ rows of $C$ to form an $r \times c$ matrix $W$. Thus, it has additional error, but it can be implemented in three passes over the matrix using only constant additional RAM. To achieve an additional error (beyond the best rank-$k$ approximation) that is at most $\epsilon\|{A}\|^2_F$, both algorithms take time which is polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$, where $\delta>0$ is a failure probability; the first takes time linear in $\max(m,n)$ and the second takes time independent of $m$ and $n$. Our bounds improve previously published results with respect to the rank parameter $k$ for both the Frobenius and spectral norms. In addition, the proofs for the error bounds use a novel method that makes important use of matrix perturbation theory. The probability distribution over columns of $A$ and the rescaling are crucial features of the algorithms which must be chosen judiciously.
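A compact sketch of the first algorithm's main steps, assuming for illustration that $A$ fits in RAM (the point of the paper is that it need not): sample columns by squared Euclidean norm, rescale, and recover an approximate rank-$k$ basis from the small matrix $C^TC$.

```python
import numpy as np

def approx_low_rank(A, k, c, seed=0):
    """Monte Carlo low-rank approximation via norm-squared column sampling."""
    rng = np.random.default_rng(seed)
    p = (A ** 2).sum(axis=0)
    p /= p.sum()                                  # p_j ~ squared column norms
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])           # rescaled sampled columns
    w, V = np.linalg.eigh(C.T @ C)                # spectrum of small c x c matrix
    top = np.argsort(w)[::-1][:k]
    H = C @ V[:, top] / np.sqrt(np.maximum(w[top], 1e-12))  # approx. left sing. vecs
    return H @ (H.T @ A)                          # description of D*, rank <= k
```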
Motivated by applications in which the data may be formulated as a matrix, we consider algorithms for several common linear algebra problems. These algorithms make more efficient use of computational resources, such as the computation time, random access memory (RAM), and the number of passes over the data, than do previously known algorithms for these problems. In this paper, we devise two algorithms for the matrix multiplication problem. Suppose $A$ and $B$ (which are $m\times n$ and $n\times p$, respectively) are the two input matrices. In our main algorithm, we perform $c$ independent trials, where in each trial we randomly sample an element of $\{ 1,2,\ldots, n\}$ with an appropriate probability distribution ${\cal P}$ on $\{ 1,2,\ldots, n\}$. We form an $m\times c$ matrix $C$ consisting of the sampled columns of $A$, each scaled appropriately, and we form a $c\times n$ matrix $R$ using the corresponding rows of $B$, again scaled appropriately. The choice of ${\cal P}$ and the column and row scaling are crucial features of the algorithm. When these are chosen judiciously, we show that $CR$ is a good approximation to $AB$. More precisely, we show that $$ \left\|AB-CR\right\|_F = O(\left\|A\right\|_F \left\|B\right\|_F /\sqrt c) , $$ where $\|\cdot\|_F$ denotes the Frobenius norm, i.e., $\|A\|^2_F=\sum_{i,j}A_{ij}^2$. This algorithm can be implemented without storing the matrices $A$ and $B$ in RAM, provided it can make two passes over the matrices stored in external memory and use $O(c(m+n+p))$ additional RAM to construct $C$ and $R$. We then present a second matrix multiplication algorithm which is similar in spirit to our main algorithm. In addition, we present a model (the pass-efficient model) in which the efficiency of these and other approximate matrix algorithms may be studied and which we argue is well suited to many applications involving massive data sets. In this model, the scarce computational resources are the number of passes over the data and the additional space and time required by the algorithm. The input matrices may be presented in any order of the entries (and not just row or column order), as is the case in many applications where, e.g., the data have been written by multiple agents. In addition, the input matrices may be presented in a sparse representation, where only the nonzero entries are written.
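A minimal in-memory sketch of the main algorithm; the two-pass external-memory implementation, which is the point of the pass-efficient model, is ignored here. Sampling with probabilities proportional to $\|A^{(i)}\|_2 \|B_{(i)}\|_2$ is the judicious choice analyzed in the paper.

```python
import numpy as np

def approx_matmul(A, B, c, seed=0):
    """Approximate AB by CR from c sampled, rescaled column/row pairs."""
    rng = np.random.default_rng(seed)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p /= p.sum()                                  # near-optimal distribution P
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])             # makes CR unbiased: E[CR] = AB
    C = A[:, idx] * scale                         # m x c
    R = B[idx, :] * scale[:, None]                # c x p
    return C @ R
```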
Recent neuroimaging studies have shown that functional connectomes are unique to individuals, i.e., two distinct fMRIs taken over different sessions of the same subject are more similar in terms of their connectomes than those from two different subjects. In this study, we present new results that identify specific parts of resting state and task-specific connectomes that are responsible for the unique signatures. We show that a very small part of the connectome can be used to derive features for discriminating between individuals. A network of these features is shown to achieve excellent training and test accuracy in matching imaging datasets. We show that these features are statistically significant, robust to perturbations, invariant across populations, and are localized to a small number of structural regions of the brain. Furthermore, we show that for task-specific connectomes, the regions identified by our method are consistent with their known functional characterization. We present a new matrix sampling technique to derive computationally efficient and accurate methods for identifying the discriminating sub-connectome and support all of our claims using state-of-the-art statistical tests and computational techniques.
Abstract
Motivation
Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly large, traditional approaches based on loading the entire dataset into system memory (Random Access Memory) become impractical, and out-of-core implementations are the only viable alternative.
Results
We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task.
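An in-memory sketch of randomized subspace iteration, the method TeraPCA implements out-of-core; the oversampling and iteration counts below are illustrative defaults, not TeraPCA's, and the genotype centering and blocked I/O that TeraPCA performs are omitted.

```python
import numpy as np

def randomized_subspace_iteration(A, k, n_iter=5, oversample=10, seed=0):
    """Approximate top-k SVD of A via randomized subspace iteration."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Q = np.linalg.qr(rng.standard_normal((n, k + oversample)))[0]
    for _ in range(n_iter):
        Q = np.linalg.qr(A @ Q)[0]       # multiply block up through A
        Q = np.linalg.qr(A.T @ Q)[0]     # multiply back, re-orthonormalize
    B = A @ Q                            # small m x (k + oversample) matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :] @ Q.T   # approx. top-k factors of A
```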
Availability and implementation
Source code and documentation are both available at https://github.com/aritra90/TeraPCA.
Supplementary information
Supplementary data are available at Bioinformatics online.
The increasing volume and complexity of high-throughput genomic data make analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the highest-confidence variants for experimental validation.
We describe Variant Ranker, a simple, user-friendly web-based tool for ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations from accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants; (ii) perform genotype filtering for case-control samples; (iii) filter large amounts of high-throughput data based on custom filter requirements and apply different models of inheritance; and (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (such as pathways, gene ontology terms and disease categories) and thereby expedite scientific discoveries. We demonstrate the utility of Variant Ranker in identifying causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes. Conclusions: Variant Ranker is a freely available web server at http://paschou-lab.mbg.duth.gr/Software.html. This tool enables users to prioritize potentially causal variants and is applicable to a wide range of sequencing data.
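A hypothetical illustration of weighted score aggregation in this spirit; the field names, weights, and combining formula below are invented for the example and are not Variant Ranker's actual implementation.

```python
def rank_variants(variants, weights):
    """Rank variants by a weighted sum of normalized annotation scores.

    variants: list of dicts mapping annotation names to scores in [0, 1]
    weights:  user-specified weight per annotation (hypothetical scheme)
    """
    def score(v):
        return sum(weights[name] * v.get(name, 0.0) for name in weights)
    return sorted(variants, key=score, reverse=True)

# Example with two toy variants and illustrative annotation names
variants = [{"id": "rs1", "prediction": 0.9, "conservation": 0.4, "rarity": 0.8},
            {"id": "rs2", "prediction": 0.5, "conservation": 0.9, "rarity": 0.3}]
ranked = rank_variants(variants, {"prediction": 0.5, "conservation": 0.3,
                                  "rarity": 0.2})
print([v["id"] for v in ranked])
```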
• A two-sample Mendelian randomization approach was used to identify proteins, metabolites, or microbes that have a putative causal association with subcortical brain structure volumes.
• Eleven proteins and six metabolites were found to have a significant association with subcortical structure volumes, with nine proteins and five metabolites being replicated using independent exposure data.
• Heterogeneity and pleiotropy analysis showed low to no deviation from the null, supporting the validity of the significant associations.
• The study highlighted the role of proteolytic and anti-oxidative components in the development and functioning of the brain.
• The results provide novel insight for understanding subcortical brain structure changes and could help in uncovering potential diagnostic markers and drug targets for the many disorders that are associated with changes in brain structures.
Alterations in subcortical brain structure volumes have been found to be associated with several neurodegenerative and psychiatric disorders. At the same time, genome-wide association studies (GWAS) have identified numerous common variants associated with brain structure. In this study, we integrate these findings, aiming to identify proteins, metabolites, or microbes that have a putative causal association with subcortical brain structure volumes via a two-sample Mendelian randomization approach. This method uses genetic variants as instrumental variables to identify potentially causal associations between an exposure and an outcome. The exposure data that we analyzed comprised genetic associations for 2994 plasma proteins, 237 metabolites, and 103 microbial genera. The outcome data included GWAS data for seven subcortical brain structure volumes: accumbens, amygdala, caudate, hippocampus, pallidum, putamen, and thalamus. Eleven proteins and six metabolites were found to have a significant association with subcortical structure volumes, with nine proteins and five metabolites replicated using independent exposure data. We found causal associations between accumbens volume and plasma protease C1 inhibitor as well as a strong association between putamen volume and Agouti signaling protein. Among metabolites, urate had the strongest association with thalamic volume. No significant associations were detected between the microbial genera and subcortical brain structure volumes. We also observed significant enrichment for biological processes such as proteolysis, regulation of the endoplasmic reticulum apoptotic signaling pathway, and negative regulation of DNA binding. Our findings provide insights into the mechanisms through which brain volumes may be affected in the pathogenesis of neurodevelopmental and psychiatric disorders and point to potential treatment targets for disorders that are associated with subcortical brain structure volumes.
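As an illustration of the two-sample design, here is the standard inverse-variance weighted (IVW) estimator commonly used in such studies; the study's actual analysis (instrument selection, heterogeneity and pleiotropy checks, replication) goes well beyond this sketch.

```python
import numpy as np

def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect of exposure on outcome.

    beta_exposure: per-SNP effects on the exposure (e.g., a plasma protein)
    beta_outcome:  per-SNP effects on the outcome (e.g., a structure volume)
    se_outcome:    standard errors of the outcome effects
    """
    w = 1.0 / se_outcome ** 2                       # inverse-variance weights
    denom = np.sum(w * beta_exposure ** 2)
    estimate = np.sum(w * beta_exposure * beta_outcome) / denom
    std_err = np.sqrt(1.0 / denom)
    return estimate, std_err
```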
In many applications, the data consist of (or may be naturally formulated as) an $m \times n$ matrix $A$ which may be stored on disk but which is too large to be read into random access memory (RAM) or to practically perform superlinear polynomial time computations on it. Two algorithms are presented which, when given an $m \times n$ matrix $A$, compute approximations to $A$ which are the product of three smaller matrices, $C$, $U$, and $R$, each of which may be computed rapidly. Let $A' = CUR$ be the computed approximate decomposition; both algorithms have provable bounds for the error matrix $A-A'$. In the first algorithm, $c$ columns of $A$ and $r$ rows of $A$ are randomly chosen. If the $m \times c$ matrix $C$ consists of those $c$ columns of $A$ (after appropriate rescaling) and the $r \times n$ matrix $R$ consists of those $r$ rows of $A$ (also after appropriate rescaling), then the $c \times r$ matrix $U$ may be calculated from $C$ and $R$. For any matrix $X$, let $\|X\|_F$ and $\|X\|_2$ denote its Frobenius norm and its spectral norm, respectively. It is proven that $$ \left\|A-A'\right\|_\xi \le \min_{D:\,\mathrm{rank}(D)\le k} \left\|A-D\right\|_\xi + \mathrm{poly}(k,1/c) \left\|A\right\|_F $$ holds in expectation and with high probability for both $\xi = 2, F$ and for all $k=1,\ldots,\mathrm{rank}(A)$; thus, by appropriate choice of $k$, $$ \left\|A-A'\right\|_2 \le \epsilon \left\|A\right\|_F $$ also holds in expectation and with high probability. This algorithm may be implemented without storing the matrix $A$ in RAM, provided it can make two passes over the matrix stored in external memory and use $O(m+n)$ additional RAM (assuming that $c$ and $r$ are constants, independent of the size of the input). The second algorithm is similar, except that it approximates the matrix $C$ by randomly sampling a constant number of rows of $C$. Thus, it has additional error, but it can be implemented in three passes over the matrix using only constant additional RAM. To achieve an additional error (beyond the best rank-$k$ approximation) that is at most $\epsilon \|A\|_F$, both algorithms take time which is a low-degree polynomial in $k$, $1/\epsilon$, and $1/\delta$, where $\delta>0$ is a failure probability; the first takes time linear in $\max(m,n)$ and the second takes time independent of $m$ and $n$. The proofs for the error bounds make important use of matrix perturbation theory and previous work on approximating matrix multiplication and computing low-rank approximations to a matrix. The probability distribution over columns and rows and the rescaling are crucial features of the algorithms and must be chosen judiciously.
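A rough sketch of the sampling skeleton, using the paper's norm-squared sampling distributions but taking $U$ as the pseudoinverse of the intersection block, a classical pseudoskeleton choice; the paper computes $U$ from the rescaled samples differently and with different guarantees.

```python
import numpy as np

def cur_sketch(A, c, r, seed=0):
    """CUR sketch: norm-squared sampled columns C and rows R, with
    U = pinv(W) for the intersection block W (pseudoskeleton variant)."""
    rng = np.random.default_rng(seed)
    pc = (A ** 2).sum(axis=0); pc /= pc.sum()     # column probabilities
    pr = (A ** 2).sum(axis=1); pr /= pr.sum()     # row probabilities
    jc = rng.choice(A.shape[1], size=c, replace=False, p=pc)
    ir = rng.choice(A.shape[0], size=r, replace=False, p=pr)
    C, R = A[:, jc], A[ir, :]                     # m x c and r x n
    U = np.linalg.pinv(A[np.ix_(ir, jc)])         # c x r, from intersection W
    return C, U, R                                # A is approximated by C @ U @ R
```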