Exact Recovery in the Stochastic Block Model Abbe, Emmanuel; Bandeira, Afonso S.; Hall, Georgina
IEEE transactions on information theory,
2016-Jan., 2016-1-00, 20160101, Letnik:
62, Številka:
1
Journal Article
Recenzirano
Odprti dostop
The stochastic block model with two communities, or equivalently the planted bisection model, is a popular model of random graph exhibiting a cluster behavior. In the symmetric case, the graph has ...two equally sized clusters and vertices connect with probability p within clusters and q across clusters. In the past two decades, a large body of literature in statistics and computer science has focused on providing lower bounds on the scaling of | p - q| to ensure exact recovery. In this paper, we identify a sharp threshold phenomenon for exact recovery: if α = pn/log(n) and β = qn/ log(n) are constant (with α > β), recovering the communities with high probability is possible if (α + β/2) - √(αβ) > 1 and is impossible if (α + β/2) - √(αβ) <; 1. In particular, this improves the existing bounds. This also sets a new line of sight for efficient clustering algorithms. While maximum likelihood (ML) achieves the optimal threshold (by definition), it is in the worst case NP-hard. This paper proposes an efficient algorithm based on a semidefinite programming relaxation of ML, which is proved to succeed in recovering the communities close to the threshold, while numerical experiments suggest that it may achieve the threshold. An efficient algorithm that succeeds all the way down to the threshold is also obtained using a partial recovery algorithm combined with a local improvement procedure.
We obtain nonasymptotic bounds on the spectral norm of random matrices with independent entries that improve significantly on earlier results. If X is the n × n symmetric matrix with $X_{ij} \sim ...N(0, b^{2}_{ij})$, we show that $E\left \| X \right \| \stackrel{\textless}{\sim } \underset{i}{\text{max}} \sqrt{\sum _{j} b^{2}_{ij}} + \underset{ij}{\text{max}}\left | b_{ij} \right | \sqrt{\log n}$. This bound is optimal in the sense that a matching lower bound holds under mild assumptions, and the constants are sufficiently sharp that we can often capture the precise edge of the spectrum. Analogous results are obtained for rectangular matrices and for more general sub-Gaussian or heavy-tailed distributions of the entries, and we derive tail bounds in addition to bounds on the expected norm. The proofs are based on a combination of the moment method and geometric functional analysis techniques. As an application, we show that our bounds immediately yield the correct phase transition behavior of the spectral edge of random band matrices and of sparse Wigner matrices. We also recover a result of Seginer on the norm of Rademacher matrices.
Maximum likelihood estimation problems are, in general, intractable optimization problems. As a result, it is common to approximate the maximum likelihood estimator (MLE) using convex relaxations. In ...some cases, the relaxation is tight: it recovers the true MLE. Most tightness proofs only apply to situations where the MLE exactly recovers a planted solution (known to the analyst). It is then sufficient to establish that the optimality conditions hold at the planted signal. In this paper, we study an estimation problem (angular synchronization) for which the MLE is not a simple function of the planted solution, yet for which the convex relaxation is tight. To establish tightness in this context, the proof is less direct because the point at which to verify optimality conditions is not known explicitly. Angular synchronization consists in estimating a collection of
n
phases, given noisy measurements of the pairwise relative phases. The MLE for angular synchronization is the solution of a (hard) non-bipartite Grothendieck problem over the complex numbers. We consider a stochastic model for the data: a planted signal (that is, a ground truth set of phases) is corrupted with non-adversarial random noise. Even though the MLE does not coincide with the planted signal, we show that the classical semidefinite relaxation for it is tight, with high probability. This holds even for high levels of noise.
The largest eigenvalue of a matrix is always larger or equal than its largest diagonal entry. We show that for a class of random Laplacian matrices with independent off-diagonal entries, this bound ...is essentially tight: the largest eigenvalue is, up to lower order terms, often the size of the largest diagonal. entry. Besides being a simple tool to obtain precise estimates on the largest eigenvalue of a class of random Laplacian matrices, our main result settles a number of open problems related to the tightness of certain convex relaxation-based algorithms. It easily implies the optimality of the semidefinite relaxation approaches to problems such as
Z
2
Synchronization and stochastic block model recovery. Interestingly, this result readily implies the connectivity threshold for Erdős–Rényi graphs and suggests that these three phenomena are manifestations of the same underlying principle. The main tool is a recent estimate on the spectral norm of matrices with independent entries by van Handel and the author.
OPTIMALITY AND SUB-OPTIMALITY OF PCA I Perry, Amelia; Wein, Alexander S.; Bandeira, Afonso S. ...
The Annals of statistics,
10/2018, Letnik:
46, Številka:
5
Journal Article
Recenzirano
Odprti dostop
A central problem of random matrix theory is to understand the eigenvalues of spiked random matrix models, introduced by Johnstone, in which a prominent eigenvector (or “spike”) is planted into a ...random matrix. These distributions form natural statistical models for principal component analysis (PCA) problems throughout the sciences. Baik, Ben Arous and Péché showed that the spiked Wishart ensemble exhibits a sharp phase transition asymptotically: when the spike strength is above a critical threshold, it is possible to detect the presence of a spike based on the top eigenvalue, and below the threshold the top eigenvalue provides no information. Such results form the basis of our understanding of when PCA can detect a low-rank signal in the presence of noise. However, under structural assumptions on the spike, not all information is necessarily contained in the spectrum. We study the statistical limits of tests for the presence of a spike, including nonspectral tests. Our results leverage Le Cam’s notion of contiguity and include:
(i) For the Gaussian Wigner ensemble, we show that PCA achieves the optimal detection threshold for certain natural priors for the spike.
(ii) For any non-Gaussian Wigner ensemble, PCA is sub-optimal for detection. However, an efficient variant of PCA achieves the optimal threshold (for natural priors) by pre-transforming the matrix entries.
(iii) For the Gaussian Wishart ensemble, the PCA threshold is optimal for positive spikes (for natural priors) but this is not always the case for negative spikes.
Many important geometric estimation problems naturally take the form of synchronization over the special Euclidean group: estimate the values of a set of unknown group elements
x
1
,
…
,
x
n
∈
SE
(
d
...)
given noisy measurements of a subset of their pairwise relative transforms
x
i
−
1
x
j
. Examples of this class include the foundational problems of pose-graph simultaneous localization and mapping (SLAM) (in robotics), camera motion estimation (in computer vision), and sensor network localization (in distributed sensing), among others. This inference problem is typically formulated as a non-convex maximum-likelihood estimation that is computationally hard to solve in general. Nevertheless, in this paper we present an algorithm that is able to efficiently recover certifiably globally optimal solutions of the special Euclidean synchronization problem in a non-adversarial noise regime. The crux of our approach is the development of a semidefinite relaxation of the maximum-likelihood estimation (MLE) whose minimizer provides an exact maximum-likelihood estimate so long as the magnitude of the noise corrupting the available measurements falls below a certain critical threshold; furthermore, whenever exactness obtains, it is possible to verify this fact a posteriori, thereby certifying the optimality of the recovered estimate. We develop a specialized optimization scheme for solving large-scale instances of this semidefinite relaxation by exploiting its low-rank, geometric, and graph-theoretic structure to reduce it to an equivalent optimization problem defined on a low-dimensional Riemannian manifold, and then design a Riemannian truncated-Newton trust-region method to solve this reduction efficiently. Finally, we combine this fast optimization approach with a simple rounding procedure to produce our algorithm, SE-Sync. Experimental evaluation on a variety of simulated and real-world pose-graph SLAM datasets shows that SE-Sync is capable of recovering certifiably globally optimal solutions when the available measurements are corrupted by noise up to an order of magnitude greater than that typically encountered in robotics and computer vision applications, and does so significantly faster than the Gauss–Newton-based approach that forms the basis of current state-of-the-art techniques.
A good understanding of the sources of economic growth is fundamental. Total factor productivity (TFP) – frequently taken as synonymous of technical change but computed as a residual – is credited as ...a major driver of growth. Despite considerable efforts, the sources of TFP growth are still poorly understood. As all economic processes necessarily entail the conversion of energy, aggregate energy (conversion) efficiency is a plausible candidate to measure technical change. Here, we test the hypothesis that there is a statistically significant long-term relationship between TFP and aggregate energy efficiency (measured as final-to-useful aggregate exergy efficiency), for Portugal, 1960–2014. Our results strongly suggest that these variables are cointegrated, with unit elasticity between technical change and energy efficiency. This link is stronger when TFP is computed considering the more theoretically correct capital services and schooling-corrected labour measures. Since we also find that aggregate energy efficiency Granger-causes TFP, we can legitimately claim that the former is the main driver of economic growth. As thermodynamic laws impose that aggregate energy efficiency is upper bounded at 100%, our findings also suggest there are thermodynamic limits on growth. This carries implications for models currently used to inform policymakers on economic growth, sustainable development, and climate change scenarios.
•Energy efficiency a plausible candidate to explain total factor productivity (TFP).•Energy efficiency adequately measured as final-to-useful exergy efficiency.•We statistically test the relationship linking aggregate energy efficiency and TFP.•For Portugal, 1960–2014, energy efficiency is unit elastic driver of TFP and growth.•Implications for economic growth modelling, scenarios, and climate action.
A central tool in the study of nonhomogeneous random matrices, the noncommutative Khintchine inequality, yields a nonasymptotic bound on the spectral norm of general Gaussian random matrices
X
=
∑
i
...g
i
A
i
where
g
i
are independent standard Gaussian variables and
A
i
are matrix coefficients. This bound exhibits a logarithmic dependence on dimension that is sharp when the matrices
A
i
commute, but often proves to be suboptimal in the presence of noncommutativity. In this paper, we develop nonasymptotic bounds on the spectrum of arbitrary Gaussian random matrices that can capture noncommutativity. These bounds quantify the degree to which the spectrum of
X
is captured by that of a noncommutative model
X
free
that arises from free probability theory. This “intrinsic freeness” phenomenon provides a powerful tool for the study of various questions that are outside the reach of classical methods of random matrix theory. Our nonasymptotic bounds are easily applicable in concrete situations, and yield sharp results in examples where the noncommutative Khintchine inequality is suboptimal. When combined with a linearization argument, our bounds imply strong asymptotic freeness for a remarkably general class of Gaussian random matrix models that may be very sparse, have dependent entries, and lack any special symmetries. When combined with a universality principle, our bounds extend beyond the Gaussian setting to general sums of independent random matrices.