In this article, we consider the problem of detecting multiple changepoints in large datasets. Our focus is on applications where the number of changepoints will increase as we collect more data: for example, in genetics as we analyze larger regions of the genome, or in finance as we observe time series over longer periods. We consider the common approach of detecting changepoints through minimizing a cost function over possible numbers and locations of changepoints. This includes several established procedures for detecting changepoints, such as penalized likelihood and minimum description length. We introduce a new method for finding the minimum of such cost functions, and hence the optimal number and location of changepoints, that has a computational cost which, under mild conditions, is linear in the number of observations. This compares favorably with existing methods for the same problem, whose computational cost can be quadratic or even cubic. In simulation studies, we show that our new method can be orders of magnitude faster than these alternative exact methods. We also compare with the binary segmentation algorithm for identifying changepoints, showing that the exactness of our approach can lead to substantial improvements in the accuracy of the inferred segmentation of the data. This article has supplementary materials available online.
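The kind of penalized-cost minimization described above can be sketched with a small dynamic program. The following is a minimal illustration, assuming a squared-error (change-in-mean) segment cost and a BIC-style penalty chosen purely for the example; it shows the PELT-style pruning idea rather than the authors' full implementation.

```python
import numpy as np

def segment_cost(cumsum, cumsum2, s, t):
    """Squared-error cost of y[s:t] for a change-in-mean model
    (sum of squared deviations from the segment mean)."""
    n = t - s
    seg_sum = cumsum[t] - cumsum[s]
    seg_sum2 = cumsum2[t] - cumsum2[s]
    return seg_sum2 - seg_sum ** 2 / n

def pelt(y, penalty):
    """Minimise the sum of segment costs plus a penalty per changepoint.
    Returns the estimated changepoint locations (segment boundaries)."""
    y = np.asarray(y, float)
    n = len(y)
    cumsum = np.concatenate(([0.0], np.cumsum(y)))
    cumsum2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    F = np.full(n + 1, np.inf)        # F[t]: optimal penalized cost of y[:t]
    F[0] = -penalty
    last_cp = np.zeros(n + 1, dtype=int)
    candidates = [0]                  # candidate positions of the last changepoint
    for t in range(1, n + 1):
        costs = [F[s] + segment_cost(cumsum, cumsum2, s, t) + penalty
                 for s in candidates]
        best = int(np.argmin(costs))
        F[t] = costs[best]
        last_cp[t] = candidates[best]
        # PELT-style pruning: drop candidates that can never be optimal again
        candidates = [s for s, c in zip(candidates, costs) if c - penalty <= F[t]]
        candidates.append(t)
    # backtrack through the stored last-changepoint pointers
    cps, t = [], n
    while t > 0:
        s = last_cp[t]
        if s > 0:
            cps.append(s)
        t = s
    return sorted(cps)

# toy usage: a mean shift halfway through a simulated series
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(pelt(y, penalty=2 * np.log(len(y))))
```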
Detecting changepoints in data sets with many variates is a data science challenge of increasing importance. Motivated by the problem of detecting changes in the incidence of terrorism from a global terrorism database, we propose a novel approach to multiple changepoint detection in multivariate time series. Our method, which we call SUBSET, is a model-based approach which uses a penalised likelihood to detect changes for a wide class of parametric settings. We provide theory that guides the choice of penalties to use for SUBSET, and that shows it has high power to detect changes regardless of whether only a few variates or many variates change. Empirical results show that SUBSET outperforms many existing approaches for detecting changes in mean in Gaussian data; additionally, unlike these alternative methods, it can be easily extended to non-Gaussian settings such as are appropriate for modelling counts of terrorist events.
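To make the penalised-likelihood idea concrete, the sketch below aggregates per-variate evidence for a single change in mean so that both sparse changes (a few variates) and dense changes (many variates) can register. It is only a toy illustration in the spirit of such methods, not the SUBSET statistic itself, and the penalty values alpha and beta are placeholders rather than the theory-guided choices discussed in the paper.

```python
import numpy as np

def change_evidence(y, tau):
    """Per-variate evidence for a single change at time tau in an (n x p) array:
    twice the Gaussian log-likelihood ratio for a change in mean, unit variance."""
    n = y.shape[0]
    m1, m2, m = y[:tau].mean(0), y[tau:].mean(0), y.mean(0)
    return tau * (m1 - m) ** 2 + (n - tau) * (m2 - m) ** 2

def detect_single_change(y, alpha, beta):
    """Scan candidate change locations; a variate contributes only the part of its
    evidence exceeding a per-variate penalty alpha, and a change is declared if the
    aggregated evidence exceeds a global penalty beta."""
    n = y.shape[0]
    best_tau, best_stat = None, 0.0
    for tau in range(2, n - 1):
        ev = change_evidence(y, tau)
        stat = np.maximum(ev - alpha, 0.0).sum()
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return (best_tau, best_stat) if best_stat > beta else (None, best_stat)

# toy usage: 50 variates, only 3 of which change mean at time 100
rng = np.random.default_rng(0)
y = rng.normal(size=(200, 50))
y[100:, :3] += 2.0
print(detect_single_change(y, alpha=2 * np.log(50), beta=10.0))
```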
We consider regression models where the underlying functional relationship between the response and the explanatory variable is modeled as independent linear regressions on disjoint segments. We present an algorithm for perfect simulation from the posterior distribution of such a model, even allowing for an unknown number of segments and an unknown model order for the linear regressions within each segment. The algorithm is simple, can scale well to large data sets, and avoids the problem of diagnosing convergence that is present with Markov chain Monte Carlo (MCMC) approaches to this problem. We demonstrate our algorithm on standard denoising problems, on a piecewise constant AR model, and on a speech segmentation problem.
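The segment model itself can be pictured with the short sketch below, which fits an independent regression on each segment given assumed changepoint locations and per-segment polynomial orders (a stand-in for the unknown model orders); it illustrates only the likelihood model, not the perfect-simulation algorithm.

```python
import numpy as np

def fit_segments(x, y, breakpoints, orders):
    """Fit an independent polynomial regression on each disjoint segment.
    breakpoints: indices where new segments start; orders: one polynomial order
    per segment. Returns fitted values for the whole series."""
    bounds = [0] + list(breakpoints) + [len(y)]
    fitted = np.empty(len(y), dtype=float)
    for (s, t), order in zip(zip(bounds[:-1], bounds[1:]), orders):
        coef = np.polyfit(x[s:t], y[s:t], deg=order)   # per-segment least squares
        fitted[s:t] = np.polyval(coef, x[s:t])
    return fitted

# toy usage: piecewise linear signal with two segments and known breakpoint
x = np.linspace(0, 1, 200)
y = np.where(x < 0.5, 2 * x, 3 - 4 * x) + np.random.default_rng(2).normal(0, 0.1, 200)
print(fit_segments(x, y, breakpoints=[100], orders=[1, 1])[:5])
```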
In recent years, various means of efficiently detecting changepoints have been proposed, with one popular approach involving minimizing a penalized cost function using dynamic programming. In some situations, these algorithms can have an expected computational cost that is linear in the number of data points; however, the worst-case cost remains quadratic. We introduce two means of improving the computational performance of these methods, both based on parallelizing the dynamic programming approach. We establish that parallelization can give substantial computational improvements: in some situations the computational cost decreases roughly quadratically in the number of cores used. These parallel implementations are no longer guaranteed to find the true minimum of the penalized cost; however, we show that they retain the same asymptotic guarantees in terms of their accuracy in estimating the number and location of the changes.
Supplementary materials for this article are available online.
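One way to picture chunk-based parallelisation is sketched below: the series is split across cores, a penalised-cost method is run on each chunk, and the per-chunk estimates are pooled. Everything here is an assumed illustration (including the stand-in single-split detector used per chunk); the paper's actual merging scheme, which is what preserves the asymptotic guarantees, is not reproduced.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def best_split(args):
    """Stand-in per-chunk method: place one changepoint in the chunk if doing so
    reduces the squared-error cost by more than the penalty."""
    y, offset, penalty = args
    n = len(y)
    total = ((y - y.mean()) ** 2).sum()
    best_tau, best_gain = None, penalty
    for tau in range(2, n - 1):
        cost = (((y[:tau] - y[:tau].mean()) ** 2).sum()
                + ((y[tau:] - y[tau:].mean()) ** 2).sum())
        if total - cost > best_gain:
            best_tau, best_gain = tau, total - cost
    return [] if best_tau is None else [offset + best_tau]

def parallel_changepoints(y, penalty, n_chunks=4):
    """Split the series into contiguous chunks, analyse each chunk in parallel,
    and pool the detected changepoints (chunk-boundary effects are ignored here)."""
    chunks = np.array_split(np.asarray(y, float), n_chunks)
    offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])
    with ProcessPoolExecutor() as pool:
        results = pool.map(best_split, [(c, o, penalty)
                                        for c, o in zip(chunks, offsets)])
    return sorted(cp for r in results for cp in r)

if __name__ == "__main__":
    # toy usage: a single mean shift at time 600; expect a detected change near 600
    rng = np.random.default_rng(6)
    y = np.concatenate([rng.normal(0, 1, 600), rng.normal(3, 1, 400)])
    print(parallel_changepoints(y, penalty=20.0))
```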
We propose a novel stochastic model for the spread of antimicrobial-resistant bacteria in a population, together with an efficient algorithm for fitting such a model to sample data. We introduce an individual-based model for the epidemic, with the state of the model determining which individuals are colonised by the bacteria. The transmission rate of the epidemic takes into account individuals' locations and covariates, seasonality, and environmental effects. The state of our model is only partially observed, with data consisting of test results from individuals from a sample of households. Fitting our model to data is challenging due to the large state space of our model. We develop an efficient SMC2 algorithm to estimate parameters and compare models for the transmission rate. We implement this algorithm in a computationally efficient manner by using the scale invariance properties of the underlying epidemic model. Our motivating application focuses on the dynamics of community-acquired extended-spectrum beta-lactamase-producing Escherichia coli and Klebsiella pneumoniae, using data collected as part of the Drivers of Resistance in Uganda and Malawi project. We infer the parameters of the model and learn key epidemic quantities such as the effective reproduction number, spatial distribution of prevalence, household cluster dynamics, and seasonality.
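As an illustration of the core building block that SMC2-type algorithms rely on, the sketch below is a bootstrap particle filter returning an estimate of the log-likelihood for a generic state-space model. The toy linear-Gaussian transition and observation densities are assumptions for the example and bear no relation to the epidemic model in the paper.

```python
import numpy as np

def particle_filter_loglik(y, theta, n_particles=500, rng=None):
    """Bootstrap particle filter for a toy linear-Gaussian state-space model:
        x_t = theta * x_{t-1} + N(0, 1),   y_t = x_t + N(0, 1).
    Returns an estimate of the log-likelihood log p(y_{1:T} | theta)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 1.0, n_particles)              # particles from the prior
    loglik = 0.0
    for obs in y:
        x = theta * x + rng.normal(0.0, 1.0, n_particles)         # propagate
        logw = -0.5 * (obs - x) ** 2 - 0.5 * np.log(2 * np.pi)    # weight by obs density
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())                  # incremental likelihood estimate
        idx = rng.choice(n_particles, n_particles, p=w / w.sum()) # multinomial resampling
        x = x[idx]
    return loglik

# toy usage: simulate data from the model with theta = 0.8, then evaluate it
rng = np.random.default_rng(3)
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.8 * x_true + rng.normal()
    ys.append(x_true + rng.normal())
print(particle_filter_loglik(np.array(ys), theta=0.8, rng=rng))
```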
Responsible for the majority of bacterial gastroenteritis in the developed world, Campylobacter jejuni is a pervasive pathogen of humans and animals, but its evolution is obscure. In this paper, we exploit contemporary genetic diversity and empirical evidence to piece together the evolutionary history of C. jejuni and quantify its evolutionary potential. Our combined population genetics-phylogenetics approach reveals a surprising picture. Campylobacter jejuni is a rapidly evolving species, subject to intense purifying selection that purges 60% of novel variation, but possessing a massive evolutionary potential. The low mutation rate is offset by a large effective population size so that a mutation at any site can occur somewhere in the population within the space of a week. Recombination has a fundamental role, generating diversity at twice the rate of de novo mutation, and facilitating gene flow between C. jejuni and its sister species Campylobacter coli. We attempt to calibrate the rate of molecular evolution in C. jejuni based solely on within-species variation. The rates we obtain are up to 1,000 times faster than conventional estimates, placing the C. jejuni-C. coli split at the time of the Neolithic revolution. We weigh the plausibility of such recent bacterial evolution against alternative explanations and discuss the evidence required to settle the issue.
Standard MCMC methods can scale poorly to big data settings due to the need to evaluate the likelihood at each iteration. There have been a number of approximate MCMC algorithms that use sub-sampling ideas to reduce this computational burden, but with the drawback that these algorithms no longer target the true posterior distribution. We introduce a new family of Monte Carlo methods based upon a multidimensional version of the Zig-Zag process of Ann. Appl. Probab. 27 (2017) 846–882, a continuous-time piecewise deterministic Markov process. While traditional MCMC methods are reversible by construction (a property which is known to inhibit rapid convergence), the Zig-Zag process offers a flexible nonreversible alternative which we observe to often have favourable convergence properties. We show how the Zig-Zag process can be simulated without discretisation error, and give conditions for the process to be ergodic. Most importantly, we introduce a sub-sampling version of the Zig-Zag process that is an example of an exact approximate scheme; that is, the resulting approximate process still has the posterior as its stationary distribution. Furthermore, if we use a control-variate idea to reduce the variance of our unbiased estimator, then the Zig-Zag process can be super-efficient: after an initial preprocessing step, essentially independent samples from the posterior distribution are obtained at a computational cost which does not depend on the size of the data.
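The process is easiest to picture in one dimension, where the switching events can be simulated without any discretisation error. The sketch below targets a standard Gaussian, for which the switching rate is max(0, theta * x) and the next event time has a closed form; it is only a toy illustration and omits the sub-sampling and control-variate ideas.

```python
import numpy as np

def zigzag_gaussian(T, rng=None):
    """Simulate the 1-d Zig-Zag process targeting a standard Gaussian
    (potential U(x) = x^2 / 2, switching rate lambda(x, theta) = max(0, theta * x))
    up to time T. Returns the piecewise-linear skeleton: event times, positions,
    and the velocity departing from each event."""
    rng = np.random.default_rng() if rng is None else rng
    t, x, theta = 0.0, 0.0, 1.0
    times, xs, thetas = [t], [x], [theta]
    while t < T:
        a = theta * x                      # rate s units ahead is max(0, a + s)
        e = rng.exponential()
        # invert the integrated rate to draw the next switching time exactly
        tau = -a + np.sqrt(a * a + 2 * e) if a > 0 else -a + np.sqrt(2 * e)
        t, x = t + tau, x + theta * tau    # deterministic linear motion between events
        theta = -theta                     # flip the velocity at the event
        times.append(t); xs.append(x); thetas.append(theta)
    return np.array(times), np.array(xs), np.array(thetas)

# toy usage: the time-average of x^2 along the trajectory should be close to 1
times, xs, thetas = zigzag_gaussian(T=2000.0, rng=np.random.default_rng(4))
dt = np.diff(times)
x0, v = xs[:-1], thetas[:-1]
seg_mean_x2 = x0 ** 2 + x0 * v * dt + (v ** 2) * dt ** 2 / 3   # exact per-segment average
print((seg_mean_x2 * dt).sum() / dt.sum())
```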
On-line inference for multiple changepoint problems. Fearnhead, Paul; Liu, Zhen.
Journal of the Royal Statistical Society. Series B, Statistical Methodology, September 2007, Volume 69, Issue 4. Journal article, peer reviewed, open access.
We propose an on-line algorithm for exact filtering of multiple changepoint problems. This algorithm enables simulation from the true joint posterior distribution of the number and position of the changepoints for a class of changepoint models. The computational cost of this exact algorithm is quadratic in the number of observations. We further show how resampling ideas from particle filters can be used to reduce the computational cost to linear in the number of observations, at the expense of introducing small errors, and we propose two new, optimum resampling algorithms for this problem. One, a version of rejection control, allows the particle filter to choose the number of particles that are required at each time step automatically. The new resampling algorithms substantially outperform standard resampling algorithms on examples that we consider, and we demonstrate how the resulting particle filter is practicable for segmentation of human G+C content.
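The exact filtering recursion underlying such algorithms can be sketched for a toy model. The code below is a generic run-length filter (tracking how many of the most recent observations belong to the current segment) for Gaussian data with a conjugate prior on each segment mean and a fixed changepoint probability per step; its support grows by one with each observation, which is why the exact computation is quadratic. It is an assumed illustration, not the authors' particle-filter implementation or their optimal resampling schemes.

```python
import numpy as np
from scipy.stats import norm

def runlength_filter(y, p_cp=0.01, sigma=1.0, mu0=0.0, kappa=5.0):
    """Exact filtering of the run length for a change-in-mean model: within a
    segment y ~ N(mu, sigma^2) with mu ~ N(mu0, kappa^2), and a changepoint
    occurs between consecutive observations with probability p_cp."""
    y = np.asarray(y, float)
    prior_sd = np.sqrt(sigma ** 2 + kappa ** 2)
    log_w = np.array([0.0])              # after y[0] the run length is 1 with certainty
    seg_sum = np.array([y[0]])
    seg_n = np.array([1.0])
    history = [np.exp(log_w)]
    for obs in y[1:]:
        # posterior predictive of obs given each candidate run length
        post_prec = 1.0 / kappa ** 2 + seg_n / sigma ** 2
        post_mean = (mu0 / kappa ** 2 + seg_sum / sigma ** 2) / post_prec
        pred_sd = np.sqrt(sigma ** 2 + 1.0 / post_prec)
        grow = log_w + np.log1p(-p_cp) + norm.logpdf(obs, post_mean, pred_sd)
        change = np.log(p_cp) + norm.logpdf(obs, mu0, prior_sd)  # obs starts a new segment
        log_w = np.concatenate(([change], grow))
        log_w -= np.logaddexp.reduce(log_w)                      # normalise
        history.append(np.exp(log_w))
        seg_sum = np.concatenate(([obs], seg_sum + obs))
        seg_n = np.concatenate(([1.0], seg_n + 1))
    return history

# toy usage: a mean shift at time 150 resets the most probable run length
rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1, 100)])
print(runlength_filter(y)[-1].argmax() + 1)   # index 0 is run length 1; expect about 100
```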
Due to the dimension and the dependency structure of genetic data, composite likelihood methods have found their natural place in the statistical methodology involving such data. After a brief description of the type of data one encounters in population genetic studies, we introduce the questions of interest concerning the main genetic parameters in population genetics, and present an up-to-date review of how composite likelihoods have been used to estimate these parameters.
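As a reminder of the general construction, a composite likelihood multiplies likelihood contributions from low-dimensional marginal or conditional events. A standard special case, written here purely as a generic illustration rather than as any particular estimator from the review, is the pairwise composite log-likelihood

\ell_C(\theta; y) = \sum_{i < j} w_{ij} \, \log f(y_i, y_j; \theta), \qquad \hat{\theta}_C = \arg\max_{\theta} \ell_C(\theta; y),

where the weights w_{ij} \ge 0 are chosen by the analyst and f(y_i, y_j; \theta) is the assumed bivariate marginal density of the pair (y_i, y_j).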