Supervised compression of big data Joseph, V. Roshan; Mak, Simon
Statistical Analysis and Data Mining, June 2021, Volume 14, Issue 3
Journal Article
Peer-reviewed
The phenomenon of big data has become ubiquitous in nearly all disciplines, from science to engineering. A key challenge is the use of such data for fitting statistical and machine learning models, which can incur high computational and storage costs. One solution is to perform model fitting on a carefully selected subset of the data. Various data reduction methods have been proposed in the literature, ranging from random subsampling to optimal experimental design-based methods. However, when the goal is to learn the underlying input–output relationship, such reduction methods may not be ideal, since they do not make use of information contained in the output. To this end, we propose a supervised data compression method called supercompress, which integrates output information by sampling data from regions most important for modeling the desired input–output relationship. An advantage of supercompress is that it is nonparametric: the compression method does not rely on parametric modeling assumptions between inputs and output. As a result, the proposed method is robust to a wide range of modeling choices. We demonstrate the usefulness of supercompress over existing data reduction methods, in both simulations and a taxicab predictive modeling application.
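To give a concrete (if greatly simplified) sense of output-aware subsampling, the sketch below clusters the data in a joint input-output space so that the retained points concentrate where the response varies most, and returns the cluster centers as the compressed dataset. The function name toy_supervised_compress and the scaling constant lam are ours for illustration; this is a rough analogue of the idea, not the supercompress algorithm from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def toy_supervised_compress(X, y, n_points=200, lam=1.0):
        # Toy output-aware compression: k-means in the joint space [X, lam * y].
        # Scaling the response by lam pulls cluster centers toward regions where
        # the input-output relationship changes quickly. Illustration only; this
        # is not the supercompress algorithm.
        Z = np.column_stack([X, lam * y])
        km = KMeans(n_clusters=n_points, n_init=10).fit(Z)
        centers = km.cluster_centers_
        return centers[:, :-1], centers[:, -1] / lam

    # Example: compress 100,000 noisy observations of a 2-d test function to 200 points.
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(100_000, 2))
    y = np.sin(8 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(100_000)
    Xc, yc = toy_supervised_compress(X, y, n_points=200)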
Because of the importance, limited supply, and perishable nature of blood products, effective management of blood collection is critical for high-quality healthcare delivery. In this paper, working closely with the American Red Cross (ARC), we study a blood collection problem focusing on whole blood that is to be processed into cryoprecipitate (cryo), a critical blood product for controlling massive hemorrhaging. In particular, we aim to determine when and from which mobile collection sites to collect blood for cryo production, such that the weekly collection target is met while the collection costs are minimized. The cryo collection problem imposes a unique challenge: blood that is to be processed into cryo units must be processed within eight hours of collection, whereas this time limit is 24 hours for most other blood products. To analyze the cryo collection problem, we first develop a mathematical program to represent and compare two different blood collection business models, namely, the status quo nonsplit model and an alternative model we propose, which splits each collection window into two intervals and allows different types of collections in the two intervals. We then establish several structural properties of the proposed mathematical program and develop a near-optimal solution algorithm to determine the cryo collection schedules under each collection model. Our extensive computational analyses based on real data indicate that, compared with the status quo, our proposed collection model can significantly reduce total collection costs. Based on this significant potential impact, our proposed collection model has been implemented by the ARC Douglasville manufacturing facility, the largest ARC blood manufacturing facility, which supplies blood to about 120 hospitals in the southern United States. Field data collected after implementation indicate that our proposed solution has resulted in (i) a reduction in inconsistencies in the supply of cryo collections and (ii) an approximately 40% reduction in the per-unit collection cost for cryo. Because of this success, the ARC is now at the stage of rolling out our proposed solution approach to other regions in the nation.
Population Quasi-Monte Carlo Huang, Chaofan; Joseph, V. Roshan; Mak, Simon
Journal of Computational and Graphical Statistics, July 2022, Volume 31, Issue 3
Journal Article
Peer-reviewed
Open access
Monte Carlo methods are widely used for approximating complicated, multidimensional integrals for Bayesian inference. Population Monte Carlo (PMC) is an important class of Monte Carlo methods, which adapts a population of proposals to generate weighted samples that approximate the target distribution. When the target distribution is expensive to evaluate, PMC may encounter computational limitations, since it requires many evaluations of the target distribution. To address this, we propose a new method, Population Quasi-Monte Carlo (PQMC), which integrates Quasi-Monte Carlo ideas within the sampling and adaptation steps of PMC. A key novelty in PQMC is the idea of importance support points resampling, a deterministic method for finding an "optimal" subsample from the weighted proposal samples. Moreover, within the PQMC framework, we develop an efficient covariance adaptation strategy for multivariate normal proposals. Finally, a new set of correction weights is introduced for the weighted PMC estimator to improve its efficiency over the standard PMC estimator. We demonstrate the improved empirical performance of PQMC over PMC in extensive numerical simulations and a friction drilling application.
Supplementary materials for this article are available online.
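As a rough illustration of the QMC-within-PMC idea described in the abstract above, the sketch below runs one importance-sampling step in which the proposal draws come from scrambled Sobol points pushed through the inverse normal CDF rather than from pseudo-random numbers. The function pmc_step_with_qmc and its arguments are hypothetical names of ours; the sketch omits the importance support points resampling, the covariance adaptation, and the corrected weights that are the actual contributions of the paper.

    import numpy as np
    from scipy.stats import norm, multivariate_normal, qmc

    def pmc_step_with_qmc(log_target, mean, cov, m=10):
        # One importance-sampling step driven by quasi-Monte Carlo points:
        # 2**m scrambled Sobol points are mapped through the inverse normal CDF
        # to a multivariate normal proposal N(mean, cov), and normalized
        # importance weights are computed against the (log) target density.
        d = len(mean)
        u = qmc.Sobol(d, scramble=True).random_base2(m)   # 2**m points in (0, 1)^d
        z = norm.ppf(u)                                   # standard normal draws
        L = np.linalg.cholesky(cov)
        x = mean + z @ L.T                                # proposal samples
        log_w = log_target(x) - multivariate_normal(mean, cov).logpdf(x)
        w = np.exp(log_w - log_w.max())
        return x, w / w.sum()

    # Example: weighted-sample approximation of the mean of a banana-shaped target.
    log_target = lambda x: -0.5 * (x[:, 0]**2 + (x[:, 1] - x[:, 0]**2)**2)
    x, w = pmc_step_with_qmc(log_target, mean=np.zeros(2), cov=np.eye(2), m=12)
    post_mean = (w[:, None] * x).sum(axis=0)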
Statistical models are commonly used in quality-improvement studies. However, such models tend to perform poorly when predictions are made away from the observed data points. On the other hand, engineering models derived using the underlying physics of the process do not always match satisfactorily with reality. This article proposes engineering-statistical models that overcome the disadvantages of engineering models and statistical models. The engineering-statistical model is obtained through some adjustments to the engineering model using experimental data. The adjustments are done in a sequential way and are based on empirical Bayes methods. We also develop approximate frequentist procedures for adjustments that are computationally much easier to implement. The usefulness of the methodology is illustrated using a problem of predicting surface roughness in a microcutting process and the optimization of a spot-welding process.
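A generic way to see how experimental data can adjust an engineering model is sketched below: fit a flexible correction to the residuals between the engineering model and the observations, then predict with the engineering model plus that correction. The function adjust_engineering_model is ours, and the Gaussian-process correction is a stand-in for illustration; the article's sequential empirical Bayes adjustments (and their frequentist approximations) are more structured than this.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    def adjust_engineering_model(eng_model, X_obs, y_obs):
        # Fit a GP to the residuals y_obs - eng_model(X_obs) and return a predictor
        # eng_model(x) + correction(x). A generic illustration of adjusting an
        # engineering model with data, not the paper's empirical Bayes procedure.
        resid = y_obs - eng_model(X_obs)
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X_obs, resid)
        return lambda X_new: eng_model(X_new) + gp.predict(X_new)

    # Example: a crude physics model plus a correction learned from 20 experiments.
    eng_model = lambda X: 2.0 * X[:, 0]
    rng = np.random.default_rng(1)
    X_obs = rng.uniform(size=(20, 1))
    y_obs = 2.0 * X_obs[:, 0] + 0.5 * np.sin(6 * X_obs[:, 0]) + 0.05 * rng.standard_normal(20)
    predict = adjust_engineering_model(eng_model, X_obs, y_obs)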
Space-filling properties are important in designing computer experiments. The traditional maximin and minimax distance designs consider only space-filling in the full-dimensional space; this can result in poor projections onto lower-dimensional spaces, which is undesirable when only a few factors are active. Restricting maximin distance design to the class of Latin hypercubes can improve one-dimensional projections but cannot guarantee good space-filling properties in larger subspaces. We propose designs that maximize space-filling properties on projections to all subsets of factors. We call our designs maximum projection designs. Our design criterion can be computed at no more cost than a design criterion that ignores projection properties.
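The projection-oriented criterion can be evaluated directly from the design matrix. The sketch below computes a MaxPro-type criterion that, for every pair of design points, accumulates the reciprocal of the product of squared coordinate-wise differences, so a pair that nearly coincides in any single coordinate is penalized heavily; this is what pushes the design to fill space in every projection. The formula and the helper maxpro_criterion reflect our reading of the published criterion (the article itself is the authority), and the search over candidate designs is not shown.

    import numpy as np
    from scipy.stats import qmc

    def maxpro_criterion(D):
        # MaxPro-type criterion for an n x p design D in [0, 1]^p (smaller is better):
        # average over all point pairs of 1 / prod_k (x_ik - x_jk)^2, then take the
        # 1/p-th power. Pairs that almost coincide in any coordinate blow the
        # criterion up, which enforces good projections onto every subset of factors.
        n, p = D.shape
        diff2 = (D[:, None, :] - D[None, :, :]) ** 2
        pair_products = diff2.prod(axis=2)[np.triu_indices(n, k=1)]
        return (1.0 / pair_products).mean() ** (1.0 / p)

    # A Latin hypercube design typically scores better (lower) than a random design.
    D_lhs = qmc.LatinHypercube(d=4, seed=0).random(20)
    D_rand = np.random.default_rng(0).uniform(size=(20, 4))
    print(maxpro_criterion(D_lhs), maxpro_criterion(D_rand))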
Inverse distance weighting (IDW) is a simple method for multivariate interpolation but has poor prediction accuracy. In this article we show that the prediction accuracy of IDW can be substantially improved by integrating it with a linear regression model. This new predictor is quite flexible, computationally efficient, and works well in problems having high dimensions and/or large datasets. We also develop a heuristic method for constructing confidence intervals for prediction. This article has supplementary material online.
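One simple way to combine the two ingredients mentioned in the abstract is sketched below: a global linear fit captures the trend, and inverse distance weighting interpolates its residuals so that the combined predictor passes (essentially) through the training data. The function idw_regression_predict, the power parameter, and the small eps jitter are our choices for illustration; this is not claimed to be the article's exact predictor.

    import numpy as np

    def idw_regression_predict(X_train, y_train, X_new, power=2.0, eps=1e-12):
        # Global linear trend by least squares, plus inverse-distance-weighted
        # interpolation of the residuals. A simple illustration of blending IDW
        # with linear regression, not necessarily the article's exact predictor.
        A = np.column_stack([np.ones(len(X_train)), X_train])
        beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        resid = y_train - A @ beta
        trend = np.column_stack([np.ones(len(X_new)), X_new]) @ beta
        d = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
        w = 1.0 / (d ** power + eps)                 # inverse distance weights
        w /= w.sum(axis=1, keepdims=True)
        return trend + w @ resid

    # Example: interpolate a smooth 3-d function from 500 scattered points.
    rng = np.random.default_rng(2)
    X_train = rng.uniform(size=(500, 3))
    y_train = X_train.sum(axis=1) + np.sin(4 * X_train[:, 0])
    print(idw_regression_predict(X_train, y_train, rng.uniform(size=(5, 3))))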
Orthogonal Gaussian Process Models Plumlee, Matthew; Joseph, V. Roshan
Statistica Sinica, April 2018, Volume 28, Issue 2
Journal Article
Peer-reviewed
Open access
Gaussian process models are widely adopted for nonparametric/semiparametric modeling. Identifiability issues occur when the mean model contains polynomials with unknown coefficients. Though the resulting prediction is unaffected, this leads to poor estimation of the coefficients in the mean model, and thus the estimated mean model loses interpretability. To address this issue, this paper introduces a new Gaussian process model whose stochastic part is orthogonal to the mean part. The paper also discusses applications to multi-fidelity simulations using data examples.
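The orthogonality idea can be shown in a finite-dimensional toy version: over a fixed set of points, project a standard Gaussian process covariance onto the complement of the span of the mean basis, so that draws from the adjusted covariance are exactly orthogonal to that basis on those points. The helper orthogonalized_covariance below is ours and is only a discrete analogue; the paper constructs the orthogonal process directly through its covariance function, which this sketch does not reproduce.

    import numpy as np

    def orthogonalized_covariance(K, F):
        # Discrete analogue of an "orthogonal" GP: given a covariance matrix K on a
        # fixed point set and a basis matrix F (mean functions evaluated at those
        # points), return (I - P) K (I - P)', where P projects onto the columns of F.
        # Any draw z ~ N(0, K_orth) then satisfies F.T @ z = 0 on this point set.
        P = F @ np.linalg.solve(F.T @ F, F.T)
        I = np.eye(K.shape[0])
        return (I - P) @ K @ (I - P).T

    # Example: squared-exponential covariance on a 1-d grid, orthogonal to {1, x}.
    x = np.linspace(0.0, 1.0, 50)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)
    F = np.column_stack([np.ones_like(x), x])
    K_orth = orthogonalized_covariance(K, F)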
Space-filling designs such as Latin hypercube designs (LHDs) are widely used in computer experiments. However, finding an optimal LHD with good space-filling properties is computationally cumbersome. On the other hand, the well-established factorial designs in physical experiments are unsuitable for computer experiments owing to the redundancy of design points when projected onto a subset of the factor space. In this work, we present a new class of space-filling designs developed by splitting two-level factorial designs into multiple layers. The method takes advantage of many available results in factorial design theory, and therefore the proposed multi-layer designs (MLDs) are easy to generate. Moreover, our numerical study shows that MLDs can have better space-filling properties than optimal LHDs.
Technometrics started in 1959 and quickly established itself as a premier journal of statistics in the physical, chemical, and engineering sciences. It is jointly published by the American Statistical Association and the American Society for Quality. Technometrics is one of the oldest journals of statistics. Here, Joseph discusses the editorial policies and manuscript processing of Technometrics.
Computational modeling is a popular tool for understanding a diverse set of complex systems. The output from a computational model depends on a set of parameters that are unknown to the designer, but a modeler can estimate them by collecting physical data. In the described study of the ion channels of ventricular myocytes, the parameter of interest is a function as opposed to a scalar or a set of scalars. This article develops a new modeling strategy to nonparametrically study the functional parameter using Bayesian inference with Gaussian process prior distributions. A new sampling scheme is devised to address this unique problem.