Optimal ratio for data splitting
Joseph, V. Roshan
Statistical Analysis and Data Mining, August 2022, Volume 15, Issue 4
Journal article; peer-reviewed; open access
It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is √p : 1, where p is the number of parameters in a linear regression model that explains the data well.
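As a quick illustration of the √p : 1 rule (a minimal sketch, not code from the article; the function name is my own):

```python
import math

def split_sizes(n: int, p: int) -> tuple[int, int]:
    """Training/testing set sizes under the sqrt(p):1 rule,
    where p is the number of parameters in the regression model."""
    ratio = math.sqrt(p)                      # training : testing = ratio : 1
    n_train = round(n * ratio / (ratio + 1))
    return n_train, n - n_train

# e.g. 1000 rows and a linear model with 9 parameters give a 3:1 split
print(split_sizes(1000, 9))  # (750, 250)
```

For p = 1 the rule recovers the familiar 50/50 split, and the training share grows slowly as the model gains parameters.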
Support points
Mak, Simon; Joseph, V. Roshan
The Annals of Statistics, December 2018, Volume 46, Issue 6A
Journal article; peer-reviewed; open access
This paper introduces a new way to compact a continuous probability distribution F into a set of representative points called support points. These points are obtained by minimizing the energy distance, a statistical potential measure initially proposed by Székely and Rizzo (InterStat 5 (2004) 1–6) for testing goodness-of-fit. The energy distance has two appealing features. First, its distance-based structure allows us to exploit the duality between powers of the Euclidean distance and its Fourier transform for theoretical analysis. Using this duality, we show that support points converge in distribution to F and enjoy an improved error rate over Monte Carlo for integrating a large class of functions. Second, the minimization of the energy distance can be formulated as a difference-of-convex program, which we manipulate using two algorithms to efficiently generate representative point sets. In simulation studies, support points provide improved integration performance over both Monte Carlo and a specific quasi-Monte Carlo method. Two important applications of support points are then highlighted: (a) as a way to quantify the propagation of uncertainty in expensive simulations and (b) as a method to optimally compact Markov chain Monte Carlo (MCMC) samples in Bayesian computation.
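The sample energy distance can be computed directly from pairwise Euclidean distances. The sketch below is my own minimal illustration of that statistic, not the paper's difference-of-convex algorithm; support points are the set x that minimizes this quantity for a fixed sample y drawn from F:

```python
import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Sample energy distance between point sets x (n, d) and y (m, d):
    2*E||X - Y|| - E||X - X'|| - E||Y - Y'||, with each expectation
    replaced by the average pairwise Euclidean distance."""
    def mean_dist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

# identical sets have zero energy distance; well-separated sets do not
x = np.array([[0.0], [1.0]])
y = np.array([[10.0], [11.0]])
print(energy_distance(x, x))  # 0.0
print(energy_distance(x, y))  # 2*10 - 0.5 - 0.5 = 19.0
```

The O(nm) pairwise computation shown here is the naive version; the paper's algorithms avoid forming all pairs at once to scale to large samples.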
• An efficient surrogate model is built from a computationally expensive spherical indentation finite element model.
• Unknown constitutive parameters are extracted from instrumented spherical indentation experiments via Bayesian inference.
• Constitutive parameter uncertainties are quantified through the established posterior probability densities.
• Numerical and experimental data are utilized to test the efficacy of the proposed method.
Instrumented indentation enables rapid characterization of mechanical behavior in small material volumes. The heterogeneous deformation fields beneath the indenter, however, make it difficult to infer the intrinsic constitutive properties (e.g., Young's modulus, yield strength). This inverse problem is addressed in the literature using optimization techniques that are generally unable to yield robust values for the properties of interest and cannot quantify property uncertainty. Furthermore, current approaches tend to exhibit very high sensitivity to the error definitions and the optimization techniques employed. In order to overcome these difficulties, we propose an alternate approach that involves two main steps: (i) development of a Gaussian process (or kriging) surrogate model using finite element models of spherical indentation, and (ii) inverse solution using a Bayesian framework and Markov chain Monte Carlo sampling. The approach is demonstrated using selected case studies.
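The second step, posterior sampling over the cheap surrogate, can be sketched with a generic random-walk Metropolis sampler. Everything below is my own toy sketch: a simple Gaussian log-posterior stands in for the indentation surrogate, and none of the names come from the paper.

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampler: at each step, propose a
    Gaussian perturbation of theta and accept with probability
    min(1, posterior ratio), evaluated via the surrogate log_post."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    samples = []
    for _ in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)

# toy stand-in posterior over two "constitutive parameters",
# centered at (1.0, 2.0) with unit variance
log_post = lambda t: -0.5 * np.sum((t - np.array([1.0, 2.0])) ** 2)
chain = metropolis(log_post, theta0=[0.0, 0.0])
print(chain[2500:].mean(axis=0))   # posterior mean, near [1.0, 2.0]
```

In the paper's setting, `log_post` would combine the GP surrogate's prediction of the indentation response with the measurement-error model and priors.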
We discuss the problem of approximating a deterministic function using Gaussian processes (GPs). The role of transformation in GP modeling is not well understood. We argue that transformation of the response can be used for making the deterministic function approximately additive, which can then be easily estimated using an additive GP. We call such a GP a transformed additive Gaussian (TAG) process. To capture possible interactions which are unaccounted for in an additive model, we propose an extension of the TAG process called transformed approximately additive Gaussian (TAAG) process. We develop efficient techniques for fitting a TAAG process. In fact, we show that it can be fitted to high-dimensional data much more efficiently than a standard GP. Furthermore, we show that the use of the TAAG process leads to better estimation, interpretation, visualization, and prediction. The proposed methods are implemented in the R package TAG.
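A minimal illustration of the transformation idea (my own toy example, not taken from the article): the product f(x1, x2) = x1 · x2 is not additive in its inputs, but its logarithm is exactly additive, so an additive GP fitted on the log scale would capture it with no interaction terms.

```python
import numpy as np

# f(x1, x2) = x1 * x2 has an interaction, but
# log f = log x1 + log x2 decomposes into one term per input.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 2.0, size=(100, 2))
f = x[:, 0] * x[:, 1]
g = np.log(f)
additive_part = np.log(x[:, 0]) + np.log(x[:, 1])
print(np.max(np.abs(g - additive_part)))  # ~0, floating point only
```

The TAG process estimates such a transformation from data rather than assuming it is known, and TAAG adds back any interactions the additive model misses.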
In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
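The sequential nearest-neighbor step can be sketched as follows. This is a simplified illustration under my own naming, not SPlit's actual implementation: each reference point (in SPlit, a support point) claims its nearest still-unused row of the dataset.

```python
import numpy as np

def nearest_neighbor_subsample(data: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Sequentially match each reference point to its nearest
    still-unused row of `data`; returns the selected row indices."""
    available = np.ones(len(data), dtype=bool)
    chosen = []
    for p in points:
        d = np.linalg.norm(data - p, axis=1)
        d[~available] = np.inf            # exclude rows already taken
        i = int(np.argmin(d))
        chosen.append(i)
        available[i] = False
    return np.array(chosen)

data = np.array([[0.0], [1.0], [2.0], [3.0]])
idx = nearest_neighbor_subsample(data, np.array([[0.9], [1.1]]))
print(idx)   # [1 2]: row 1 is taken first, so the second point falls back to row 2
```

The rows selected this way form the testing set, and the remainder becomes the training set.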
Improving the quality of a product or process using a computer simulator is a much less expensive option than real physical testing. However, simulation using computationally intensive computer models can be time consuming, and therefore directly optimizing on the computer simulator can be infeasible. Experimental design and statistical modeling techniques can be used to overcome this problem. This article reviews experimental designs known as space-filling designs that are suitable for computer simulations. Special emphasis is given to a recently developed space-filling design called the maximum projection design. Its advantages are illustrated using a simulation conducted for optimizing a milling process.
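As background, a basic Latin hypercube is the common space-filling baseline that maximum projection designs refine by optimizing all lower-dimensional projections. The sketch below generates a plain Latin hypercube only; it is not the MaxPro construction.

```python
import numpy as np

def latin_hypercube(n: int, d: int, seed: int = 0) -> np.ndarray:
    """n-point Latin hypercube in [0, 1]^d: each column places exactly
    one point in each of the n equal strata, in shuffled order."""
    rng = np.random.default_rng(seed)
    design = np.empty((n, d))
    for j in range(d):
        design[:, j] = (rng.permutation(n) + rng.uniform(size=n)) / n
    return design

design = latin_hypercube(8, 3)
# every one-dimensional projection hits each of the 8 strata exactly once
print(np.sort((design * 8).astype(int), axis=0)[:, 0])  # [0 1 2 3 4 5 6 7]
```

A maximum projection design additionally chooses the within-stratum placement to keep the points well spread in every projection of the input space.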
Data Twinning
Vakayil, Akhil; Joseph, V. Roshan
Statistical Analysis and Data Mining, October 2022, Volume 15, Issue 5
Journal article; peer-reviewed; open access
In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross-validation.
Understanding the uncertainty in simulation outputs is important for careful decision-making regarding a machining process. However, Monte Carlo-based methods cannot be used for evaluating the uncertainty when the simulations are computationally expensive. An alternative approach is to build an easy-to-evaluate emulator to approximate the computer model and run the Monte Carlo simulations on the emulator. Although this approach is very promising, it becomes inefficient when the computer model is highly nonlinear and the region of interest is large. Most machining simulations are of this kind because the output is affected by several quantitative factors (such as the workpiece material properties, cutting tool parameters, and process parameters) whose effects can change depending on qualitative factors (such as the type of materials, tool designs, and tool paths). Because the number of levels of the qualitative factors can range from tens to thousands, building an accurate emulator is not an easy task. This article proposes a new approach, called an in situ emulator, to overcome this problem. The idea is to build an emulator for the user-specified levels of the qualitative factors and inside the local region defined by the input uncertainty distribution of the quantitative factors. Efficient experimental design and statistical modeling techniques are used for constructing the in situ emulator. The approach is illustrated using the simulations of two solid end milling processes.