We introduce a new estimator for the vector of coefficients β in the linear model y = Xβ + z, where X has dimensions n × p with p possibly larger than n. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to $\min_{b \in \mathbb{R}^p} \frac{1}{2}\|y - Xb\|_{\ell_2}^2 + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \cdots + \lambda_p |b|_{(p)}$, where $\lambda_1 \geqslant \lambda_2 \geqslant \cdots \geqslant \lambda_p \geqslant 0$ and $|b|_{(1)} \geqslant |b|_{(2)} \geqslant \cdots \geqslant |b|_{(p)}$ are the decreasing absolute values of the entries of b. This is a convex program, and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical ℓ₁ procedures such as the Lasso. Here, the regularizer is a sorted ℓ₁ norm, which penalizes the regression coefficients according to their rank: the higher the rank (that is, the stronger the signal), the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300] procedure (BH), which compares more significant p-values with more stringent thresholds. One notable choice of the sequence {λᵢ} is given by the BH critical values $\lambda_{\mathrm{BH}}(i) = z(1 - i \cdot q/2p)$, where q ∈ (0, 1) and z(α) is the α-quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with λ_BH provably controls FDR at level q. Moreover, it also appears to have appreciable inferential properties under more general designs X while having substantial power, as demonstrated in a series of experiments on both simulated and real data.
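For concreteness, the BH-style regularizing sequence and the sorted-ℓ₁ penalty above can be evaluated in a few lines of NumPy/SciPy. This is an illustrative sketch, not code from the paper; the function names are mine:

```python
import numpy as np
from scipy.stats import norm

def bh_lambdas(p, q):
    """BH critical values: lambda_i = z(1 - i*q/(2p)), where z(alpha)
    is the alpha-quantile of the standard normal distribution."""
    i = np.arange(1, p + 1)
    return norm.ppf(1.0 - i * q / (2.0 * p))

def sorted_l1_penalty(b, lam):
    """Sorted-l1 norm: sum_i lambda_i * |b|_(i), pairing the largest
    |b| entries with the largest (most stringent) lambdas."""
    return float(np.sum(lam * np.sort(np.abs(b))[::-1]))
```

With λ₁ ≥ ⋯ ≥ λ_p ≥ 0 this penalty is a norm, which is what makes the SLOPE program convex.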
Most epidemiology textbooks that discuss models are vague on details of model selection. This lack of detail may be understandable since selection should be strongly influenced by features of the particular study, including contextual (prior) information about covariates that may confound, modify, or mediate the effect under study. It is thus important that authors document their modeling goals and strategies and understand the contextual interpretation of model parameters and model selection criteria. To illustrate this point, we review several established strategies for selecting model covariates, describe their shortcomings, and point to refinements, assuming that the main goal is to derive the most accurate effect estimates obtainable from the data and available resources. This goal shifts the focus to prediction of exposure or potential outcomes (or both) to adjust for confounding; it thus differs from the goal of ordinary statistical modeling, which is to passively predict outcomes. Nonetheless, methods and software for passive prediction can be used for causal inference as well, provided that the target parameters are shifted appropriately.
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual $\ell_1$-norm and the group $\ell_1$-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.
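To make the penalty concrete: the norm is a (possibly weighted) sum of Euclidean norms of the coefficient vector restricted to index subsets that may overlap. A minimal sketch, with a function name of my own rather than any library API:

```python
import numpy as np

def overlapping_group_norm(w, groups, weights=None):
    """Sum of Euclidean norms of w restricted to (possibly overlapping)
    index subsets; singleton groups recover the plain l1 norm."""
    if weights is None:
        weights = [1.0] * len(groups)
    return float(sum(d * np.linalg.norm(w[list(g)])
                     for d, g in zip(weights, groups)))
```

For example, `overlapping_group_norm(w, [[0, 1], [1, 2]])` penalizes coordinate 1 through both groups; which coordinates can be zeroed out jointly is exactly what determines the allowed nonzero patterns discussed in the abstract.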
•A novel method, DPP-VSE, is designed to construct good variable selection ensembles.
•Discrete DPPs are utilized to infer a probability distribution of model size.
•A sample from the distribution specifies the number of variables selected by each member.
•Simulated and real experiments are conducted to study DPP-VSE's performance.
•DPP-VSE outperforms its rivals in most cases and has fewer parameters to specify.
As an effective tool for analyzing high-dimensional data, variable selection plays an increasingly important role in many fields. In recent years, variable selection ensembles (VSEs) have attracted much interest from researchers due to their great potential to improve selection accuracy and to stabilize the results of traditional selection methods. Inspired by a common practice of Bayesian methods, we propose in this paper a novel technique named DPP-VSE, which builds a VSE by utilizing determinantal point processes (DPPs) to infer a distribution of model size. By sampling from this distribution, DPP-VSE can automatically determine the number of variables for a base learner to select. In contrast to other VSE strategies, it has fewer parameters for users to specify. Experiments conducted with both synthetic and real data illustrate that DPP-VSE performs best under most circumstances when evaluated with several metrics. Hence, DPP-VSE can be seen as an effective and easy-to-use method for variable selection problems.
Direct quantification analysis of near-infrared (NIR) spectra is challenging because the number of spectral variables is usually considerably higher than the number of samples. To mitigate the so-called curse of dimensionality, variable selection is often performed before multivariate calibration. There has been much work in this regard, where the developed variable selection methods can be categorized as individual variable selection methods, such as uninformative variable elimination or variable importance in projection, or continuous interval variable selection methods, such as interval partial least squares or moving window partial least squares. In this study, a new individual variable selection method, modified simulated annealing (MSA), was proposed and used in conjunction with the partial least squares regression (PLSR) model. The interpretability of the selected variables in the determination of aflatoxin B1 levels in white rice was assessed. The results revealed that the PLSR model combined with MSA not only yielded higher accuracy than the full-spectrum PLSR but also successfully shrank the variable space. The simplified PLSR model developed using MSA produced satisfactory performance: root mean square errors of calibration (RMSEC) of 0.11 μg/kg and 0.56 μg/kg and root mean square errors of prediction (RMSEP) of 7.16 μg/kg and 14.42 μg/kg were obtained for the low- and high-aflatoxin-B1-level samples, respectively. Specifically, the MSA-based models yielded improvements of 97.80% (calibration set) and 44.62% (prediction set) as well as 95.85% (calibration set) and 62.57% (prediction set) for both datasets when compared with the full-spectrum PLSR (low aflatoxin: RMSEC = 5.02 μg/kg, RMSEP = 12.93 μg/kg; high aflatoxin: RMSEC = 13.50 μg/kg, RMSEP = 38.53 μg/kg).
Compared with the baseline method of simulated annealing (SA) (low aflatoxin: RMSEC = 0.21 μg/kg, RMSEP = 9.78 μg/kg; high aflatoxin: RMSEC = 12.27 μg/kg, RMSEP = 38.53 μg/kg), the MSA significantly improved the predictive performance of the regression models, with the number of selected variables being almost half of that in the SA. Comparisons with other commonly used variable selection methods, namely selectivity ratio (low aflatoxin: RMSEC = 6.09 μg/kg, RMSEP = 13.75 μg/kg; high aflatoxin: RMSEC = 13.74 μg/kg, RMSEP = 41.13 μg/kg), uninformative variable elimination (low aflatoxin: RMSEC = 0.32 μg/kg, RMSEP = 5.11 μg/kg; high aflatoxin: RMSEC = 3.80 μg/kg, RMSEP = 17.76 μg/kg), and variable importance in projection (low aflatoxin: RMSEC = 2.67 μg/kg, RMSEP = 10.71 μg/kg; high aflatoxin: RMSEC = 13.51 μg/kg, RMSEP = 32.53 μg/kg) also indicated the promising efficacy of the proposed MSA.
•The modified simulated annealing (MSA) improved the accuracy of the regression model.
•The proposed MSA reduced the complexity of the regression model.
•Coupling the MSA with the partial least squares model outperformed the other methods.
•The number of variables selected by MSA is almost 50% of that in the baseline method.
•Near-infrared spectroscopy successfully predicted the aflatoxin level in rice samples.
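The abstract does not give the MSA update rule, but the generic simulated-annealing template it modifies can be sketched as follows. This toy version is my own illustration, not the authors' method: it scores a subset by in-sample OLS error plus a small size penalty (OLS standing in for PLSR, and the penalty weight `alpha` an arbitrary choice of mine, needed because in-sample error alone always favors larger subsets):

```python
import numpy as np

rng = np.random.default_rng(0)

def subset_score(X, y, mask, alpha=0.02):
    """In-sample RMSE of an OLS fit on the selected columns, plus a
    small penalty per selected variable (OLS stands in for PLSR)."""
    Xs = X[:, mask]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(np.sqrt(np.mean(resid ** 2))) + alpha * int(mask.sum())

def sa_select(X, y, n_iter=2000, t0=1.0, cooling=0.995):
    """Plain simulated annealing over variable subsets: flip one variable
    per step; accept worse moves with probability exp(-delta / T)."""
    p = X.shape[1]
    mask = rng.random(p) < 0.5
    if not mask.any():
        mask[0] = True
    cur = best = subset_score(X, y, mask)
    best_mask, t = mask.copy(), t0
    for _ in range(n_iter):
        j = int(rng.integers(p))
        mask[j] = ~mask[j]
        if not mask.any():          # keep at least one variable selected
            mask[j] = True
            continue
        new = subset_score(X, y, mask)
        if new < cur or rng.random() < np.exp(-(new - cur) / t):
            cur = new
            if new < best:
                best, best_mask = new, mask.copy()
        else:
            mask[j] = ~mask[j]      # reject: undo the flip
        t *= cooling
    return best_mask, best
```

On synthetic data where only a couple of columns carry signal, this template reliably recovers them; the MSA of the study modifies this basic scheme to halve the number of selected variables relative to plain SA.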
In recent years, the significance of machine learning in agriculture has surged, particularly in post-harvest monitoring for sustainable aquaculture. Challenges like heterogeneity, irrelevant variables and multicollinearity hinder the implementation of smart monitoring systems. This study investigates heterogeneity among the drying parameters that determine moisture content removal during seaweed drying, a topic that has received limited attention, particularly within agriculture. Additionally, a heterogeneity model within machine learning algorithms is proposed to enhance accuracy in predicting seaweed moisture content removal: before and after the removal of heterogeneity parameters, and after the reintroduction of single-eliminated heterogeneity parameters. The dataset consists of 1914 observations with 29 independent variables, but this study narrows down to five: temperature (T1, T4, T7), humidity (H5), and solar radiation (PY). These variables are interacted up to second-order interactions, resulting in 55 variables. Variance inflation factors and boxplots are employed to identify heterogeneity parameters. Two predictive machine learning models, namely random forest and elastic net, are then utilized to identify the 15 and 20 most important parameters for seaweed moisture content removal. Evaluation metrics (MSE, SSE, MAPE, and R-squared) are used to assess model performance. Results demonstrate that the random forest model outperforms the elastic net model, with higher accuracy and lower error, both before and after removing heterogeneity parameters, and even after reintroducing single-eliminated heterogeneity parameters. Notably, the random forest model exhibits higher accuracy before excluding heterogeneity parameters.
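Of the diagnostics mentioned, the variance inflation factor is easy to state precisely: VIF_j = 1 / (1 − R²_j), where R²_j is from regressing column j on all remaining columns (with an intercept). A minimal NumPy sketch of that definition:

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R2_j), where R2_j
    comes from regressing column j on the other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - (1.0 - resid @ resid / tss))  # = tss / rss
    return out
```

Independent columns give VIF near 1; a column that nearly duplicates another (the multicollinearity the study screens for) gives a very large VIF.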
We propose a new empirical Bayes approach for inference in the p » n normal linear model. The novelty is the use of data in the prior in two ways, for centering and regularization. Under suitable sparsity assumptions, we establish a variety of concentration rate results for the empirical Bayes posterior distribution, relevant for both estimation and model selection. Computation is straightforward and fast, and simulation results demonstrate the strong finite-sample performance of the empirical Bayes model selection procedure.
This paper is about variable selection with the random forests algorithm in the presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task that becomes even more challenging in the presence of highly correlated predictors. First, we provide a theoretical study of the permutation importance measure for an additive regression model. This allows us to describe how the correlation between predictors impacts the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates variables, using the permutation importance measure as a ranking criterion. Next, various simulation experiments illustrate the efficiency of the RFE algorithm for selecting a small number of variables together with a good prediction error. Finally, the selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.
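The permutation importance studied here measures how much the prediction error grows when one predictor's values are randomly shuffled, which breaks its association with the response while preserving its marginal distribution. A model-agnostic sketch (the function name and signature are mine):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=None):
    """Mean increase in MSE when one column is shuffled: breaks that
    column's link with y but keeps its marginal distribution."""
    rng = np.random.default_rng(seed)
    base = float(np.mean((y - predict(X)) ** 2))
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            imp[j] += float(np.mean((y - predict(Xp)) ** 2)) - base
    return imp / n_repeats
```

RFE as described in the abstract then repeatedly drops the variable with the lowest such importance and refits, which mitigates the distortion that correlated predictors cause in a single importance ranking.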
In the pivotal variable selection problem, we derive the exact nonasymptotic minimax selector over the class of all s-sparse vectors, which is also the Bayes selector with respect to the uniform prior. While this optimal selector is, in general, not realizable in polynomial time, we show that its tractable counterpart (the scan selector) attains the minimax expected Hamming risk to within factor 2, and is also exact minimax with respect to the probability of wrong recovery. As a consequence, we establish explicit lower bounds under the monotone likelihood ratio property and we obtain a tight characterization of the minimax risk in terms of the best separable selector risk. We apply these general results to derive necessary and sufficient conditions of exact and almost full recovery in the location model with light tail distributions and in the problem of group variable selection under Gaussian noise and under more general anisotropic sub-Gaussian noise. Numerical results illustrate our theoretical findings.
•The optimized continuous wavelet transform (CWT) was used to preprocess soil spectra.
•Extreme learning machine combined with CARS, SPA, MCUVE and GA was applied to determine pH of lime concretion black soil.
•The GA-ELM and CARS-ELM achieved good results for determining pH in lime concretion black soil using Vis-NIR spectroscopy.
•The prediction mechanism of soil pH using Vis-NIR spectroscopy in lime concretion black soil was presented.
Variable selection is widely accepted as an important step in the quantitative analysis of visible and near-infrared (Vis-NIR) spectroscopy, as it tends to improve a model's robustness and predictive ability. In this study, a total of 140 lime concretion black soil samples were collected from two towns in Guoyang County, China. The Vis-NIR spectra measured in the laboratory were used to estimate soil pH by an extreme learning machine (ELM). First, the soil spectra were treated by the optimized continuous wavelet transform (CWT), and then four spectral feature selection methods (competitive adaptive reweighted sampling, CARS; successive projections algorithm, SPA; Monte Carlo uninformative variable elimination, MCUVE; genetic algorithm, GA) were applied with ELM in the CWT domain to identify the technique with the best predictive performance. For comparison, the PLS and SVM models were also developed. The coefficient of determination (R2), root mean square error (RMSE), and residual prediction deviation (RPD) were used to evaluate model performance. Based on the validation dataset, the performance of the ELM models was superior to that of the PLS and SVM models, except for SPA and MCUVE. Among the ELM models, the order of prediction accuracy was GA-ELM (R2p = 0.86; RMSEp = 0.1484; RPD = 2.64), CARS-ELM (R2p = 0.84; RMSEp = 0.1565; RPD = 2.50), ELM (R2p = 0.84; RMSEp = 0.1572; RPD = 2.49), SPA-ELM (R2p = 0.84; RMSEp = 0.1589; RPD = 2.47) and MCUVE-ELM (R2p = 0.83; RMSEp = 0.1599; RPD = 2.45). The proposed CARS-ELM method had a relatively strong ability for spectral variable selection while retaining excellent prediction accuracy and a short computing time (0.39 s). In addition, the variables selected by the four methods (CARS, SPA, MCUVE and GA) indicated that the prediction mechanism for pH in lime concretion black soil may be the relationship between pH and iron oxides and organic matter.
In conclusion, CARS-ELM has great potential to accurately determine the pH in lime concretion black soil using Vis-NIR spectroscopy.
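The three evaluation metrics used throughout (R², RMSE, RPD) are standard in chemometrics and can be computed directly; RPD is the standard deviation of the reference values divided by the RMSE, so RPD > 2 is commonly read as a usable calibration. A small sketch with a function name of my own:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R2, RMSE, and RPD (sample SD of reference values / RMSE),
    as used to score calibration models in chemometrics."""
    resid = y_true - y_pred
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - float(resid @ resid) / ss_tot
    rpd = float(np.std(y_true, ddof=1)) / rmse
    return r2, rmse, rpd
```

These are the quantities behind figures such as "R2p = 0.86; RMSEp = 0.1484; RPD = 2.64" reported for the validation set.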