Calorific value (CV) reflects a plant's capacity for material flow and energy conversion and is a key index of combustion properties for the development and utilization of energy plants. However, the standard method for determining the CV of solid fuels is bomb calorimetry, performed in the laboratory on powdered samples, which precludes rapid, non-destructive prediction for large numbers of samples in the natural environment. Visible and near-infrared spectroscopy (Vis-NIR) has been widely proposed as a replacement for laboratory determination in property prediction; however, chemometrics is essential for spectral analysis. Various chemometric methods, including competitive adaptive reweighted sampling (CARS), lifting wavelet transform (LWT), the successive projections algorithm (SPA), and convolutional neural networks (CNNs) optimized by the whale optimization algorithm (WOA), were employed to optimize the models. Additionally, canopy spectra were measured in the field rather than collecting spectra of powdered samples in the laboratory. The results demonstrated that CARS-WOA-CNN was the best model for predicting CV and ash content (AC), with R² values of 0.858 and 0.751, respectively. Compared with the raw full spectra, the spectral dimension was reduced from 2048 variables to 93 for CV and 22 for AC. Overall, this study provides a meaningful strategy for harvest planning and for assessing the value of biomass in the field.
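The CARS step in pipelines like this is what shrinks thousands of wavelengths to a small informative subset. Below is a minimal, illustrative sketch of the idea built on scikit-learn's PLSRegression; it is a stand-in for the paper's actual implementation (the deterministic top-k retention, the 80% Monte Carlo fraction, and all names are ours, and the WOA-tuned CNN stage is omitted).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def cars_sketch(X, y, n_iter=50, n_components=5, seed=0):
    """Simplified CARS-style wavelength selection coupled with PLS.

    Each run draws a Monte Carlo subset of samples, fits PLS, ranks
    wavelengths by |regression coefficient|, and retains a fraction
    that shrinks along an exponentially decreasing schedule.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    keep = np.arange(p)                        # currently retained wavelengths
    k = np.log(p / 2.0) / (n_iter - 1)         # decay constant: schedule ends at 2 vars
    best_rmse, best_keep = np.inf, keep
    for i in range(n_iter):
        idx = rng.choice(n, size=int(0.8 * n), replace=False)   # Monte Carlo sampling
        ncomp = min(n_components, len(keep), len(idx) - 1)
        pls = PLSRegression(n_components=ncomp).fit(X[np.ix_(idx, keep)], y[idx])
        coef = np.abs(pls.coef_).ravel()
        n_keep = max(2, int(round(p * np.exp(-k * i))))         # shrinking retention
        keep = keep[np.argsort(coef)[::-1][:min(n_keep, len(keep))]]
        # score the shrunken subset by 5-fold cross-validated RMSE
        rmse = -cross_val_score(
            PLSRegression(n_components=min(n_components, len(keep))),
            X[:, keep], y, cv=5,
            scoring="neg_root_mean_squared_error").mean()
        if rmse < best_rmse:
            best_rmse, best_keep = rmse, keep.copy()
    return best_keep                           # e.g. 2048 wavelengths -> a few dozen
```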
As a crucial indicator of forest growth and quality, aboveground biomass (AGB) estimation plays a key role in monitoring the global carbon cycle and in forest health assessment. Novel methods and applications in remote sensing technology can greatly reduce investigation time and cost and therefore have the potential to estimate AGB efficiently. Random forest (RF), combined with remote sensing images, is a popular machine learning method that has been widely used for AGB estimation. However, the accuracy of ordinary linear variable selection methods for AGB estimation in coniferous forests is challenged by the complexity of these forest biomes. In this study, spectral variables (spectral reflectance and vegetation indices), land surface temperature (LST) and soil moisture were extracted from the Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) of Landsat 8, and optimized RF regressions were established to estimate the AGB of coniferous forests on the Wangyedian forest farm, Inner Mongolia, Northeast China. We applied one linear index (the Pearson correlation coefficient (PC)) and four nonlinear indices (Kendall's τ coefficient (KC), the Spearman coefficient (SC), the distance correlation coefficient (DC) and the importance index) to select variables and establish optimized RF regressions for AGB estimation. The results showed that all the nonlinear indices yielded significantly lower estimation errors than the linear index, with the importance index achieving the minimum root mean square error (RMSE) of 40.92 Mg/ha among the nonlinear indices. In addition, including LST and soil moisture significantly improved AGB estimation: the RMSEs of the models constructed with the five indices decreased by 12.93%, 7.31%, 8.33%, 6.28% and 10.78%, respectively, after the LST variable was added, and when LST and soil moisture were both added, the RMSE decreased by 31.47%. This study demonstrates that combining nonlinear variable selection with optimized RF regression can improve the efficiency of AGB estimation to support regional forest resource management and monitoring.
•The optimized random forest regressions are proposed for AGB estimation and mapping.
•Nonlinear and linear variable selection methods are compared.
•Land surface temperature and soil moisture can significantly improve AGB estimation.
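The variable selection scheme compared here (rank candidate predictors by a linear or nonlinear association index, then feed the top-ranked ones to an RF) translates to a few lines of code. The sketch below is a generic illustration using SciPy and scikit-learn, not the authors' code; the distance correlation and importance indices are only noted in the comments, and the RF hyperparameters are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau, spearmanr
from sklearn.ensemble import RandomForestRegressor

def rank_by_index(X, y, index="spearman"):
    """Rank candidate Landsat-derived predictors against field-measured AGB.
    'pearson' is the linear index; 'kendall'/'spearman' are two of the four
    nonlinear ones (distance correlation would need e.g. the dcor package,
    and the importance index comes from RF feature_importances_).
    """
    stat = {"pearson": pearsonr, "kendall": kendalltau, "spearman": spearmanr}[index]
    scores = np.array([abs(stat(X[:, j], y)[0]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1]            # column indices, strongest first

def fit_optimized_rf(X, y, order, k=10):
    """Fit an RF on the k top-ranked variables."""
    keep = order[:k]
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               random_state=0).fit(X[:, keep], y)
    return rf, keep
```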
Swarm-based algorithms have emerged as a powerful family of optimization techniques inspired by the collective behavior of social animals. In particle swarm optimization (PSO), the set of candidate solutions to the optimization problem is defined as a swarm of particles that flow through the parameter space, defining trajectories driven by their own and their neighbors' best performances. In the present paper, the potential of particle swarm optimization for solving various kinds of optimization problems in chemometrics is shown through an extensive description of the algorithm (highlighting the importance of a proper choice of its metaparameters) and by means of selected worked examples in the fields of signal warping, estimation of robust PCA solutions and variable selection.
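For readers who want the core update rule in front of them, here is a bare-bones PSO in NumPy. It is a generic sketch (the function name and default values are ours), where w, c1 and c2 are exactly the metaparameters whose choice the paper emphasizes: the inertia weight and the cognitive and social acceleration coefficients.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimal particle swarm optimizer for a scalar objective f(x)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))       # particle positions
    v = np.zeros((n_particles, dim))                  # particle velocities
    pbest = x.copy()                                  # each particle's best position
    pbest_f = np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()              # swarm's global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # inertia + attraction to personal best + attraction to global best
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, pbest_f.min()

# e.g. pso_minimize(lambda p: np.sum(p**2), dim=5)
```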
In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR), the expected fraction of false discoveries among all discoveries, is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap (their construction does not require any new data) and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high.
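The selection step of the filter is compact enough to show directly. Given knockoff statistics W_j (positive when the original variable beats its knockoff, e.g. a Lasso coefficient difference), the data-dependent threshold below implements the knockoff+ rule; constructing the knockoff matrix itself (the equi-correlated or SDP construction) is omitted, and this sketch is ours rather than the authors' code.

```python
import numpy as np

def knockoff_plus_threshold(W, q=0.1):
    """Smallest t with (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Selecting {j : W_j >= t} then controls the FDR at level q."""
    for t in np.sort(np.abs(W[W != 0])):      # candidate thresholds, ascending
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf                             # no feasible threshold: select nothing

# usage: selected = np.flatnonzero(W >= knockoff_plus_threshold(W, q=0.1))
```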
In recent years, the significance of machine learning in agriculture has surged, particularly in post-harvest monitoring for sustainable aquaculture. Challenges such as heterogeneity, irrelevant variables and multicollinearity hinder the implementation of smart monitoring systems. This study investigates heterogeneity among the drying parameters that determine moisture content removal during seaweed drying, an issue that has received limited attention, particularly in agriculture. Additionally, a heterogeneity model within machine learning algorithms is proposed to enhance the accuracy of predicting seaweed moisture content removal, both before and after the removal of heterogeneity parameters, and after the reintroduction of single-eliminated heterogeneity parameters. The dataset consists of 1914 observations with 29 independent variables, of which this study uses five: temperature (T1, T4, T7), humidity (H5), and solar radiation (PY). These variables are interacted up to second order, resulting in 55 variables. Variance inflation factors and boxplots are employed to identify heterogeneity parameters. Two predictive machine learning models, random forest and elastic net, are then used to identify the 15 and 20 most important parameters for seaweed moisture content removal. Evaluation metrics (MSE, SSE, MAPE, and R-squared) are used to assess model performance. The results demonstrate that the random forest model outperforms the elastic net model, with higher accuracy and lower error, both before and after removing heterogeneity parameters, and even after reintroducing single-eliminated heterogeneity parameters. Notably, the random forest model exhibits higher accuracy before heterogeneity parameters are excluded.
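One concrete piece of this pipeline, flagging multicollinear parameters by their variance inflation factors, can be sketched as follows. This uses statsmodels' variance_inflation_factor; the cutoff of 10 is a common rule of thumb, not the paper's stated value, and the function name is ours.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def flag_high_vif(df: pd.DataFrame, cutoff: float = 10.0) -> dict:
    """Return {column: VIF} for every interaction term whose VIF exceeds
    the cutoff, i.e. candidates to drop before refitting RF/elastic net."""
    X = df.to_numpy(dtype=float)
    vifs = {col: variance_inflation_factor(X, j)
            for j, col in enumerate(df.columns)}
    return {c: v for c, v in vifs.items() if v > cutoff}
```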
Direct quantification analysis of near-infrared (NIR) spectra is challenging because the number of spectral variables is usually considerably higher than the number of samples. To mitigate the so-called curse of dimensionality, variable selection is often performed before multivariate calibration. There has been much work in this regard; the developed variable selection methods can be categorized as individual variable selection, such as uninformative variable elimination or variable importance in projection, and continuous interval variable selection, such as interval partial least squares or moving window partial least squares. In this study, a new individual variable selection method, modified simulated annealing (MSA), was proposed and used in conjunction with a partial least squares regression (PLSR) model. The interpretability of the selected variables in the determination of aflatoxin B1 levels in white rice was assessed. The results revealed that the PLSR model combined with MSA not only yielded higher accuracy than the full-spectrum PLSR but also successfully shrank the variable space. The simplified PLSR model using MSA performed satisfactorily, with root mean square errors of calibration (RMSEC) of 0.11 μg/kg and 0.56 μg/kg and root mean square errors of prediction (RMSEP) of 7.16 μg/kg and 14.42 μg/kg for the low- and high-aflatoxin-B1-level samples, respectively. Specifically, compared with the full-spectrum PLSR (low aflatoxin: RMSEC = 5.02 μg/kg, RMSEP = 12.93 μg/kg; high aflatoxin: RMSEC = 13.50 μg/kg, RMSEP = 38.53 μg/kg), the MSA-based models yielded improvements of 97.80% (calibration set) and 44.62% (prediction set) for the low-level dataset and 95.85% (calibration set) and 62.57% (prediction set) for the high-level dataset. Compared with the baseline simulated annealing (SA) method (low aflatoxin: RMSEC = 0.21 μg/kg, RMSEP = 9.78 μg/kg; high aflatoxin: RMSEC = 12.27 μg/kg, RMSEP = 38.53 μg/kg), MSA significantly improved the predictive performance of the regression models, with the number of selected variables being almost half of that in SA. Comparisons with other commonly used variable selection methods, namely the selectivity ratio (low aflatoxin: RMSEC = 6.09 μg/kg, RMSEP = 13.75 μg/kg; high aflatoxin: RMSEC = 13.74 μg/kg, RMSEP = 41.13 μg/kg), uninformative variable elimination (low aflatoxin: RMSEC = 0.32 μg/kg, RMSEP = 5.11 μg/kg; high aflatoxin: RMSEC = 3.80 μg/kg, RMSEP = 17.76 μg/kg), and variable importance in projection (low aflatoxin: RMSEC = 2.67 μg/kg, RMSEP = 10.71 μg/kg; high aflatoxin: RMSEC = 13.51 μg/kg, RMSEP = 32.53 μg/kg), also indicated the promising efficacy of the proposed MSA.
•The modified simulated annealing (MSA) improved the accuracy of the regression model.
•The proposed MSA reduced the complexity of the regression model.
•Coupling MSA with the partial least squares model outperformed the alternatives.
•The number of variables selected by MSA was almost 50% of that of the baseline method.
•Near-infrared spectroscopy successfully predicted aflatoxin levels in rice samples.
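Since the paper's specific modifications are what make MSA, only the plain SA baseline can be sketched responsibly. The version below (ours, with illustrative settings) flips one wavelength in or out per step and accepts worse subsets with the Metropolis probability under a geometric cooling schedule, scoring each subset by cross-validated PLSR error.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def sa_select(X, y, n_iter=500, t0=1.0, cooling=0.99, seed=0):
    """Plain simulated-annealing wavelength selection for PLSR (baseline SA,
    not the proposed MSA): flip one wavelength per step, accept uphill moves
    with probability exp(-dE/T), cool T geometrically."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    mask = rng.random(p) < 0.5                     # random initial subset
    mask[:2] = True                                # ensure a non-trivial subset

    def cv_rmse(m):
        pls = PLSRegression(n_components=min(10, int(m.sum())))
        return -cross_val_score(pls, X[:, m], y, cv=5,
                                scoring="neg_root_mean_squared_error").mean()

    cur = cv_rmse(mask)
    best, best_mask, t = cur, mask.copy(), t0
    for _ in range(n_iter):
        cand = mask.copy()
        cand[rng.integers(p)] ^= True              # add or drop one wavelength
        if cand.sum() >= 2:
            e = cv_rmse(cand)
            if e < cur or rng.random() < np.exp((cur - e) / t):   # Metropolis rule
                mask, cur = cand, e
                if e < best:
                    best, best_mask = e, mask.copy()
        t *= cooling                               # geometric cooling schedule
    return best_mask
```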
We introduce a new estimator for the vector of coefficients β in the linear model y = Xβ + z, where X has dimensions n × p with p possibly larger than n. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to

$$\min_{b \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xb\|_{\ell_2}^2 + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \cdots + \lambda_p |b|_{(p)},$$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$ and $|b|_{(1)} \geq |b|_{(2)} \geq \cdots \geq |b|_{(p)}$ are the decreasing absolute values of the entries of b. This is a convex program, and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical ℓ₁ procedures such as the Lasso. Here, the regularizer is a sorted ℓ₁ norm, which penalizes the regression coefficients according to their rank: the higher the rank, that is, the stronger the signal, the larger the penalty. This is similar to the Benjamini and Hochberg procedure (BH) [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300], which compares more significant p-values with more stringent thresholds. One notable choice of the sequence {λᵢ} is given by the BH critical values $\lambda_{\mathrm{BH}}(i) = z(1 - i \cdot q/(2p))$, where q ∈ (0, 1) and z(α) is the α-th quantile of the standard normal distribution. SLOPE aims to provide finite-sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with λ_BH provably controls the FDR at level q. Moreover, it also appears to have appreciable inferential properties under more general designs X while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.
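Two pieces of this abstract translate directly to code: the BH-style λ sequence and the sorted-ℓ₁ penalty it weights. The sketch below (ours, using SciPy's normal quantile) evaluates the SLOPE objective for a given b; the proximal solver itself is omitted.

```python
import numpy as np
from scipy.stats import norm

def bh_lambdas(p, q=0.1):
    """lambda_BH(i) = z(1 - i*q/(2p)) for i = 1..p, with z() the standard
    normal quantile; a decreasing sequence, as SLOPE requires."""
    i = np.arange(1, p + 1)
    return norm.ppf(1 - i * q / (2 * p))

def slope_objective(b, X, y, lam):
    """(1/2)||y - Xb||_2^2 plus the sorted-l1 penalty sum_i lam_i * |b|_(i)."""
    # pair the largest |b| with the largest lambda, per the sorted-l1 definition
    penalty = np.sum(lam * np.sort(np.abs(b))[::-1])
    return 0.5 * np.sum((y - X @ b) ** 2) + penalty
```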
•A novel method, DPP-VSE, is designed to construct good variable selection ensembles.
•Discrete DPPs are utilized to infer a probability distribution of model size.
•A sample from the distribution specifies the number of variables selected by each member.
•Simulated and real-data experiments are conducted to study DPP-VSE's performance.
•DPP-VSE outperforms its rivals in most cases and has fewer parameters to specify.
As an effective tool for analyzing high-dimensional data, variable selection plays an increasingly important role in many fields. In recent years, variable selection ensembles (VSEs) have attracted considerable interest from researchers due to their great potential to improve selection accuracy and to stabilize the results of traditional selection methods. Inspired by a common practice of Bayesian methods, we propose in this paper a novel technique named DPP-VSE that builds a VSE by utilizing determinantal point processes (DPPs) to infer a distribution of model size. By sampling from this distribution, DPP-VSE automatically determines the number of variables for each base learner to select; in contrast to other VSE strategies, it has fewer parameters for users to specify. Experiments conducted on both synthetic and real data illustrate that DPP-VSE performs best under most circumstances when evaluated with several metrics. Hence, DPP-VSE can be seen as an effective and easy-to-use method for solving variable selection problems.
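A toy version of the ensemble mechanics makes the idea concrete: each member draws its model size from a discrete distribution and selects that many variables. In the sketch below (ours), an arbitrary discrete distribution stands in for the DPP-inferred one, and lasso-coefficient ranking on a bootstrap sample stands in for the base learner.

```python
import numpy as np
from sklearn.linear_model import Lasso

def vse_sketch(X, y, size_probs, n_members=100, seed=0):
    """Variable-selection-ensemble sketch: each member draws its model size k
    from a discrete distribution over {1, ..., len(size_probs)} (a stand-in
    for the DPP-inferred one), then keeps the k variables with the largest
    |lasso coefficients| on a bootstrap sample. Final importance is the
    selection frequency across members."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sizes = np.arange(1, len(size_probs) + 1)
    votes = np.zeros(p)
    for _ in range(n_members):
        k = rng.choice(sizes, p=size_probs)        # sampled model size
        idx = rng.integers(0, n, n)                # bootstrap resample
        coef = Lasso(alpha=0.05).fit(X[idx], y[idx]).coef_
        votes[np.argsort(np.abs(coef))[::-1][:k]] += 1
    return votes / n_members                       # per-variable importance
```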