With advances in instrumentation and a growing range of valuable applications, near-infrared (NIR) spectroscopy has become a mature analytical technique in various fields. Variable (wavelength) selection is a critical step in multivariate calibration of NIR spectra, which can improve the prediction performance, make the calibration reliable and provide simpler interpretation. During the last several decades, a large number of variable selection methods have been proposed for NIR spectroscopy. In this paper, we generalize variable selection methods in a simple manner to introduce their classifications, merits and drawbacks, and to provide a better understanding of their characteristics, similarities and differences. We also introduce some hybrid and modified methods, highlighting their improvements. Finally, we summarize the limitations of existing variable selection methods and provide our remarks and suggestions on their further development, to promote the development of NIR spectroscopy.
• Generalize variable selection methods in a simple manner to provide a better understanding of their characteristics.
• Introduce their modified and hybrid methods and highlight their improvements.
• Summarize the limitations and mention seven aspects of the problem affecting the existing variable selection methods.
• Provide our remarks and suggestions on trends in the development of variable selection methods for NIR spectra.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPCLJ, UPUK, ZAGLJ, ZRSKP
We introduce the ProSGPV R package, which implements a variable selection algorithm based on second-generation p-values (SGPV) instead of traditional p-values. Most variable selection algorithms shrink point estimates to arrive at a sparse solution. In contrast, the ProSGPV algorithm accounts for the estimation uncertainty, via confidence intervals, in the selection process. This additional information leads to better inference and prediction performance in finite sample sizes. ProSGPV maintains good performance even in the high-dimensional case where $p>n$, or when explanatory variables are highly correlated. Moreover, ProSGPV is a unifying algorithm that works with continuous, binary, count, and time-to-event outcomes. No cross-validation or iterative processes are needed, and thus ProSGPV is very fast to compute. Visualization tools are available in this package for assessing the variable selection process. Here we present simulation studies and a real-world example to demonstrate ProSGPV's inference and prediction performance in relation to the current standards in variable selection procedures.
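As a rough illustration of how an interval-based criterion differs from shrinking point estimates, the sketch below computes a second-generation p-value from a confidence interval and an interval null hypothesis, using the published SGPV definition (the fraction of the interval that overlaps the null, with a correction factor for very wide intervals). The function names `sgpv` and `select_by_sgpv`, the example intervals, and the null bound of ±0.1 are illustrative assumptions. This is not the ProSGPV package's API, and the package's actual selection procedure is not reproduced here.

```python
def sgpv(ci_low, ci_high, null_low, null_high):
    """Second-generation p-value of an interval estimate against an interval null.

    Returns 0 when the interval lies entirely outside the null region,
    1 when it lies entirely inside, and 0.5 at most when the interval is
    more than twice as wide as the null region (inconclusive data).
    """
    ci_len = ci_high - ci_low
    null_len = null_high - null_low
    if ci_len <= 0 or null_len <= 0:
        raise ValueError("both intervals must have positive length")
    # Length of the intersection of the two intervals (0 if disjoint).
    overlap = max(0.0, min(ci_high, null_high) - max(ci_low, null_low))
    # Correction factor: only differs from 1 when ci_len > 2 * null_len.
    correction = max(ci_len / (2.0 * null_len), 1.0)
    return (overlap / ci_len) * correction


def select_by_sgpv(intervals, null_low, null_high):
    """Keep the variables whose interval has no overlap with the null region.

    `intervals` maps a variable name to its (lower, upper) confidence bounds.
    This single-pass filter is a simplification for illustration only.
    """
    return [
        name
        for name, (low, high) in intervals.items()
        if sgpv(low, high, null_low, null_high) == 0.0
    ]


if __name__ == "__main__":
    # Hypothetical coefficient intervals for three variables.
    example = {"x1": (0.5, 1.5), "x2": (-0.05, 0.05), "x3": (-1.0, 1.0)}
    for name, (low, high) in example.items():
        print(name, sgpv(low, high, -0.1, 0.1))
    print(select_by_sgpv(example, -0.1, 0.1))
```

In this toy input, `x1` is clearly away from the null, `x2` is clearly inside it, and `x3` is too imprecise to decide either way, which is the distinction a point estimate alone cannot express.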
Distributed penalized generalized linear regression algorithms have been widely studied in recent years. However, they all assume that the data are randomly distributed. In real applications, this assumption is not necessarily true, since the full dataset is often stored in a non-random manner. To tackle this issue, a non-convex penalized distributed pilot sample surrogate negative log-likelihood learning procedure is developed, which realizes distributed high-dimensional variable selection for generalized linear models and adapts to non-random situations. The established theoretical results and numerical studies both validate the proposed method.
The ongoing digitalisation of the district heating sector, particularly the installation of smart heat meters (SHMs), is generating data of unprecedented extent and temporal resolution. This data offers potential insights into heat energy use at a large scale, supporting policymakers and district heating utility companies in transforming the building sector. Clustering is crucial for representing this wealth of data in human-understandable groups, which requires seasonality to be taken into account.
Advancing current research in clustering SHM data, this work applies an established co-clustering approach, FunLBM, considering seasonal variation without fixed season definitions. Furthermore, to enhance the understanding of differentiating factors between clusters, the possibility to understand cluster memberships based on 26 building characteristics was analysed using classification and variable selection methods.
Applying FunLBM to a large-scale hourly dataset from single-family houses revealed six well-separated energy use clusters, each distributed over six temporal clusters, which are correlated with the exterior temperature yet do not follow fixed seasons. Variable selection and classification showed that building characteristics describing the building with a high level of detail are insufficient to explain cluster membership (Matthews correlation coefficient (MCC) ≈ 0.3).
By merging the energy use clusters based on profile and magnitude similarities, classification performance significantly improved (MCC ≈ 0.5). In both cases, simple and readily available building characteristics yield similar insights to detailed ones, emphasising their cost-effectiveness and practicality.
• Co-clustering of smart heat meter data to establish season-independent clusters.
• Analysis of energy use clusters based on 26 building characteristics.
• Classification and variable selection to identify the minimum information needed.
• Statistical data leads to the same insight as detailed building data.
• Prediction of energy use clusters with building characteristics has a low accuracy.
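To make the reported MCC values easier to read, the following minimal sketch computes the Matthews correlation coefficient for the binary case from confusion-matrix counts. The study itself classifies several clusters, for which a multi-class generalization of MCC would be needed, so the function name `matthews_corrcoef_binary` and the example counts are illustrative assumptions rather than the authors' computation.

```python
import math


def matthews_corrcoef_binary(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts: 60 % of each class is predicted correctly.
    print(matthews_corrcoef_binary(tp=6, fp=4, fn=4, tn=6))
```

With balanced classes, 60 % accuracy corresponds to an MCC of only 0.2, which gives a feel for why values around 0.3 are read as weak explanatory power.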
In the meat industry, it is essential to monitor the total volatile basic nitrogen (TVB-N) content due to its impact on dietary safety. In the present study, two-dimensional correlation spectroscopy (2D-COS) was combined with dual-band visible/near-infrared spectroscopy to identify the characteristic wavelengths of TVB-N in pork meat during storage. Reflectance spectra of pork longissimus dorsi muscles at 350–1100 nm and 1000–2500 nm were obtained during a 13-day storage period. The dual-band spectra were merged via parabolic fitting and pre-treated using absorbance conversion, first derivative (FD), and standard normal variate (SNV) transformation. Then, the storage day was used as the external perturbation factor, and 2D-COS was performed on the dynamic spectra to identify TVB-N-related wavebands. Twenty-five characteristic wavelengths were selected, and a simplified partial least squares regression (PLSR) model was established with a correlation coefficient (Rp) of 0.9591, root mean square error of prediction (RMSEP) of 1.8341 mg/100 g, and relative percent deviation (RPD) of 3.2518. These results outperformed those of the model based on full-waveband data (Rp = 0.9215, RMSEP = 2.8191 mg/100 g, RPD = 2.1114). This study accurately predicted TVB-N values in pork meat, and the integration of 2D-COS and dual-band spectra shows great potential for real-time meat freshness evaluation.
• 2D-COS was employed to identify feature variables for TVB-N prediction in pork.
• Dual-band visible/near-infrared spectroscopy was used to improve model performance.
• 25 characteristic wavelengths were selected to build a simplified PLSR model.
• TVB-N content was well predicted with Rp of 0.9591 and RPD of 3.2518.
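The three figures of merit quoted above can be sketched as follows. The code assumes the definitions commonly used in NIR calibration: Rp as the Pearson correlation between reference and predicted values, RMSEP as the root mean squared prediction error, and RPD as the standard deviation of the reference values divided by RMSEP. The abstract does not state its formulas, so these definitions, the use of the sample standard deviation (n − 1), the function name `prediction_metrics`, and the example values are all assumptions.

```python
import math


def prediction_metrics(y_ref, y_pred):
    """Return (Rp, RMSEP, RPD) for paired reference and predicted values.

    Assumes RPD = SD(reference) / RMSEP with the sample standard deviation.
    Raises ZeroDivisionError if either series is constant or the
    prediction is perfect, since the ratios are then undefined.
    """
    n = len(y_ref)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_ref = sum(y_ref) / n
    mean_pred = sum(y_pred) / n
    # Sums of squares and cross-products around the means.
    s_rp = sum((r - mean_ref) * (p - mean_pred) for r, p in zip(y_ref, y_pred))
    s_rr = sum((r - mean_ref) ** 2 for r in y_ref)
    s_pp = sum((p - mean_pred) ** 2 for p in y_pred)
    rp = s_rp / math.sqrt(s_rr * s_pp)
    rmsep = math.sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / n)
    rpd = math.sqrt(s_rr / (n - 1)) / rmsep
    return rp, rmsep, rpd


if __name__ == "__main__":
    # Hypothetical TVB-N reference and predicted values in mg/100 g.
    reference = [10.0, 20.0, 30.0, 40.0]
    predicted = [12.0, 18.0, 32.0, 38.0]
    rp, rmsep, rpd = prediction_metrics(reference, predicted)
    print(f"Rp={rp:.4f}  RMSEP={rmsep:.4f}  RPD={rpd:.4f}")
```

Under this definition, multiplying RPD by RMSEP recovers the spread of the reference values. The reported pairs give 3.2518 × 1.8341 ≈ 5.96 and 2.1114 × 2.8191 ≈ 5.95 mg/100 g, which is consistent with both models having been evaluated on the same prediction set.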
In recent years, modern spectral analysis techniques, such as ultraviolet–visible (UV-vis) spectroscopy, mid-infrared (MIR) spectroscopy, near-infrared (NIR) spectroscopy, Raman spectroscopy, terahertz (THz) spectroscopy, nuclear magnetic resonance (NMR) spectroscopy, and laser-induced breakdown spectroscopy (LIBS), have developed rapidly and have been widely applied in fields such as agriculture, food, pharmaceuticals, petroleum, the chemical industry, tobacco, environmental protection and medical science. A common feature of all these techniques is that they extract as much useful chemical information as possible from the spectral data with the aid of chemometric methods, with the aim of significantly improving both the robustness and the accuracy of analytical results. Against the general background of developments in artificial intelligence, big data, cloud computing, and other technologies, the emergence of novel ideas, approaches, and strategies endows chemometrics with new vitality. Chemometrics has become a research focus and hotspot in various fields, especially in spectral analysis. This article reviews the chemometric methods applied in modern spectral analysis over the last ten years, especially from the perspective of practicability, including spectral pre-processing, wavelength (variable) selection, data dimension reduction, quantitative calibration, pattern recognition, calibration transfer, calibration maintenance, and multispectral data fusion. More importantly, future trends in chemometric methods in the field of spectral analysis are also discussed. It is hoped that this review will give specialists and scholars in spectroscopy and chemometrics some inspiration to accelerate the rapid evolution of modern spectral analysis techniques.
• Chemometric methods used in modern spectroscopy over the last ten years are reviewed.
• Related issues of chemometric methods in practical applications are discussed.
• Future trends in chemometric methods in modern spectroscopy are discussed.
Wood is the main feedstock source for the pulp and paper industry. However, chemical composition variations in multispecies and multisource feedstock heavily affect production continuity and stability. As a rapid and non-destructive analysis technique, near-infrared (NIR) spectroscopy provides an alternative for on-line analysis of wood properties and feedstock quality control. Herein, NIR spectroscopy coupled with partial least squares (PLS) regression was used to predict the holocellulose and lignin contents of various wood species, including poplars, eucalyptus and acacias. In order to obtain more accurate and robust prediction models, several variable selection methods for optimizing NIR spectral variables were compared, including competitive adaptive reweighted sampling (CARS), Monte Carlo-uninformative variable elimination (MC-UVE), the successive projections algorithm (SPA), and the genetic algorithm (GA). The results indicated that the CARS method was relatively more efficient than the other methods both in eliminating uninformative variables and in enhancing the predictive performance of the models. CARS-PLS models showed significantly higher robustness and accuracy for each property while using the lowest number of variables in cross-validation and external validation, demonstrating their applicability and reliability for predicting multispecies feedstock properties.
• Holocellulose and lignin content of multispecies hardwoods were predicted accurately by NIR spectroscopy.
• CARS method effectively enhanced the accuracy and robustness of NIR prediction models.
• An efficient and concise quantitative analysis tool was proposed to guide future pulp feedstock quality assessment.
This paper discusses regression analysis of case K interval-censored failure time data, a general type of failure time data, in the presence of informative censoring, with a focus on simultaneous variable selection and estimation. Although many authors have considered the challenging variable selection problem for interval-censored data, most of the existing methods assume independent or non-informative censoring. More importantly, the existing methods that allow for informative censoring are frailty model-based approaches and, among other shortcomings, cannot directly assess the degree of informative censoring. To address these issues, we propose a conditional approach and develop a penalized sieve maximum likelihood procedure for the simultaneous variable selection and estimation of covariate effects. Furthermore, we establish the oracle property of the proposed method and illustrate the appropriateness and usefulness of the approach using a simulation study. Finally, we apply the proposed method to a set of real data on Alzheimer's disease and provide some new insights.