VSN: Variable sorting for normalization
Rabatel, Gilles; Marini, Federico; Walczak, Beata
Journal of Chemometrics, February 2020, Volume 34, Issue 2
Journal Article, Peer reviewed
Spectrometric and analytical techniques in general collect multivariate signals from chemical or biological materials by means of a specific measurement instrumentation, usually in order to characterize or classify them through the estimation of one or several compounds of interest. However, measurement conditions might induce various additive (baseline) or multiplicative effects on the collected signals, which may jeopardize the accuracy and generalizability of estimation models. A common way of dealing with such issues is signal normalization and, in particular, when the baseline is constant, the standard normal variate (SNV) transform. Despite its efficiency, SNV has important drawbacks in terms of physical interpretation and robustness of estimation models, because all the variables are considered equally, independently of their actual relationship with the response(s) of interest. In the present study, a novel algorithm is proposed, named variable sorting for normalization (VSN). For a given set of multivariate signals, this algorithm automatically produces a weighting function favoring signal variables that are impacted only by additive and multiplicative effects, and not by the response(s) of interest. When introduced into SNV preprocessing, this weighting function significantly improves signal shape and model interpretation. Moreover, VSN can be successfully used not only with constant baselines but also with more complex ones, such as polynomial baselines. Together with the description of the theory behind VSN, its application to various synthetic multivariate data, as well as to real SWIR spectral data, is presented and discussed.
A common way of dealing with variations in measurement conditions is signal normalization, which may have important drawbacks in terms of physical interpretation and robustness of models. In the present study, a novel algorithm is proposed. It automatically produces a weighting function favoring variables that are impacted only by additive and multiplicative effects. When introduced into normalization preprocessing, this weighting function significantly improves signal shape and model interpretation.
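The SNV transform discussed above, and the weighted variant that a VSN-type weighting function enables, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the weighting vector `w` is assumed to be given (e.g., produced by VSN):

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd

def weighted_snv(X, w):
    """SNV with a per-variable weighting w, so that the mean and standard
    deviation are dominated by variables carrying only additive and
    multiplicative effects (the role VSN's weighting function plays)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu = (X * w).sum(axis=1, keepdims=True)
    var = (w * (X - mu) ** 2).sum(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var)
```

With uniform weights, `weighted_snv` reduces exactly to `snv`; both map any pair of spectra related by an additive offset and a multiplicative gain onto the same corrected signal.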
With the development of technology and the relatively higher availability of new instrumentation, multiblock data sets (e.g., a set of samples analyzed by different analytical techniques) are becoming more and more common and, as a consequence, how to handle such data is a widely discussed topic. In this context, where the number of involved variables is relatively high, selecting the most significant features is clearly relevant. For this reason, the possibility of joining a multiblock regression method, sequential and orthogonalized partial least squares (SO-PLS), with a variable selection approach called covariance selection (CovSel) has been investigated. The resulting method, sequential and orthogonalized covariance selection (SO-CovSel), is similar to SO-PLS, but the feature reduction provided by PLS is performed by CovSel. Finally, predictions are made by applying multiple linear regression to the subset of selected variables. The novel approach has been tested on different multiblock data sets both in regression and in classification (by combination with LDA), and it has been compared with another state-of-the-art multiblock method. SO-CovSel has proven suitable for its purpose: it provides good predictions (both in regression and in classification) and, from the interpretation point of view, it leads to a meaningful selection of the original variables.
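The CovSel step at the core of SO-CovSel (greedy selection of the variable with the largest covariance with the response, followed by deflation) can be sketched as follows. This is a minimal single-block illustration, not the published implementation:

```python
import numpy as np

def covsel(X, Y, n_sel):
    """Covariance selection (CovSel) sketch: greedily pick the variable
    with the largest squared covariance with the response(s), then deflate
    X and Y with respect to the chosen column before the next pick."""
    X = np.asarray(X, dtype=float).copy()
    Y = np.asarray(Y, dtype=float).reshape(X.shape[0], -1).copy()
    X = X - X.mean(axis=0)          # column-center both blocks
    Y = Y - Y.mean(axis=0)
    selected = []
    for _ in range(n_sel):
        cov2 = ((X.T @ Y) ** 2).sum(axis=1)   # squared covariance per variable
        j = int(np.argmax(cov2))
        selected.append(j)
        xj = X[:, [j]]
        P = xj @ xj.T / (xj.T @ xj)           # projector onto the chosen column
        X = X - P @ X                          # deflate X ...
        Y = Y - P @ Y                          # ... and Y
    return selected
```

The selected columns would then feed a multiple linear regression, as the abstract describes.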
Even though NIR spectroscopy is based on the Beer–Lambert law, which clearly relates the concentration of the absorbing elements to the absorbance, the measured spectra are subject to spurious signals, such as additive and multiplicative effects. The use of NIR spectra therefore requires a preprocessing step. This article reviews the main preprocessing methods in the light of aquaphotomics. Simple methods for visualizing the spectra are proposed in order to guide the user in the choice of the best preprocessing. The most common chemometric preprocessing methods are presented and illustrated on three real datasets. Some preprocessing methods aim to produce a spectrum as close as possible to the absorbance that would have been measured under ideal conditions and are very useful for the establishment of an aquagram. Others, dedicated to improving the resolution of the spectra, are very useful for the identification of peaks. Finally, special attention is given to the problem of reducing multiplicative effects and to the potential pitfalls of some very popular methods in chemometrics. Alternatives proposed in recent papers are presented.
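As an example of the multiplicative-effect corrections such a review covers, multiplicative scatter correction (MSC) can be sketched as below. This is a minimal NumPy illustration (the reference spectrum defaults to the mean spectrum, a common choice):

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction sketch: regress each spectrum on a
    reference spectrum and remove the fitted offset (additive effect) and
    slope (multiplicative effect)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, float)
    out = np.empty_like(X)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)   # fit x ≈ a + b * ref
        out[i] = (x - a) / b           # undo offset and gain
    return out
```

Spectra that differ from the reference only by an offset and a gain are all mapped back onto the reference.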
In multivariate calibration, locally weighted partial least squares regression (LWPLSR) is an efficient prediction method when heterogeneity of the data generates nonlinear relations (curvatures and clustering) between the response and the explanatory variables. This is frequent in agronomic data sets that gather materials of different natures or origins. LWPLSR is a particular case of weighted PLSR (WPLSR; i.e., a statistical weight different from the standard 1/n is given to each of the n calibration observations for calculating the PLS scores/loadings and the predictions). In LWPLSR, the weights depend on the dissimilarity (which has to be defined and calculated) to the new observation to predict. This article compares two strategies of LWPLSR: (a) “LW”: the usual strategy where, for each new observation to predict, a WPLSR is applied to the n calibration observations (i.e., the entire calibration set) vs. (b) “KNN‐LW”: a number of k nearest neighbors to the observation to predict are preliminarily selected in the training set and WPLSR is applied only to this selected KNN set. On three illustrative agronomic data sets (quantitative and discrimination predictions), both strategies outperformed standard PLSR. LW and KNN‐LW had close prediction performances, but KNN‐LW was much faster in computation time. The KNN‐LW strategy is therefore recommended for large data sets. The article also presents a new algorithm for WPLSR, based on the “improved kernel #1” algorithm, which competes with, and is in general faster than, the already published weighted PLS nonlinear iterative partial least squares (NIPALS).
Locally weighted partial least squares regression (LWPLSR) is a particular case of weighted PLSR (WPLSR) where the weights, given to the calibration observations for calculating the PLS scores/loadings and the prediction, depend on the dissimilarity to the new observation to predict. This article compares two strategies of LWPLSR: (a) “LW”: the usual LWPLSR strategy vs. (b) “KNN‐LW”: a number of k nearest neighbors to the observation to predict are preliminarily selected and WPLSR is applied only to these neighbors.
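The KNN-LW strategy can be sketched as follows. This is a simplified illustration (Euclidean distance, Gaussian statistical weights, and a basic weighted PLS1), not the article's "improved kernel #1" algorithm; `h` is a hypothetical bandwidth parameter:

```python
import numpy as np

def knn_lw_predict(Xcal, ycal, xnew, k=30, ncomp=3, h=1.0):
    """KNN-LW sketch: select the k nearest calibration neighbours of xnew,
    weight them with a Gaussian kernel of the distances, and fit a weighted
    PLS1 model on that local subset only."""
    Xcal = np.asarray(Xcal, dtype=float)
    ycal = np.asarray(ycal, dtype=float).ravel()
    xnew = np.asarray(xnew, dtype=float).ravel()
    dist = np.linalg.norm(Xcal - xnew, axis=1)
    idx = np.argsort(dist)[:k]                     # KNN selection
    X, y, d = Xcal[idx], ycal[idx], dist[idx]
    w = np.exp(-0.5 * (d / (h * d.mean() + 1e-12)) ** 2)  # statistical weights
    w = w / w.sum()
    xm, ym = w @ X, w @ y                          # weighted centering
    Xc, yc = X - xm, y - ym
    xc_new = xnew - xm
    pred = ym
    for _ in range(ncomp):                         # weighted PLS1 components
        a = Xc.T @ (w * yc)                        # weighted covariance direction
        na = np.linalg.norm(a)
        if na < 1e-12:
            break
        a = a / na
        t = Xc @ a
        tt = w @ (t * t)
        if tt < 1e-12:
            break
        p = Xc.T @ (w * t) / tt                    # loadings
        q = (w * yc) @ t / tt                      # regression on the score
        tnew = xc_new @ a
        pred = pred + q * tnew
        Xc = Xc - np.outer(t, p)                   # deflation
        yc = yc - q * t
        xc_new = xc_new - tnew * p
    return pred
```

The "LW" strategy corresponds to setting `k` to the full calibration-set size; restricting to the k neighbours is what makes KNN-LW faster on large data sets.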
Near-infrared (NIR) and mid-IR spectroscopy applied to soil compositional analysis started to develop markedly in the 1990s, taking advantage of earlier advances in instrumentation and chemometrics for agricultural products. Today, NIR spectroscopy is envisioned as replacing laboratory analysis in certain applications (e.g., soil-carbon-credit assessment at the farm level). However, accuracy is still unsatisfactory compared with standard laboratory procedures, leading some authors to think that such a challenge will never be met.
This article investigates the critical points to be aware of when the accuracy of NIR-based measurements is assessed. First is the decomposition of the standard error of prediction into components of bias and variance, only the latter being reducible by averaging. This decomposition is not used routinely in the soil-science literature. Second, a log-normal distribution of reference values is very often encountered with soil samples, e.g., elemental concentrations (e.g., carbon) with numerous small or zero values. These very skewed distributions call for precautions when using inverse regression methods (e.g., principal component regression or partial least squares), which force the predictions towards the centre of the calibration set, with negative effects on the standard error of prediction, and therefore on prediction accuracy, especially when log-normal distributions are encountered. Such distributions, which are very common for soil components, also make the ratio of performance to deviation a useless, even hazardous, tool, leading to erroneous conclusions.
We propose a new index based on the quartiles of the empirical distribution, the ratio of performance to inter-quartile distance (RPIQ), to overcome this problem.
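The bias/variance decomposition of the prediction error and a quartile-based performance index can be sketched as follows. One common convention is shown; definitions of SEP vary in the degrees of freedom used, and the exact form of the proposed index should be taken from the article itself:

```python
import numpy as np

def prediction_stats(y_ref, y_pred):
    """Decompose the prediction error into bias and variance components:
    RMSEP^2 = bias^2 + SEP^2, with SEP the standard deviation of the
    residuals (only SEP is reducible by averaging). Also computes a
    quartile-based index IQ / RMSEP, with IQ = Q3 - Q1 of the reference
    values, in the spirit of the RPIQ."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    e = y_pred - y_ref
    bias = e.mean()                          # systematic component
    sep = e.std(ddof=0)                      # variance component
    rmsep = np.sqrt(bias ** 2 + sep ** 2)    # equals sqrt(mean(e**2))
    q1, q3 = np.percentile(y_ref, [25, 75])
    rpiq = (q3 - q1) / rmsep
    return {"bias": bias, "SEP": sep, "RMSEP": rmsep, "RPIQ": rpiq}
```

A purely biased predictor (constant offset) has SEP = 0 and RMSEP = |bias|, which is exactly the case averaging cannot fix.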
The present study is about the modulations of the aerodynamic broadband noise heard from slowly rotating rotors with few blades. It is aimed at producing a fast-running prediction tool that could be used to assess the nuisance of Vertical-Axis Wind Turbines (VAWT) in the context of urban installation. The simpler case of a skipping rope, which involves only part of the modulation effect, is addressed as a first step. The rope is split into segments for which an instantaneous sound-radiation model exists, to yield an overall spectrogram. Finally, its time signature is reconstructed by additive synthesis. After inspection of existing databases from oscillatory-airfoil experiments, data representative of the dynamic stall of VAWT blades are identified as the missing block needed to reconstruct a synthetic spectrogram.
Mallows's Cp and the Akaike information criterion (AIC) are common criteria for selecting the dimensionality of regression models, as an alternative to cross‐validation (CV) and the nonparametric bootstrap. A key parameter in the calculation of Cp and AIC is the effective number of degrees of freedom of the model, or model complexity (d). The parameter d is generally easy to calculate for linear smoothers, that is, models for which the prediction of the training response y is given by ŷ = Sy, where S is a projector matrix that does not involve y. Nevertheless, d is more difficult to estimate for nonlinear smoothers, such as partial least squares regression (PLSR). In this article, we present two algorithms for estimating d for PLSR based on Monte Carlo simulation methods (parametric bootstrap and perturbation analysis), with particular attention to the case of high-dimensional data. We compare these Monte Carlo methods with three other algorithms already published. We used the d estimates to compute Cp and AIC and to select PLSR model dimensionalities, which we then compare to CV. Two real and heterogeneous agronomic near-infrared (NIR) datasets are considered as examples.
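For a linear smoother ŷ = Sy, the complexity d = trace(S) is direct to compute. A minimal illustration using ridge regression as the linear smoother (the Cp form shown is one common variant; it is not one of the article's Monte Carlo estimators for PLSR):

```python
import numpy as np

def ridge_smoother_df(X, lam):
    """Effective degrees of freedom of a linear smoother yhat = S y,
    here with S = X (X'X + lam I)^-1 X' (ridge regression):
    d = trace(S). With lam = 0 and full-rank X, d equals the number
    of predictors."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

def mallows_cp(y, yhat, d, sigma2):
    """One common form of Mallows's Cp: RSS / sigma2 - n + 2 d,
    where sigma2 is an estimate of the noise variance."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    rss = ((y - yhat) ** 2).sum()
    return rss / sigma2 - len(y) + 2.0 * d
```

For PLSR, S depends on y, so trace(S) is no longer available in closed form; that is the gap the Monte Carlo estimators above are meant to fill.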
Inflammation in the context of Human Immunodeficiency Virus (HIV) infection is established early and persists beyond antiretroviral therapy (ART). Accordingly, we have shown excess B-cell activating factor (BAFF) in the blood of HIV-infected progressors, as early as the acute phase, and despite successful ART. Excess BAFF was associated with deregulation of the B-cell compartment, notably with increased frequencies of a population sharing features of both transitional immature (TI) and marginal zone (MZ) B-cells, which we termed marginal zone precursor-like (MZp). We have reported similar observations with HIV-transgenic mice, Simian Immunodeficiency Virus (SIV)-infected macaques, and, more recently, HIV-infected Beninese commercial sex workers, which suggests that excess BAFF and increased frequencies of MZp B-cells are reliable markers of inflammation in the context of HIV. Importantly, we have recently shown that in healthy individuals, MZp cells present an important regulatory B-cell (Breg) profile and function. Herein, we review our current knowledge on MZ B-cell populations, especially their Breg status, and that of other B-cell populations sharing similar features. BAFF and its analog A Proliferation-Inducing Ligand (APRIL) are important in shaping the MZ B-cell pool; moreover, the impact that excess BAFF, encountered in the context of HIV and several chronic inflammatory conditions, may exert on MZ B-cell populations and their Breg and antibody-producing capacities is a threat to the integrity of their antibody responses and immune surveillance functions. As such, deregulation of MZ B-cell populations contributes to autoimmune manifestations and the development of MZ lymphomas (MZLs) in the context of HIV and other inflammatory diseases.
Therefore, further comprehending the mechanisms regulating MZ B-cell populations and their functions could benefit innovative therapeutic avenues that could be deployed to restore MZ B-cell immune competence in the context of chronic inflammation involving excess BAFF.
Calibration transfer is an essential activity in analytical chemistry in order to avoid a complete recalibration. Currently, the most popular calibration transfer methods, such as piecewise direct standardization and dynamic orthogonal projection, require a certain number of standard or reference samples to guarantee their effectiveness. To achieve higher efficiency, it is desirable to perform the transfer with as few reference samples as possible.
To this end, we propose a new calibration transfer method by using a calibration database from a master instrument (source domain) and only one spectrum with known properties from a slave instrument (target domain). We first generate a counterpart of this spectrum in the source domain by a multivariate Gaussian kernel. Then, we train a filter to make the response function of the slave instrument equivalent to that of the master instrument. To avoid the need for labels from the target domain, we also propose an unsupervised way to implement our method. Compared with several state-of-the-art methods, the results on one simulated dataset and two real-world datasets demonstrate the effectiveness of our method.
Traditionally, the demand for a certain number of reference samples during calibration transfer is cumbersome. Our approach, which requires only one reference sample, makes the transfer process simple and fast. In addition, we provide an alternative for performing unsupervised calibration transfer. As such, the proposed method is a promising tool for calibration transfer.
• A new calibration transfer method is introduced, which requires only one spectrum from the target domain.
• The calibration transfer method can be conducted in an unsupervised way.
• A multivariate Gaussian kernel is introduced to generate a virtual sample in the source domain.
• Results from simulated and real-world datasets demonstrate the effectiveness of our method.
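The kernel-weighted virtual-sample idea can be sketched loosely as follows. This is an illustration only, not the authors' method: the counterpart spectrum is built as a Gaussian-kernel weighted average of the master calibration set, and a simple per-wavelength gain stands in for the paper's filter (`gamma` and `ridge` are hypothetical parameters):

```python
import numpy as np

def gaussian_counterpart(X_master, x_slave, gamma=1.0):
    """Build a counterpart of the single slave spectrum in the master
    (source) domain as a Gaussian-kernel weighted average of the master
    calibration spectra."""
    X = np.asarray(X_master, dtype=float)
    x = np.asarray(x_slave, dtype=float).ravel()
    d2 = ((X - x) ** 2).sum(axis=1)
    k = np.exp(-gamma * d2 / d2.mean())    # kernel similarity to each master spectrum
    k = k / k.sum()
    return k @ X                           # virtual master-domain sample

def per_channel_filter(x_virtual, x_slave, ridge=1e-6):
    """Fit a diagonal (per-wavelength gain) filter mapping the slave
    response onto the virtual master response, by ridge-regularized
    per-channel least squares."""
    x_virtual = np.asarray(x_virtual, dtype=float)
    x_slave = np.asarray(x_slave, dtype=float)
    return x_virtual * x_slave / (x_slave ** 2 + ridge)
```

New slave spectra would then be multiplied channel-wise by the fitted gains before being fed to the master calibration model.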
Field measurement using NIR spectroscopy is becoming a popular method to provide in situ, rapid, and inexpensive estimation of soil organic carbon (SOC) content. However, NIR reflectance is quite sensitive to external environmental conditions, such as temperature and soil moisture. In the field, the soil moisture content can be highly variable. It is a challenge to find a chemometric method that allows prediction of soil organic carbon from spectra obtained under field conditions while remaining insensitive to variable moisture content. This paper uses an external parameter orthogonalisation (EPO) algorithm to remove the effect of soil moisture from NIR spectra for the calibration of SOC content. The algorithm projects all the soil spectra orthogonally to the space of the unwanted variation, and thus the variations in soil moisture can be effectively removed. We designed a protocol with 3 independent datasets to be used for calibration of NIR spectra: (1) the calibration dataset, which contains soil samples with spectra and SOC content measured under standard (laboratory) conditions (air-dried); (2) the EPO development dataset, which contains spectra under laboratory conditions (air-dried samples) and spectra collected under field conditions (varying soil moisture content); and (3) the validation dataset, which contains spectra collected under field conditions and measured SOC content. We conducted experiments using soils at different moisture contents under laboratory conditions. Using the EPO algorithm, we were able to remove the effect of soil moisture from the spectra, which resulted in improved calibration and prediction of SOC content.
► The external parameter orthogonalisation (EPO) technique can remove the effect of soil moisture from soil near-infrared spectra.
► EPO-preprocessed spectra predict soil organic carbon content more accurately, independently of soil moisture.
► A protocol with 3 independent datasets was designed for calibration of field NIR spectra.
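The EPO projection itself is compact. A minimal sketch, assuming difference spectra D (e.g., moist minus dry measurements of the same samples, as the EPO development dataset provides) are available; the number of unwanted components is a user choice:

```python
import numpy as np

def epo_projection(D, n_components):
    """External parameter orthogonalisation sketch: from difference
    spectra D capturing the unwanted (moisture) variation, extract its
    main directions V by SVD and build the projector P = I - V V' that
    removes them from any spectrum."""
    D = np.asarray(D, dtype=float)
    _, _, Vt = np.linalg.svd(D - D.mean(axis=0), full_matrices=False)
    V = Vt[:n_components].T                 # (p, c) basis of unwanted subspace
    return np.eye(D.shape[1]) - V @ V.T

# usage: X_corrected = X @ P, computed before fitting the SOC calibration model
```

After projection, any component of a spectrum lying in the moisture subspace is zeroed, while the orthogonal (compositional) part is untouched.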