VSN: Variable sorting for normalization
Rabatel, Gilles; Marini, Federico; Walczak, Beata ...
Journal of Chemometrics, February 2020, Volume 34, Issue 2
Journal Article
Peer-reviewed
Spectrometric and analytical techniques in general collect multivariate signals from chemical or biological materials by means of a specific measurement instrumentation, usually in order to characterize or classify them through the estimation of one or several compounds of interest. However, measurement conditions might induce various additive (baseline) or multiplicative effects on the collected signals, which may jeopardize the accuracy and generalizability of estimation models. A common way of dealing with such issues is signal normalization and in particular, when the baseline is constant, the standard normal variate (SNV) transform. Despite its efficiency, SNV has important drawbacks in terms of physical interpretation and robustness of estimation models, because all the variables are considered equally, independently of their actual relationship with the response(s) of interest. In the present study, a novel algorithm is proposed, named variable sorting for normalization (VSN). This algorithm automatically produces, for a given set of multivariate signals, a weighting function favoring signal variables that are impacted only by additive and multiplicative effects, and not by the response(s) of interest. When introduced into SNV preprocessing, this weighting function significantly improves signal shape and model interpretation. Moreover, VSN can be successfully used not only with constant baselines but also with more complex ones, such as polynomial baselines. Together with the description of the theory behind VSN, its application to various synthetic multivariate data, as well as to real SWIR spectral data, is presented and discussed.
A common way of dealing with variations in measurement conditions is signal normalization, which may have important drawbacks in terms of physical interpretation and robustness of models. In the present study, a novel algorithm is proposed. It automatically produces a weighting function favoring variables that are impacted only by additive and multiplicative effects. When introduced into normalization preprocessing, this weighting function significantly improves signal shape and model interpretation.
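The SNV transform at the heart of this work, and the weighted variant that a VSN-style weighting function plugs into, can be stated compactly. Below is a minimal sketch in Python (numpy), assuming the per-variable weight vector w is already available; the VSN procedure that estimates w automatically is not reproduced here.

import numpy as np

def snv(X):
    # Standard normal variate: centre and scale each spectrum (row)
    # by its own mean and standard deviation.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def weighted_snv(X, w):
    # SNV with a per-variable weight vector w (such as the one VSN
    # estimates), so that variables carrying only additive and
    # multiplicative effects drive the centring and scaling.
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float) / np.sum(w)
    mu = X @ w                                   # weighted mean per spectrum
    sd = np.sqrt(((X - mu[:, None]) ** 2) @ w)   # weighted std per spectrum
    return (X - mu[:, None]) / sd[:, None]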
With the development of technology and the relatively higher availability of new instrumentations, having multiblock data sets (eg, a set of samples analyzed by different analytical techniques) is becoming more and more common and, as a consequence, how to handle this kind of data is a widely discussed topic. In such a context, where the number of involved variables is relatively high, selecting the most significant features is obviously relevant. For this reason, the possibility of joining a multiblock regression method, the sequential and orthogonalized partial least-squares (SO-PLS), with a variable selection approach called covariance selection (CovSel), has been investigated. The resulting method, sequential and orthogonalized covariance selection (SO-CovSel), is similar to SO-PLS, but the feature reduction provided by PLS is performed by CovSel. Finally, predictions are made by applying multiple linear regression on the subset of selected variables. The novel approach has been tested on different multiblock data sets both in regression and in classification (by combination with LDA), and it has been compared with another state-of-the-art multiblock method. SO-CovSel has proven suitable for its purpose: It has provided good predictions (both in regression and in classification) and, from the interpretation point of view, it has led to a meaningful selection of the original variables.
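The CovSel step that replaces the PLS reduction can be sketched in a few lines. This is a simplified single-block illustration in Python (numpy); the sequential multiblock orthogonalization of SO-CovSel and the final multiple linear regression on X[:, selected] are not reproduced, and the function name covsel is illustrative.

import numpy as np

def covsel(X, Y, n_sel):
    # CovSel sketch: repeatedly pick the variable with the largest
    # squared covariance with the response(s), then deflate X and Y
    # by projection onto the selected variable. Deflation zeroes out
    # the selected column, so it cannot be picked again.
    X = np.array(X, dtype=float)
    X -= X.mean(axis=0)
    Y = np.array(Y, dtype=float).reshape(len(X), -1)
    Y -= Y.mean(axis=0)
    selected = []
    for _ in range(n_sel):
        crit = ((X.T @ Y) ** 2).sum(axis=1)   # squared covariance per variable
        j = int(np.argmax(crit))
        selected.append(j)
        xj = X[:, [j]]
        denom = float(xj.T @ xj)
        X = X - xj @ (xj.T @ X) / denom       # orthogonalize remaining data
        Y = Y - xj @ (xj.T @ Y) / denom
    return selected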
Even though NIR spectroscopy is based on the Beer–Lambert law, which clearly relates the concentration of the absorbing elements with the absorbance, the measured spectra are subject to spurious signals, such as additive and multiplicative effects. The use of NIR spectra, therefore, requires a preprocessing step. This article reviews the main preprocessing methods in the light of aquaphotomics. Simple methods for visualizing the spectra are proposed in order to guide the user in the choice of the best preprocessing. The most common chemometric preprocessing methods are presented and illustrated by three real datasets. Some preprocessing methods aim to produce a spectrum as close as possible to the absorbance that would have been measured under ideal conditions and are very useful for the establishment of an aquagram. Others, dedicated to improving the resolution of the spectra, are very useful for the identification of peaks. Finally, special attention is given to the problem of reducing multiplicative effects and to the potential pitfalls of some very popular methods in chemometrics. Alternatives proposed in recent papers are presented.
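Two of the standard corrections covered by reviews of this kind can be illustrated briefly. A minimal Python (numpy/scipy) sketch, not code from the article: multiplicative scatter correction (MSC) against the mean spectrum, and a Savitzky-Golay first derivative for baseline removal and improved peak resolution.

import numpy as np
from scipy.signal import savgol_filter

def msc(X, reference=None):
    # Multiplicative scatter correction: regress each spectrum on a
    # reference (here the mean spectrum) and remove offset and slope.
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else reference
    out = np.empty_like(X)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)   # x is approximately a + b * ref
        out[i] = (x - a) / b
    return out

def first_derivative(X, window=11, poly=2):
    # Savitzky-Golay first derivative: removes additive baselines
    # and sharpens overlapping peaks.
    return savgol_filter(X, window_length=window, polyorder=poly,
                         deriv=1, axis=1)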
In spectroscopy, multivariate calibrations more often than not include a pre-processing step to reduce the effect of unwanted (not Y-related) sources of variability. Because there are many types of background noise, there are many pre-treatment methods. It is therefore tedious to select and/or combine the best pre-treatments. This article proposes to combine several pre-treatments through the use of sequential and orthogonalized partial least squares (SO-PLS), thus leading to a boosting method. The performances and properties of this new method, called Sequential Preprocessing through ORThogonalization (SPORT), are compared to those of a previously published stacking method. SPORT demonstrates very good calibration performances, but also the ability to make significant pre-treatment selections. A minimal sketch of this sequential orthogonalization is given after the highlights below.
• Associating spectra preprocessing via an ensemble method has been studied.
• A sequential multi-block method allowed complementary information to be extracted.
• The proposed method implements a boosting approach.
• The performance outperformed that reported for stacking approaches.
• The method also allows the best preprocessing to be selected.
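As referenced above, here is a fit-only sketch of the SO-PLS backbone on which SPORT rests, in Python (numpy/scikit-learn). It is a reconstruction under stated assumptions, not the authors' code: each block holds the same spectra under a different pre-treatment, later blocks are orthogonalized against the scores of earlier ones, and predicting new samples would additionally require storing the orthogonalization operators, which is omitted here.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def sport_fit(blocks, y, n_comp=5):
    # Each pre-treatment can only contribute information that the
    # earlier pre-treatments did not already explain.
    y_res = np.asarray(y, dtype=float).reshape(-1, 1).copy()
    models, scores = [], []
    for X in blocks:
        X = np.asarray(X, dtype=float)
        Xo = X - X.mean(axis=0)
        for T in scores:
            Xo = Xo - T @ (np.linalg.pinv(T) @ Xo)   # remove what earlier blocks modelled
        pls = PLSRegression(n_components=n_comp).fit(Xo, y_res)
        scores.append(pls.transform(Xo))
        y_res = y_res - pls.predict(Xo)              # pass on the unexplained part
        models.append(pls)
    return models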
In multivariate calibrations, locally weighted partial least squares regression (LWPLSR) is an efficient prediction method when heterogeneity of data generates nonlinear relations (curvatures and clustering) between the response and the explanatory variables. This is frequent in agronomic data sets that gather materials of different natures or origins. LWPLSR is a particular case of weighted PLSR (WPLSR; ie, a statistical weight different from the standard 1/n is given to each of the n calibration observations for calculating the PLS scores/loadings and the predictions). In LWPLSR, the weights depend on the dissimilarity (which has to be defined and calculated) to the new observation to predict. This article compares two strategies of LWPLSR: (a) "LW": the usual strategy where, for each new observation to predict, a WPLSR is applied to the n calibration observations (ie, the entire calibration set) vs (b) "KNN-LW": a number of k nearest neighbors to the observation to predict are preliminarily selected in the training set and WPLSR is applied only to this selected KNN set. On three illustrative agronomic data sets (quantitative and discrimination predictions), both strategies outperformed the standard PLSR. LW and KNN-LW had close prediction performances, but KNN-LW was much faster in computation time. The KNN-LW strategy is therefore recommended for large data sets. The article also presents a new algorithm for WPLSR, on the basis of the "improved kernel #1" algorithm, which is a competitor to, and in general faster than, the already published weighted PLS nonlinear iterative partial least squares (NIPALS).
Locally weighted partial least squares regression (LWPLSR) is a particular case of weighted PLSR (WPLSR) where the weights, given to the calibration observations for calculating the PLS scores/loadings and the prediction, depend on the dissimilarity to the new observation to predict. This article compares two strategies of LWPLSR: (a) "LW": the usual LWPLSR strategy vs (b) "KNN-LW": a number of k nearest neighbors to the observation to predict are preliminarily selected and WPLSR is applied only to these neighbors.
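The KNN-LW idea can be sketched with a plain (unweighted) local PLSR, ie, uniform weights inside each neighbourhood; the dissimilarity-based weighting and the kernel WPLSR algorithm of the article are not reproduced. A minimal Python (numpy/scikit-learn) sketch:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def knn_lw_predict(Xcal, ycal, Xnew, k=50, n_comp=10):
    # For each new sample, fit a local PLSR on its k nearest
    # calibration neighbours (Euclidean distance) and predict.
    Xnew = np.atleast_2d(Xnew)
    preds = np.empty(len(Xnew))
    for i, x in enumerate(Xnew):
        d = np.linalg.norm(Xcal - x, axis=1)
        idx = np.argsort(d)[:k]              # the k nearest neighbours
        pls = PLSRegression(n_components=min(n_comp, k - 1))
        pls.fit(Xcal[idx], ycal[idx])
        preds[i] = float(pls.predict(x[None, :]).ravel()[0])
    return preds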
Data generated by analytical instruments, such as spectrometers, may contain unwanted variation due to measurement mode, sample state and other external physical, chemical and environmental factors. Preprocessing is required so that the property of interest can be predicted correctly. Different correction methods may remove specific types of artefacts while still leaving some effects behind. Using multiple preprocessing methods in a complementary way can remove the artefacts that would be left behind by using only one technique. This article summarizes the recent developments in new data preprocessing strategies and specifically reviews the emerging ensemble approaches to preprocessing fusion in chemometrics. A demonstration case is also presented. In summary, ensemble preprocessing allows the selection of several techniques and their combinations that, in a complementary way, lead to improved models. Ensemble approaches are not limited to spectral data but can be used in all cases where preprocessing is needed and identification of a single best option is not easily done. A sketch of the simplest ensemble variant is given after the highlights below.
• New developments in the domain of data pre-processing are summarized.
• Several new approaches to pre-processing optimization are discussed and compared.
• Different preprocessings, such as scatter correction methods, carry complementary information.
• Ensemble fusion allows the use of complementary information to boost chemometric models.
• Multi-block data analysis-based ensemble approaches are superior to other ensemble approaches.
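As referenced above, the simplest ensemble variant fits one model per pre-treatment and averages the predictions. A minimal Python (numpy/scikit-learn) sketch under that assumption; published stacking approaches instead learn the combination weights from cross-validated predictions.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def ensemble_preprocess_predict(pretreatments, Xcal, ycal, Xnew, n_comp=5):
    # One PLSR per pre-treatment, predictions averaged.
    # `pretreatments` is a list of functions mapping a spectral matrix
    # to its corrected version (e.g. SNV, MSC, derivatives).
    preds = []
    for f in pretreatments:
        pls = PLSRegression(n_components=n_comp).fit(f(Xcal), ycal)
        preds.append(pls.predict(f(Xnew)).ravel())
    return np.mean(preds, axis=0)   # simple average; stacking would learn weights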
• A new approach to NIR spectroscopy data processing is presented.
• Different scatter correction techniques carry complementary information.
• Pre-processing selection is no longer necessary in NIR spectroscopy.
• Fusion of scatter correction techniques is essential in NIR spectroscopy.
Near-infrared (NIR) spectra of pharmaceutical tablets are affected by light scattering phenomena, which mask the underlying peaks related to chemical components. Often the best performing scatter correction technique is selected from a pool of pre-selected techniques. However, the data corrected with different techniques may carry complementary information; hence, use of a single scatter correction technique is sub-optimal. In this study, the aim is to prove that NIR models related to pharmaceuticals can directly benefit from the fusion of complementary information extracted from multiple scatter correction techniques. To perform the fusion, sequential and parallel pre-processing fusion approaches were used. Two different open-source NIR data sets were used for the demonstration, where prediction of assay uniformity and active ingredient (AI) content was the aim. As a baseline, the fusion approach was compared to partial least-squares regression (PLSR) performed on standard normal variate (SNV) corrected data, SNV being a commonly used scatter correction technique. The results suggest that multiple scatter correction techniques extract complementary information and their complementary fusion is essential to obtain high-performance predictive models. In this study, the prediction error and bias were reduced by up to 15% and 57%, respectively, compared to PLSR performed on SNV-corrected data.
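One common parallel fusion variant is low-level fusion: concatenating the outputs of several scatter corrections before a single PLSR. A minimal Python (numpy/scikit-learn) sketch under that assumption; the sequential alternative orthogonalizes the blocks instead of concatenating them, and the correction functions in the usage comment (snv, msc) are hypothetical placeholders, for instance those sketched earlier in this list.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def fused_block(X, corrections):
    # Apply each scatter correction to the same spectra and concatenate
    # the results column-wise, so one PLSR model can exploit their
    # complementary information.
    return np.hstack([f(np.asarray(X, dtype=float)) for f in corrections])

# Hypothetical usage, with snv and msc as correction functions:
# pls = PLSRegression(n_components=8).fit(fused_block(Xcal, [snv, msc]), ycal)
# y_hat = pls.predict(fused_block(Xtest, [snv, msc]))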
It is commonly accepted that species should move toward higher elevations and latitudes to track shifting isotherms as climate warms. However, temperature might not be the only limiting factor determining species distribution. Species might move in opposite directions to track changes in other climatic variables. Here, we used an extensive occurrence data set and an ensemble modelling approach to model the climatic niche and to predict the distribution of the seven baobab species (genus Adansonia) present in Madagascar. Using climatic projections from three global circulation models, we predicted species' future distribution and extinction risk for 2055 and 2085 under two representative concentration pathways (RCPs) and two dispersal scenarios. We disentangled the role of each climatic variable in explaining species range shifts by looking at relative variable importance and future climatic anomalies. Four baobab species (Adansonia rubrostipa, Adansonia madagascariensis, Adansonia perrieri, and Adansonia suarezensis) could experience a severe range contraction in the future (>70% for year 2085 under RCP 8.5, assuming a zero-dispersal hypothesis). For three out of the four threatened species, range contraction was mainly explained by an increase in temperature seasonality, especially in the North of Madagascar, where they are currently distributed. In tropical regions, where species are commonly adapted to low seasonality, we found that temperature seasonality will generally increase. It is, thus, very likely that many species in the tropics will be forced to move equatorward to avoid an increase in temperature seasonality. Yet, several ecological (e.g., equatorial limit, or unsuitable deforested habitat) or geographical barriers (absence of land) could prevent species from moving equatorward, thus increasing the extinction risk of many tropical species, like the endemic baobab species in Madagascar.
We show that four of the seven baobab species existing in Madagascar are threatened with extinction because of climate change. Among the four threatened species, three are adapted to low seasonality and should experience a dramatic range contraction by 2100 because of a strong increase in temperature seasonality. These three baobab species are expected to move equatorward to track the change in temperature seasonality. We also show that a strong increase in temperature seasonality is expected throughout the tropics. Consequently, many tropical species adapted to low temperature seasonality should be forced to move equatorward to find a suitable climate. However, ecological and geographical barriers could impede tropical species dispersal and put them at risk of extinction.
Near-infrared (NIR) and mid-IR spectroscopy applied to soil compositional analysis started to develop markedly in the 1990s, taking advantage of earlier advances in instrumentation and chemometrics for agricultural products. Today, NIR spectroscopy is envisioned as replacing laboratory analysis in certain applications (e.g., soil-carbon-credit assessment at the farm level). However, accuracy is still unsatisfactory compared with standard laboratory procedures, leading some authors to think that such a challenge will never be met.
This article investigates the critical points to be aware of when the accuracy of NIR-based measurements is assessed. First is the decomposition of the standard error of prediction into components of bias and variance, only the latter being reducible by averaging. This decomposition is not used routinely in the soil-science literature. Second, a log-normal distribution of reference values is very often encountered with soil samples, e.g., elemental concentrations (e.g., carbon) with numerous small or zero values. These very skewed distributions call for precautions when using inverse regression methods (e.g., principal component regression or partial least squares), which force the predictions towards the centre of the calibration set, with negative effects on the standard error of prediction – and therefore on prediction accuracy – especially when log-normal distributions are encountered. Such distributions, which are very common for soil components, also make the ratio of performance to deviation a useless, even hazardous, tool, leading to erroneous conclusions.
We propose a new index based on the quartiles of the empirical distribution – the ratio of performance to inter-quartile distance (RPIQ) – to overcome this problem.
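The two quantities discussed above reduce to short formulas. A minimal Python (numpy) sketch, assuming the common definitions: the error decomposition RMSEP^2 = bias^2 + SEP^2, and RPIQ as the inter-quartile distance of the reference values divided by the prediction error (some definitions use SEP rather than RMSEP in the denominator).

import numpy as np

def error_decomposition(y_true, y_pred):
    # Split prediction error into bias and variance parts:
    # RMSEP**2 = bias**2 + SEP**2 (exactly so when SEP uses the
    # population variance; ddof=1 follows the usual SEP convention).
    # Only the variance part is reducible by averaging replicates.
    e = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    bias = e.mean()
    sep = e.std(ddof=1)
    rmsep = np.sqrt(np.mean(e ** 2))
    return bias, sep, rmsep

def rpiq(y_true, y_pred):
    # Ratio of performance to inter-quartile distance: robust to the
    # skewed (log-normal) reference distributions common in soil data.
    q1, q3 = np.percentile(np.asarray(y_true, dtype=float), [25, 75])
    _, _, rmsep = error_decomposition(y_true, y_pred)
    return (q3 - q1) / rmsep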
Direct-injection mass spectrometry (DIMS) techniques have evolved into powerful methods to analyse volatile organic compounds (VOCs) without the need of chromatographic separation. Combined with chemometrics, they have been used in many domains to solve sample categorization issues based on volatilome determination. In this paper, different DIMS methods that have largely outperformed conventional electronic noses (e-noses) in classification tasks are briefly reviewed, with an emphasis on food-related applications. Particular attention is paid to proton transfer reaction mass spectrometry (PTR-MS), and many results obtained using the powerful PTR-time of flight-MS (PTR-ToF-MS) instrument are reviewed. Data analysis and feature selection issues are also summarized and discussed. As a case study, a challenging problem of classification of dark chocolates, previously assessed by sensory evaluation into four distinct categories, is presented. The VOC profiles of a set of 206 chocolate samples classified in the four sensory categories were analysed by PTR-ToF-MS. A supervised multivariate data analysis based on partial least squares regression-discriminant analysis allowed the construction of a classification model that showed excellent prediction capability: 97% of a test set of 62 samples were correctly predicted in the sensory categories. Tentative identification of ions aided the characterisation of chocolate classes. Variable selection using dedicated methods pinpointed some volatile compounds important for the discrimination of the chocolates. Among them, the CovSel method was used for the first time on PTR-MS data, resulting in a selection of 10 features that allowed a good prediction to be achieved. Finally, challenges and future needs in the field are discussed.
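The PLS regression-discriminant analysis used in the case study amounts to PLSR on a dummy-coded class membership. A minimal Python (numpy/scikit-learn) sketch under that assumption, not the authors' pipeline:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer

def plsda_fit_predict(Xcal, labels, Xnew, n_comp=10):
    # PLS-DA sketch: one-hot encode the sensory categories, fit PLSR on
    # the dummy matrix, and assign each new sample to the class with the
    # largest predicted response.
    lb = LabelBinarizer().fit(labels)
    Y = lb.transform(labels)                      # (n_samples, n_classes)
    pls = PLSRegression(n_components=n_comp).fit(Xcal, Y)
    return lb.classes_[np.argmax(pls.predict(Xnew), axis=1)]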