► Omics studies generate massive data obtained from different analytical devices. ► Extracting knowledge from these multiple blocks is challenging. ► A generic methodology for Omics data fusion is proposed.
Omics approaches have proven their value in providing broad monitoring of biological systems. However, as no single analytical technique is sufficient to reveal the full biochemical content of complex biological matrices or biofluids, the fusion of information from several data sources has become a decisive issue. Omics studies generate an increasing amount of massive data obtained from different analytical devices. These data are usually high dimensional, and extracting knowledge from these multiple blocks is challenging. Appropriate tools are therefore needed to handle these datasets suitably. For that purpose, a generic methodology is proposed that combines the strengths of established data analysis strategies, i.e., multiple kernel learning and OPLS-DA, to offer an efficient tool for the fusion of Omics data obtained from multiple sources. Three real case studies are proposed to assess the potential of the method. A first example illustrates the fusion of mass spectrometry-based metabolomic data acquired in both negative and positive electrospray ionisation modes, from leaf samples of the model plant Arabidopsis thaliana. A second dataset involves the classification of wine grape varieties based on polyphenolic extracts analysed by two-dimensional heteronuclear magnetic resonance spectroscopy. A third case study underlines the ability of the method to combine heterogeneous data from systems biology with the analysis of publicly available data related to NCI-60 cancer cell lines from different tissue origins, which include metabolomics, transcriptomics and proteomics.
The fusion of Omics data from different sources is expected to provide a more complete view of biological systems. The proposed method proved to be a relevant and widely applicable alternative for efficiently handling the inherent characteristics of multiple Omics data, such as very large numbers of noisy collinear variables.
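The kernel-based fusion idea can be illustrated with a minimal sketch. The abstract only names multiple kernel learning and OPLS-DA, so everything specific here is an assumption made for illustration: the linear kernel, the trace normalisation, the equal block weights, the function names and the simulated blocks. The fused kernel would then be passed to a kernel-based OPLS-DA, which is not shown.

```python
import numpy as np

def linear_kernel(block):
    """Gram matrix (samples x samples) of one column-centred data block."""
    centred = block - block.mean(axis=0)
    return centred @ centred.T

def fuse_kernels(blocks):
    """Sum per-block linear kernels after scaling each to unit trace.

    Trace scaling (an assumption, not taken from the source) stops a block
    with many variables from dominating the fused kernel.
    """
    fused = np.zeros((blocks[0].shape[0], blocks[0].shape[0]))
    for block in blocks:
        kernel = linear_kernel(block)
        fused += kernel / np.trace(kernel)
    return fused

# Simulated stand-ins for two blocks measured on the same 12 samples
# (for example negative and positive ionisation modes); not real data.
rng = np.random.default_rng(0)
block_neg = rng.normal(size=(12, 500))
block_pos = rng.normal(size=(12, 2000))

K = fuse_kernels([block_neg, block_pos])
print(K.shape)
```

Working on sample-by-sample kernels keeps the fused object small even when each block holds thousands of collinear variables, which is the situation the abstract describes.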
Data generated by analytical instruments, such as spectrometers, may contain unwanted variation due to measurement mode, sample state and other external physical, chemical and environmental factors. Preprocessing is required so that the property of interest can be predicted correctly. Different correction methods may remove specific types of artefacts while still leaving some effects behind. Using multiple preprocessing techniques in a complementary way can remove the artefacts that would be left behind by using only one technique. This article summarizes the recent developments in new data preprocessing strategies and specifically reviews the emerging ensemble approaches to preprocessing fusion in chemometrics. A demonstration case is also presented. In summary, ensemble preprocessing allows the selection of several techniques and their combinations that, in a complementary way, lead to improved models. Ensemble approaches are not limited to spectral data but can be used in all cases where preprocessing is needed and a single best option is not easily identified.
•New developments in the domain of data pre-processing are summarized.•Several new approaches to pre-processing optimization are discussed and compared.•Different preprocessings such as scatter correction methods carry complementary information.•Ensemble fusion allows the use of complementary information to boost chemometrics models.•Multi-block data analysis-based ensemble approaches are superior to other ensemble approaches.
•Two approaches for correcting external influences from NIR data are presented.•Effects due to different instruments, temperatures and seasons are corrected.•All corrections were performed without any standard sample measurements.•Real datasets of olive, mango, and apple fruit were used for demonstration.•In all cases, the models after correction worked better on the new batch.
In near-infrared (NIR) spectroscopy of fresh fruit, external influences due to differences in physical, chemical and environmental conditions often lead to model failure. Correction methods are then required, in which standard samples are measured under all the different conditions and remodeling is performed. However, in the real world, it is often difficult to measure standard samples. To deal with this, two different approaches to correct for external influences without standard sample measurements, i.e., dynamic orthogonalization projection (DOP) and domain adaptation (DA), are presented and, for the first time, applied to NIR spectroscopy of fresh fruit. Four different case studies, chosen based on their importance and their frequency of occurrence in the NIR spectroscopy domain, were used for the demonstration. The first case was an adaptation to maintain the predictive performance of a model when used on spectra from a second, similar instrument. The second case was the correction of the temperature effects due to sensor heating. The third and fourth cases were about maintaining the model performance for multi-season fruit quality prediction models for mangos and for apples. In all of the cases, the aim was to solve the challenges without resorting to new measurements of standards. The results showed that, for all the cases, both DOP and DA improved model performance. Increases of up to 31% in R2p, and reductions of up to 98% and 66% in prediction bias and root mean squared error (RMSE) of prediction, respectively, were noted. The main benefit of the DOP and DA techniques in NIR spectroscopy is the limited need for standard measurements, providing general-purpose tools to complement NIR spectroscopy and make the models scalable, transferable, and reusable.
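The three figures of merit quoted above can be computed as in the following sketch. The definitions are common conventions rather than ones stated in the abstract (in particular, R2p is taken here as one minus the ratio of residual to total sum of squares), and the numbers are made up to show the arithmetic; they are not results from the study.

```python
import math

def prediction_metrics(y_true, y_pred):
    """Bias, RMSE of prediction and R2p for a set of test-set predictions."""
    n = len(y_true)
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "bias": sum(residuals) / n,        # mean signed error
        "rmsep": math.sqrt(ss_res / n),    # root mean squared error of prediction
        "r2p": 1.0 - ss_res / ss_tot,      # assumed definition of R2p
    }

def percent_reduction(before, after):
    """Relative reduction of a metric, in percent."""
    return 100.0 * (before - after) / before

# Hypothetical reference values and predictions before/after a correction.
y_ref = [14.0, 15.0, 16.0, 17.0]
pred_before = [15.0, 16.0, 17.0, 18.0]   # constant offset of +1
pred_after = [14.1, 14.9, 16.1, 16.9]

before = prediction_metrics(y_ref, pred_before)
after = prediction_metrics(y_ref, pred_after)
print(percent_reduction(before["rmsep"], after["rmsep"]))
print(percent_reduction(abs(before["bias"]), abs(after["bias"])))
```

Bias is compared through its absolute value because a correction can flip its sign without making the model worse.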
Among all the software packages available for discriminant analyses based on projection to latent structures (PLS-DA) or orthogonal projection to latent structures (OPLS-DA), SIMCA (Umetrics, Umeå, Sweden) is the most widely used in the metabolomics field. SIMCA proposes many parameters or tests to assess the quality of the computed model (the number of significant components, R2, Q2, pCV-ANOVA, and the permutation test). Significance thresholds for these parameters are strongly application-dependent. Concerning the Q2 parameter, a significance threshold of 0.5 is generally accepted. However, during the last few years, many PLS-DA/OPLS-DA models built using SIMCA have been published with Q2 values lower than 0.5. The purpose of this opinion note is to point out that, in some circumstances frequently encountered in metabolomics, the values of these parameters strongly depend on the individuals that constitute the validation subsets. As a result of the way in which the software selects members of the calibration and validation subsets, a simple permutation of dataset rows can, in several cases, lead to contradictory conclusions about the significance of the models when a K-fold cross-validation is used. We believe that, when Q2 values lower than 0.5 are obtained, SIMCA users should at least verify that the quality parameters are stable under permutation of the rows in their dataset.
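The stability check recommended above can be sketched as follows. This is not SIMCA's algorithm: the fold assignment by row position, the plain PLS1 (NIPALS) model on a 0/1 class vector standing in for PLS-DA, the Q2 definition (one minus PRESS over the total sum of squares) and the simulated data are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; return training means and the regression vector."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)        # weight vector
        t = E @ w                        # scores
        p = E.T @ t / (t @ t)            # X loadings
        c = (f @ t) / (t @ t)            # y loading
        E = E - np.outer(t, p)           # deflate X
        f = f - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return x_mean, y_mean, W @ np.linalg.solve(P.T @ W, q)

def q2_kfold(X, y, n_folds=7, n_components=2):
    """Q2 = 1 - PRESS / SS_total with folds fixed by row position."""
    fold_of_row = np.arange(len(y)) % n_folds   # assumed positional assignment
    press = 0.0
    for k in range(n_folds):
        held_out = fold_of_row == k
        x_mean, y_mean, b = pls1_fit(X[~held_out], y[~held_out], n_components)
        predicted = (X[held_out] - x_mean) @ b + y_mean
        press += np.sum((y[held_out] - predicted) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Simulated two-class data with a weak class effect (not real metabolomics data).
rng = np.random.default_rng(1)
n_samples, n_vars = 28, 60
y = np.repeat([0.0, 1.0], n_samples // 2)
X = rng.normal(size=(n_samples, n_vars))
X[:, :3] += 0.8 * y[:, None]

# Same data, same model, only the row order changes between runs.
q2_values = []
for _ in range(20):
    order = rng.permutation(n_samples)
    q2_values.append(q2_kfold(X[order], y[order]))

print(f"Q2 over row permutations: min {min(q2_values):.2f}, max {max(q2_values):.2f}")
```

A wide spread between the minimum and maximum would indicate that the conclusion drawn from a single Q2 value depends on how the rows happened to be ordered.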
Chemometrics pre-processing of spectral data is widely performed to enhance the predictive performance of near-infrared (NIR) models related to fresh fruit quality. Pre-processing approaches in the domain of NIR data analysis are used to remove the scattering effects, thus enhancing the absorption components related to the chemical properties. However, in the case of fresh fruit, both the scattering and absorption properties are of key interest as they jointly explain the physicochemical state of a fruit. Therefore, pre-processing that reduces the scattering information in the spectra may lead to poorly performing models. The objectives of this study are to test two hypotheses to explore the effect of pre-processing on NIR spectra of fresh fruit. The first hypothesis is that the pre-processing of NIR spectra with scatter correction techniques can reduce the predictive performance of models, as the scatter correction can reduce the useful scattering information correlated to the property of interest. The second hypothesis is that Deep Learning (DL) can model the raw absorbance data (a mix of scattering and absorption) much more efficiently than Partial Least Squares (PLS) regression analysis. To test the hypotheses, a real NIR data set related to dry matter (DM) prediction in mango fruit was used. The dataset consisted of a total of 11,420 NIR spectra and reference DM measurements for model training and independent testing. The chemometric pre-processing methods explored were standard normal variate (SNV), variable sorting for normalization (VSN), Savitzky-Golay based 2nd derivative and their combinations. Further, two modelling approaches, i.e., PLS regression and DL, were used to evaluate the effect of pre-processing. The results showed that the best root mean squared error of prediction (RMSEP) for both the PLS and DL models was obtained with the raw absorbance data.
In general, spectral pre-processing decreased the performance of both the PLS and DL models. Further, the DL model attained the lowest RMSEP of 0.76%, which was 13% lower than that of the PLS regression on the raw absorbance data. Pre-processing approaches should be used with care when analysing NIR data related to fresh fruit.
•Effect of chemometric pre-processing on NIR modelling of fresh fruit is presented.•Both chemometric and deep learning models were explored.•Chemometric pre-processing degraded NIR model.
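Two of the pre-processing steps named in this study, SNV and the Savitzky-Golay 2nd derivative, can be sketched with NumPy alone. The window length, polynomial order, unit wavelength spacing and function names are assumptions for illustration, and VSN is not covered. The demonstration uses synthetic quadratic curves, whose second derivative is known exactly, rather than real spectra.

```python
import math
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def savgol_coefficients(window, polyorder, deriv):
    """Savitzky-Golay weights from a least-squares polynomial fit over the window."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    design = np.vander(offsets, polyorder + 1, increasing=True)  # columns k^0, k^1, ...
    # Row `deriv` of the pseudo-inverse gives the fitted coefficient a_deriv;
    # the derivative of that order at the window centre is deriv! * a_deriv.
    return math.factorial(deriv) * np.linalg.pinv(design)[deriv]

def savgol_derivative(spectra, window=11, polyorder=2, deriv=2):
    """Apply the filter along each row; edge points without a full window are dropped."""
    coeffs = savgol_coefficients(window, polyorder, deriv)
    # np.convolve flips its second argument, so reverse the weights first.
    return np.array([np.convolve(row, coeffs[::-1], mode="valid") for row in spectra])

# Synthetic check: second derivatives of x^2 and 3x^2 + x are 2 and 6.
x = np.arange(30, dtype=float)
demo = np.vstack([x ** 2, 3 * x ** 2 + x])
second_derivative = savgol_derivative(demo)
scaled = snv(demo)
print(second_derivative.shape)
```

SNV rescales each spectrum by its own mean and standard deviation, which is exactly the kind of step that removes the scattering-related offset and slope information the study argues can be useful for fresh fruit.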
Independent components analysis (ICA) is a probabilistic method whose goal is to extract underlying component signals that are maximally independent and non-Gaussian from mixed observed signals. Since the data acquired in many applications in analytical chemistry are mixtures of component signals, such a method is of great interest. In this article, recent ICA applications for quantitative and qualitative analysis in analytical chemistry are reviewed. The following experimental techniques are covered: fluorescence, UV-VIS, NMR, vibrational spectroscopies as well as chromatographic profiles. Furthermore, we review ICA as a preprocessing tool as well as existing hybrid ICA-based multivariate approaches. Finally, further research directions are proposed. Our review shows that ICA is starting to play an important role in analytical chemistry, and this will definitely increase in the future.
•ICA applications for spectral and chromatographic data modeling are reviewed.•Applications for food, drug and environmental control are covered.•ICA plays an important role in data preprocessing.•Hybrid ICA-based approaches have a potential in analytical chemistry.
Anaerobic digestion (AD) is used to minimize solid waste while producing biogas by the action of microorganisms. To give an insight into the underlying microbial dynamics in anaerobic digesters, we investigated two different AD systems (wastewater sludge mixed with either fish or grass waste). The microbial activity was characterized by 16S rRNA sequencing. 16S data are sparse and dispersed, and existing data analysis methods take into account neither this complexity nor the potential microbial interactions. To this end, we propose a data pre-processing pipeline that addresses these issues without restricting the analysis to the most abundant microorganisms. The data were analyzed by Common Components Analysis (CCA) to decipher the effect of substrate composition on the microorganisms. CCA results hinted at relationships between the microorganisms responding similarly to the AD physicochemical parameters. Overall, CCA allowed a better understanding of the inter-species interactions within microbial communities.
The most commonly used technique to prepare samples for the analysis of wine volatiles is headspace solid-phase microextraction (HS-SPME). This method has gained popularity in the last few years, as it is a unique solventless preparation technique. In this paper, a summary of recently published studies using HS-SPME for the analysis of wine aromas, with special emphasis on the method developed, has been compiled. Several papers are discussed in detail, mainly with respect to the SPME conditions used. A brief summary of the reviews related to HS-SPME analysis is given and discussed. Several parameters affecting HS-SPME, such as the salt concentration and the agitation conditions, are set in the same way across several papers. HS-SPME extraction proved to be sufficiently sensitive to satisfy legislative requirements related to low detection and quantification limits as well as method accuracy and precision requirements. However, in order to achieve the best performance and precision, the protocol needs to be optimized for each case. The effect of the different parameters must be well characterized to ensure correct extraction and desorption, and thus the transfer of the extracted compounds into the analytical system. The operating parameters, such as time, temperature, and agitation, must then be kept constant for all the samples.
•Shearlet-based automatic de-noising method.•Method intelligently adapts to type of noise.•Supports use of HSI for process analysis.•Can deal with noise present in consecutive wavelengths.
Hyperspectral imaging (HSI) has become an essential tool for exploration of different spatially-resolved properties of materials in analytical chemistry. However, due to various technical factors such as detector sensitivity, choice of light source and experimental conditions, the recorded data contain noise. The presence of noise in the data limits the potential of different data processing tasks such as classification and can even make them ineffective. Therefore, reduction/removal of noise from the data is a useful step to improve the data modelling. In the present work, the potential of a wavelength-specific shearlet-based image noise reduction method was utilised for automatic de-noising of close-range HS images. The shearlet transform is a special type of composite wavelet transform that utilises the shearing properties of the images. The method first utilises the spectral correlation between wavelengths to distinguish between levels of noise present in different image planes of the data cube. Based on the level of noise present, the method adapts the use of the 2-D non-subsampled shearlet transform (NSST) coefficients obtained from each image plane to perform the spatial and spectral de-noising. Furthermore, the method was compared with two commonly used pixel-based spectral de-noising techniques, Savitzky-Golay (SAVGOL) smoothing and median filtering. The methods were compared using simulated data, with either Gaussian noise or Gaussian plus spike noise added, and real HSI data. As an application, the methods were tested to determine the efficacy of a visible-near infrared (VNIR) HSI camera to perform non-destructive automatic classification of six commercial tea products. De-noising with the shearlet-based method resulted in a visual improvement in the quality of the noisy image planes and the spectra of simulated and real HSI. The spectral correlation was highest with the shearlet-based method.
The peak signal-to-noise ratio (PSNR) obtained using the shearlet-based method was higher than that for SAVGOL smoothing and median filtering. There was a clear improvement in the classification accuracy of the SVM models for both the simulated and real HSI data that had been de-noised using the shearlet-based method. The method presented is a promising technique for automatic de-noising of close-range HS images, especially when the amount of noise present is high and in consecutive wavelengths.
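The PSNR comparison can be illustrated with a short sketch that uses one of the baseline methods named above, pixel-wise median filtering along the spectrum. The shearlet-based method itself is not reproduced. The PSNR convention (peak taken as the maximum of the reference signal), the window size, the function names and the synthetic spectrum with two artificial spikes are assumptions for illustration.

```python
import numpy as np

def psnr(reference, estimate, peak=None):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if peak is None:
        peak = reference.max()   # assumed convention for the peak value
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def median_filter_spectrum(spectrum, window=3):
    """Running median along one pixel spectrum, with edge values repeated."""
    half = window // 2
    padded = np.pad(spectrum, half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(spectrum))])

# Synthetic smooth spectrum with two spikes added (not real HSI data).
clean = np.sin(np.linspace(0.0, np.pi, 50)) + 1.0
noisy = clean.copy()
noisy[[10, 30]] += 5.0

filtered = median_filter_spectrum(noisy)
print(f"PSNR noisy: {psnr(clean, noisy):.1f} dB, filtered: {psnr(clean, filtered):.1f} dB")
```

A pixel-wise filter of this kind handles isolated spikes but uses no spatial information, which is the limitation the shearlet-based method is meant to address when noise spans consecutive wavelengths.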
•Fusion of scatter correction for NIRS modelling.•SPORT approach was used for the fusion.•Different scatter correction techniques provided complementary information.•Fusion works better in diffuse reflectance than transmittance measurement mode.
Near-infrared spectroscopy (NIRS) is a key non-destructive technique for rapid assessment of the chemical properties of food materials. However, a major challenge with NIRS is the mixed physicochemical phenomena captured by the interaction of the light with the matter. The interaction often results in both absorption and scattering of the light. The overall NIRS signal therefore contains information related to the two phenomena mixed. To predict chemical properties such as dry matter, Brix and lipids, light reflection/absorption is used. Therefore, when the aim of the data analysis is to predict chemical components, it is necessary to remove the scattering effects from the spectra as much as possible. Several pre-processing techniques are available to do this, but it is often difficult to decide which one to choose. In this article we present the use of a recently developed pre-processing approach, sequential pre-processing through orthogonalization (SPORT), to improve the predictive power of multivariate models based on NIR spectra of food materials. The SPORT approach utilizes sequential orthogonalized partial least square regression (SOPLS) for the fusion of data blocks corresponding to several spectral preprocessing techniques. The results were compared with commonly used pre-processing techniques in the analysis of food materials by NIRS. The comparison was made by analyzing five different datasets comprising apples, apricots, olive oils and grapes associated with chemical properties such as dry matter (DM), Brix, lipids and citric acid. The datasets were from both reflection and transmission measurements. The results showed that the fusion-based pre-processing methodology is an ideal choice for pre-processing of NIRS data. For four out of five datasets, the prediction accuracies (high R2pred and low RMSEP) were improved.
The improvement led to as much as a 20 % increase in R2pred and a 25 % decrease in RMSEP compared to the standard 2nd derivative pre-processing. The pre-processing fusion was more effective for the reflection mode compared to the transmission mode. Multiple pre-processing techniques provided complementary information, and therefore, their fusion using the SPORT approach improved the model performance. The methodology is not only applicable to food materials but can in fact be used as a general pre-processing approach for all types of modeling of spectral data.