Model validation is the most important part of building a supervised model. Building a model with good generalization performance requires a sensible data splitting strategy, which is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and the sample set partitioning based on joint X–Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor in the quality of the generalization performance estimated from the validation set. We found a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models were then moving towards approximations of the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimate of model performance. Finally, systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather unrepresentative sample set for performance estimation.
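To illustrate why systematic selection leaves an unrepresentative validation set, the following is a minimal sketch of the Kennard-Stone algorithm described above, run on a hypothetical random dataset (the function name and data are illustrative, not from the study itself): K-S seeds the training set with the two most distant samples and then repeatedly adds the candidate farthest from everything already selected, so the leftover validation samples cluster in the interior of the data cloud.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone sample selection: seed with the two mutually most
    distant samples, then repeatedly add the candidate whose minimum
    distance to the already-selected set is largest."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # minimum distance of each remaining sample to the selected set
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return np.array(selected), np.array(remaining)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))          # hypothetical 40-sample dataset
train_idx, val_idx = kennard_stone(X, 28)  # roughly a 70/30 split
```

Because the most "extreme" samples are consumed first, `val_idx` ends up covering only the densest region of the distribution, which is consistent with the poor performance estimates reported above for K-S and SPXY.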
Parkinson's disease (PD) is a progressive neurodegenerative disorder characterised by degeneration of distinct neuronal populations, including the dopaminergic neurons of the substantia nigra. Here, we use a metabolomics profiling approach to identify changes to lipids in PD observed in sebum, a non-invasively available biofluid. We used liquid chromatography-mass spectrometry (LC-MS) to analyse 274 samples from participants (80 drug-naïve PD, 138 medicated PD and 56 well-matched control subjects) and detected metabolites that could predict the PD phenotype. Pathway enrichment analysis shows alterations in lipid metabolism related to the carnitine shuttle, sphingolipid metabolism, arachidonic acid metabolism and fatty acid biosynthesis. This study shows that sebum can be used to identify potential biomarkers for PD.
Ninety-four years have passed since the discovery of the Raman effect, and there are currently more than 25 different types of Raman-based techniques. The past two decades have witnessed the blossoming of Raman spectroscopy as a powerful physicochemical technique with broad applications within the life sciences. In this review, we critique the use of Raman spectroscopy as a tool for quantitative metabolomics. We overview recent developments of Raman spectroscopy for the identification and quantification of disease biomarkers in liquid biopsies, with a focus on recent advances within surface-enhanced Raman scattering-based methods. Finally, we discuss the applications of imaging modalities based on Raman scattering as label-free methods to study the abundance and distribution of biomolecules in cells and tissues, including mammalian, algal, and bacterial cells.
•PLS-DA, PC-DFA, SVM and RF analyses were compared for metabolomics analyses.
•Parsimonious models for feature selection and data reduction were presented.
•Comparisons include generally recognized pros along with specific caveats for each of the methods.
•Statistical models applied in the analysis of metabolomics data were shown.
•Pros and cons of common analytical techniques used in metabolomics studies are highlighted.
Partial least squares-discriminant analysis (PLS-DA) is the most widely known tool for classification and regression in metabolomics, and its predominance means that not all researchers are fully aware of alternative multivariate classification algorithms. This may in part be due to the widespread availability of PLS-DA in most well-known statistical software packages, where it is very easy to apply with default settings. In addition, one of the perceived advantages of PLS-DA is its ability to analyse highly collinear and noisy data, and the calibration model provides a variety of useful statistics, such as prediction accuracy as well as scores and loadings plots. However, the method can produce misleading results, largely through a lack of suitable statistical validation, when used by non-experts who are unaware of its potential limitations in metabolomics. This tutorial review provides an introductory overview of several straightforward statistical methods, such as principal component-discriminant function analysis (PC-DFA), support vector machines (SVM) and random forests (RF), which can easily be used either to augment PLS or as alternative supervised learning methods to PLS-DA. These methods are particularly appropriate for the analysis of large, highly complex datasets, which are common outputs of metabolomics studies where the number of variables often far exceeds the number of samples. These alternative techniques may also be useful for generating parsimonious models through feature selection and data reduction, as well as providing more propitious results. We sincerely hope that the general reader is left with little doubt that there are several promising and readily available alternatives to PLS-DA for analysing large and highly complex datasets.
Background
Quality assurance (QA) and quality control (QC) are two quality management processes that are integral to the success of metabolomics, including their application to the acquisition of high-quality data in any high-throughput analytical chemistry laboratory. QA defines all the planned and systematic activities implemented before samples are collected, to provide confidence that a subsequent analytical process will fulfil predetermined requirements for quality. QC can be defined as the operational techniques and activities used to measure and report these quality requirements after data acquisition.
Aim of review
This tutorial review will guide the reader through the use of system suitability and QC samples, why these samples should be applied and how the quality of data can be reported.
Key scientific concepts of review
System suitability samples are applied to assess the operation and lack of contamination of the analytical platform prior to sample analysis. Isotopically-labelled internal standards are applied to assess system stability for each sample analysed. Pooled QC samples are applied to condition the analytical platform, perform intra-study reproducibility measurements (QC) and to correct mathematically for systematic errors. Standard reference materials and long-term reference QC samples are applied for inter-study and inter-laboratory assessment of data.
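The mathematical correction of systematic error mentioned above can be sketched for a single feature as follows. This is a deliberately simple stand-in for the smoothing-based QC correction methods used in practice (e.g. LOESS-based QC-RLSC): a low-order polynomial is fitted to the pooled-QC intensities over injection order and every injection is divided by the fitted trend. The simulated drift, feature values and polynomial degree are all illustrative assumptions.

```python
import numpy as np

def qc_drift_correct(intensity, run_order, is_qc, degree=2):
    """Correct within-batch signal drift for one feature: fit a polynomial
    to the pooled-QC intensities over run order, then rescale every
    injection by the fitted trend (normalised to the QC median)."""
    coeffs = np.polyfit(run_order[is_qc], intensity[is_qc], deg=degree)
    trend = np.polyval(coeffs, run_order)
    return intensity * np.median(intensity[is_qc]) / trend

rng = np.random.default_rng(2)
order = np.arange(50, dtype=float)
drift = 1.0 - 0.008 * order               # simulated ~40% intensity decay
signal = 1000 * drift + rng.normal(0, 10, 50)
is_qc = (np.arange(50) % 5 == 0)          # a pooled QC every 5th injection
corrected = qc_drift_correct(signal, order, is_qc)

# the pooled QCs are identical aliquots, so their relative standard
# deviation measures reproducibility before and after correction
rsd_before = signal[is_qc].std() / signal[is_qc].mean()
rsd_after = corrected[is_qc].std() / corrected[is_qc].mean()
```

Because the pooled QC is the same material injected repeatedly, any trend in its intensities is instrumental rather than biological, which is what licenses dividing it out of the study samples.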
We report the use of a novel technology based on optical photothermal infrared (O-PTIR) spectroscopy for obtaining simultaneous infrared and Raman spectra from the same location of the sample, allowing us to study bacterial metabolism by monitoring the incorporation of ¹³C- and ¹⁵N-labeled compounds. Infrared data obtained from bulk populations and single cells via O-PTIR spectroscopy were compared with conventional Fourier transform infrared (FTIR) spectroscopy in order to evaluate the reproducibility of the results achieved by all three approaches. The Raman spectra acquired were concomitant with infrared data from bulk populations as well as infrared spectra collected from single cells, and were subjected to principal component analysis in order to evaluate any specific separation resulting from the isotopic incorporation. Similar clustering patterns were observed in infrared data acquired from single cells via O-PTIR spectroscopy and from bulk populations via FTIR and O-PTIR spectroscopies, indicating full incorporation of the heavy isotopes by the bacteria. Satisfactory discrimination between unlabeled (¹²C¹⁴N), ¹³C¹⁴N- and ¹²C¹⁵N-labeled bacteria was also obtained using Raman spectra from bulk populations. In this report, we also discuss the limitations of O-PTIR technology for acquiring Raman data from single bacterial cells (with typical dimensions of 1 × 2 µm), as well as spectral artifacts induced by thermal damage when analyzing very small amounts of biomass (a bacterium typically weighs ~1 pg).
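The principal component analysis step described above can be sketched as follows, using entirely simulated spectra rather than the study's data: heavy-isotope incorporation red-shifts vibrational bands, so each labelling state is mimicked here as a Gaussian band at a slightly different position (the band positions, noise level and group sizes are all illustrative assumptions), and PCA is then used to check whether the groups separate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
wavenumbers = np.linspace(600, 1800, 400)  # cm^-1 axis, illustrative

def simulate_group(center, n=20):
    """Hypothetical spectra: one Gaussian band at `center` plus noise."""
    band = np.exp(-((wavenumbers - center) / 15.0) ** 2)
    return band + rng.normal(0, 0.02, (n, wavenumbers.size))

# three groups mimicking progressive band red-shifts upon labelling
X = np.vstack([simulate_group(1655),
               simulate_group(1630),
               simulate_group(1615)])
labels = np.repeat([0, 1, 2], 20)

scores = PCA(n_components=2).fit_transform(X)
centroids = np.array([scores[labels == g].mean(axis=0) for g in (0, 1, 2)])
```

With band shifts well above the noise level, the three group centroids fall in distinct regions of the PC1/PC2 score space, which is the kind of clustering pattern the abstract uses as evidence of isotope incorporation.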