Many modern analytical methods are used to analyse samples from designed experiments, for example in the medical, biological, or agronomic fields. These methods most often generate highly multivariate data such as spectra or images. This is the case for “omics” technologies used to detect genes (genomics), mRNA (transcriptomics), proteins (proteomics), or metabolites (metabolomics) in a specific biological sample. These technologies produce high-dimensional multivariate databases in which the number of variables (descriptors) tends to be much larger than the number of experimental units. Moreover, omics experiments often follow designs aimed at understanding the effect of several factors on biological systems. Multivariate statistical tools are therefore needed to highlight variables that are consistently modified by different biological states. In this context, 2 recent methods combine analysis of variance (ANOVA) and principal component analysis (PCA), namely ASCA (ANOVA–simultaneous component analysis) and APCA (ANOVA-PCA). They provide powerful tools to visualize multivariate structures in the space of each effect of the statistical model linked to the experimental design. Their main limitation is that they provide biased estimators of the factor effects when the experimental design is unbalanced. This paper introduces 2 new methods, ASCA+ and APCA+, which extend ASCA and APCA, respectively, to unbalanced designs using several principles from the theory of general linear models. Both methods are applied to real-life metabolomics data, clearly demonstrating the capacity of ASCA+ and APCA+ to highlight correct biomarkers corresponding to effects of interest in unbalanced designs.
This paper presents 2 new methods, ASCA+ and APCA+, which extend ASCA and APCA, respectively, to unbalanced designs. The new methods rely on the general linear model to estimate factor effects by least squares rather than by the simple differences of means used in classical ASCA and APCA. Their application to real-life metabolomics data shows their advantage in highlighting biomarkers corresponding to a factor of interest in unbalanced designs.
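The contrast between least-squares estimation and simple differences of means can be illustrated on a toy case. The sketch below is not the authors' implementation: it assumes a made-up, noise-free, single-variable response, a two-factor design with two levels per factor, sum (deviation) coding, and no interaction term. The names `solve`, `least_squares`, `cells`, and `a_naive` are hypothetical. In a multivariate setting the same fit would presumably be repeated for every column of the data matrix before applying PCA to each effect matrix.

```python
# Sketch: least-squares (GLM) effect estimate vs. difference of means
# in an unbalanced two-factor design. All data are made up.

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def least_squares(X, y):
    """Return beta minimising ||y - X beta||^2 via the normal equations."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Unbalanced design: cell sizes 3, 1, 1, 3 for (A, B) = (+,+), (+,-), (-,+), (-,-).
cells = [(+1, +1)] * 3 + [(+1, -1)] + [(-1, +1)] + [(-1, -1)] * 3

# Sum coding: columns are intercept, factor A, factor B.
X = [[1.0, float(a), float(b)] for a, b in cells]

# Noise-free response with known effects: mu = 10, A effect = 2, B effect = 1.
y = [10.0 + 2.0 * a + 1.0 * b for a, b in cells]

mu_hat, a_hat, b_hat = least_squares(X, y)

# Classical estimate: mean of level A = +1 minus the grand mean.
grand_mean = sum(y) / len(y)
level_a = [yi for (a, _), yi in zip(cells, y) if a == +1]
a_naive = sum(level_a) / len(level_a) - grand_mean

print(f"least squares: {a_hat:.3f}, difference of means: {a_naive:.3f}")
```

Because the cell sizes differ, the level mean of factor A is contaminated by factor B, so the difference of means drifts away from the true effect of 2 while the least-squares estimate recovers it.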
Introduction
Using 2D NMR data sources (COSY in this paper) yields general metabolomics results that are at least as good as those obtained with 1D NMR data, and does so with a less advanced and less complex level of pre-processing. But one major issue remains and can considerably slow the widespread use of 2D data sources in metabolomics: the experiment duration.
Objective
The goal of this paper is to overcome the experiment duration issue in our recently published MIC strategy by considering faster 2D COSY acquisition techniques: a conventional COSY with a reduced number of transients and the use of the Non-Uniform Sampling (NUS) method. These faster alternatives are all subjected to novel 2D pre-processing workflows and to Metabolomic Informative Content analyses. Finally, results are compared to those obtained with conventional COSY spectra.
Methods
To pre-process the 2D data sources, the Global Peak List (GPL) workflow and the Vectorization workflow are used. To compare these data sources and to detect the most informative one(s), MIC (Metabolomic Informative Content) indexes are used, based on clustering and inertia measures of quality.
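The abstract does not give the exact definition of the MIC indexes. As a rough illustration of what an inertia-based quality measure can look like, the sketch below computes the share of total inertia explained by a grouping (between-group inertia divided by total inertia). The function names and the toy data are assumptions made for this example only.

```python
# Sketch: between-group / total inertia ratio as a grouping-quality measure.
# This is a generic criterion, not necessarily the MIC index of the paper.

def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def inertia_ratio(X, labels):
    """Return between-group inertia divided by total inertia (in [0, 1])."""
    g = centroid(X)
    total = sum(sq_dist(x, g) for x in X)
    groups = {}
    for x, lab in zip(X, labels):
        groups.setdefault(lab, []).append(x)
    between = sum(len(rows) * sq_dist(centroid(rows), g)
                  for rows in groups.values())
    return between / total

# Toy data: four "spectra" described by two variables.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

ratio_matching = inertia_ratio(X, ["a", "a", "b", "b"])    # groups follow the data
ratio_mismatched = inertia_ratio(X, ["a", "b", "a", "b"])  # groups cut across it

print(ratio_matching, ratio_mismatched)
```

A ratio close to 1 means the known groups are well separated in the data set, which is the kind of property one would want from the more informative data source.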
Results
Results are discussed according to a multi-factor experimental design (which is unsupervised and based on human urine samples). Descriptive PCA results and MIC indexes are shown, leading to the direct and objective comparison of the different data sets.
Conclusion
In conclusion, it is demonstrated that conventional COSY spectra recorded with only one transient per increment and COSY spectra recorded with 50% non-uniform sampling provide MIC results very similar to those of the initial COSY recorded with four transients, but in a much shorter time. Consequently, techniques such as reducing the number of transients or NUS can open the door to a potential high-throughput use of 2D COSY spectra in metabolomics.
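To give an order of magnitude for the time saved, the sketch below uses the usual first-order approximation that 2D acquisition time scales with the number of transients multiplied by the number of sampled increments. It ignores dummy scans and other fixed overheads, and it assumes the NUS experiment keeps four transients, which the abstract does not state. `relative_time` is a hypothetical helper.

```python
# Sketch: relative acquisition time under the approximation
# time ~ transients x fraction of increments actually sampled.

def relative_time(transients, sampled_fraction, ref_transients=4):
    """Time relative to a fully sampled COSY with `ref_transients` transients."""
    return (transients * sampled_fraction) / ref_transients

# Conventional COSY with one transient per increment, fully sampled.
one_transient = relative_time(transients=1, sampled_fraction=1.0)

# 50% NUS, ASSUMING four transients are kept (not stated in the abstract).
nus_half = relative_time(transients=4, sampled_fraction=0.5)

print(one_transient, nus_half)
```

Under these assumptions the one-transient experiment takes about a quarter of the reference time and the 50% NUS experiment about half.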
Introduction
The pre-processing of analytical data in metabolomics must be considered as a whole to allow the construction of a global and unique object for any further simultaneous data analysis or multivariate statistical modelling. For 1D ¹H-NMR metabolomics experiments, best practices for data pre-processing are well defined, but not yet for 2D experiments (for instance COSY in this paper).
Objective
By considering the added value of a second dimension, the objective is to propose two workflows dedicated to 2D NMR data handling and preparation (the Global Peak List and Vectorization approaches) and to compare them (with respect to each other and with 1D standards). This makes it possible to detect which methodology is best in terms of amount of metabolomic content and to explore the advantages of the selected workflow in distinguishing among treatment groups and identifying relevant biomarkers. This paper therefore explores the necessity of novel 2D pre-processing workflows, the evaluation of their quality, and the evaluation of their performance in the subsequent determination of accurate (2D) biomarkers.
Methods
To select the most informative data source, MIC (Metabolomic Informative Content) indexes are used, based on clustering and inertia measures of quality. Then, to highlight biomarkers or critical spectral zones, the PLS-DA model is used, along with more advanced sparse algorithms (sPLS and L-sOPLS).
Results
Results are discussed according to two different experimental designs (one which is unsupervised and based on human urine samples, and the other which is controlled and based on spiked serum media). MIC indexes are shown, leading to the choice of the most relevant workflow to use thereafter. Finally, biomarkers are provided for each case and the predictive power of each candidate model is assessed with cross-validated measures of RMSEP.
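A cross-validated RMSEP (root mean squared error of prediction) can be sketched as follows. The abstract does not specify the cross-validation scheme, so leave-one-out is assumed here, and a one-variable straight-line fit stands in for the PLS-type models purely to keep the example short. All names and data are hypothetical.

```python
import math

# Sketch: leave-one-out cross-validated RMSEP.
# A univariate straight-line fit stands in for the PLS-type models.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def loo_rmsep(xs, ys):
    """Predict each sample from a model fitted without it, then pool the errors."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        squared_errors.append((ys[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
exact = [2.0 * x + 1.0 for x in xs]   # perfectly linear response
noisy = [1.2, 2.7, 5.3, 6.8, 9.1]     # made-up noisy response

rmsep_exact = loo_rmsep(xs, exact)
rmsep_noisy = loo_rmsep(xs, noisy)
print(rmsep_exact, rmsep_noisy)
```

Because every prediction is made on a sample the model has not seen, a lower RMSEP indicates better predictive power rather than a better fit to the training data.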
Conclusion
In conclusion, it is shown that no solution is universally the best in every case, but that 2D experiments make it possible to clearly find relevant cross-peak biomarkers even with poor initial separability between groups. The MIC measures linked with the candidate workflows (2D GPL, 2D vectorization, 1D, and with specific parameters) show which data set should be used as a priority to find biomarkers more easily. The diversity of data sources, mainly 1D versus 2D, may often lead to complementary or confirmatory results.
Introduction
In the context of metabolomics analyses, partial least squares (PLS) represents the standard tool to perform regression and classification. OPLS, the Orthogonal extension of PLS which has proved to be very useful when interpretation is the main issue, is a more recent way to decompose the PLS solution into predictive components correlated to the target Y and components pertaining to the data X but uncorrelated to Y. This predominance of (O)PLS can raise the question of the awareness of alternative multivariate regression and/or classification tools able to find biomarkers. Actually, the search for biomarkers remains a key issue in metabolomics as it is crucial to very accurately target discriminating features.
Objective
Most of the time, (O)PLS methods perform well but a drawback often occurs: too many variables can be selected as potential biomarkers even using adapted statistical significance tests. However, for final users (in medical studies for instance), it can be advantageous to deal with only a small number of easily interpretable biomarkers.
Methods
This drawback is approached in this paper via the use of sparse methods. The sparse-PLS (sPLS), an extension of PLS which promotes an inner variable/feature selection, is an interesting existing solution. But a new intuitive algorithm is proposed in this paper to combine sparsity and the advantages of an orthogonalization step: the “Light-sparse-OPLS” (L-sOPLS). L-sOPLS promotes sparsity on a previously optimized deflated matrix which implies the removal of the Y-orthogonal components.
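How sparsity leads to variable selection can be illustrated with soft-thresholding of a PLS-type weight vector, a common device in sparse PLS variants. The sketch below is not the L-sOPLS algorithm: it skips the orthogonal deflation step entirely, handles a single component and a single response, and uses made-up centered data. The threshold `lam` and all function names are assumptions for illustration.

```python
import math

# Sketch: one sparse weight vector obtained by soft-thresholding w = X^T y.
# Illustrates sparsity only; the orthogonal deflation of L-sOPLS is omitted.

def soft_threshold(w, lam):
    """Shrink each entry towards zero by lam; entries below lam become zero."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in w]

def sparse_weights(X, y, lam):
    """Covariance-type weights X^T y, soft-thresholded and scaled to unit norm."""
    w = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(len(X[0]))]
    w = soft_threshold(w, lam)
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w] if norm > 0 else w

# Made-up centered data: 4 samples x 3 variables. Only variable 0 follows y.
X = [[-1.0, -1.0,  0.1],
     [-1.0,  1.0, -0.2],
     [ 1.0, -1.0,  0.2],
     [ 1.0,  1.0, -0.1]]
y = [-1.0, -1.0, 1.0, 1.0]

w_sparse = sparse_weights(X, y, lam=0.5)
selected = [j for j, v in enumerate(w_sparse) if v != 0]
print(w_sparse, selected)
```

Variables whose weight is driven exactly to zero drop out of the component, which is what yields a short list of candidate biomarkers.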
Results
A discussion around the compromise between sparsity and predictive modelling performances is provided and it is shown that L-sOPLS produces convincing results, illustrated principally on the basis of ¹H-NMR spectral data but also on genomic RT-qPCR data.
Conclusion
The L-sOPLS algorithm achieves better predictive performance than (O)PLS and sPLS while taking into account only a very small number of relevant descriptors.
Compared with the widely used ¹H-NMR spectroscopy, two-dimensional NMR experiments provide more sophisticated spectra which should facilitate the identification of relevant spectral zones or biomarkers in metabolomics. This paper focuses on ¹H-¹H COrrelation SpectroscopY (COSY) spectral data. In spite of longer inherent acquisition times, it is commonly accepted by users (biologists, healthcare professionals) that the introduction of an additional dimension probably represents a huge qualitative step for investigations in terms of metabolites identification. Moreover, it seems natural that more information leads to more predictive power. But, until now, very few statistical studies clearly proved this assumption. Therefore a fundamental question is “Is this supplementary information relevant?”. In order to extend the statistical properties developed for 1D spectroscopy to the challenges raised by 2D spectra, a rigorous study of the performances of COSY spectra is needed as a prerequisite. Having introduced new pre-processing concepts, such as the Global Peak List or an ad hoc 2D “bucketing”, this paper presents an innovative methodology based on multivariate clustering algorithms to evaluate this question. Numerical clustering quality indexes and graphical results are proposed, based both on the spectral presence or absence of peaks (binary position vectors) and on peak intensities, and through different levels of spectral resolution. The second goal of this paper is to compare clustering performances obtained on COSY and on ¹H-NMR spectra, with the aim of understanding to what extent the COSY spectra carry more Metabolomic Informative Content about the signal than 1D ones. The methodology is applied to two real experimental designs involving different groups of spectra (which define the signal): a 4-mixture cell culture media containing various supervised metabolites and a complex human serum based design. It is shown that COSY spectra appear to be statistically powerful and, in addition, provide better clustering results than corresponding ¹H-NMR when using unlabeled information. Consequently, additional information appears to be relevant for metabolomics applications.
The 27th Journées du longitudinal (JDL) set out to shed light on contemporary educational and professional careers. They aim to examine the interplay between individual choices and social structures, to update knowledge in this field of research, and to discuss the new perspectives opened up by the methodological advances the field has seen since the 2000s. The contributions presented in this volume thus give a better understanding of the breaks and continuities in trajectories, particularly with regard to the impact of crises, and analyse the different stages of a career from the point of view of individuals and of labour-market constraints, through a variety of approaches: combining quantitative methods that are usually kept separate, combining quantitative and qualitative methods, or using new data sources (big data). The conference is organised by the PACTE laboratory (UMR 5194, IEP-CNRS-Université Grenoble Alpes), an associated centre of Céreq, with the participation of the LaRAC laboratory (EA 602, Université Grenoble Alpes). Each year, the JDL, organised by Céreq or one of its associated centres, bring together researchers around a theme rooted in a longitudinal approach to analysing the relationship between training and employment. The proceedings are published every year by Céreq.
Introduction Continuing training for employees is, in Belgium, among the priorities of the political authorities at the various levels of government. At the European level, the Lisbon strategy had reinforced this injunction by adopting a target indicator (12.5% of the adult population), before EU 2020 set one in terms of adults' attainment of a third-cycle diploma. The federal government has set a long-term objective of 5 days of training per year for employees...