Introduction
Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction models. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be more suited to modelling possible nonlinear metabolite covariance, and thus provide better predictive models.
Objectives
We hypothesise that for binary classification using metabolomics data, non-linear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard PLS discriminant analysis.
Methods
We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks.
Results
There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty of model prediction such that the quality of metabolomics data was observed to be a bigger influence on generalised performance than model choice.
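As an illustration of the out-of-bag bootstrap idea mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes NumPy is available, uses synthetic data, and substitutes a plain least-squares linear scorer (`fit_linear`, a hypothetical stand-in) for the eight algorithms compared in the study. Each resample is fitted on the in-bag samples and scored on the samples left out of that resample; percentiles of the out-of-bag scores give the interval.

```python
import numpy as np

def auc(y_true, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

def fit_linear(X, y):
    """Hypothetical stand-in model: least-squares linear scorer with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def oob_bootstrap_ci(X, y, n_boot=200, alpha=0.05, seed=0):
    """Median and percentile interval of the out-of-bag AUC over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)           # sample with replacement
        oob = np.setdiff1d(np.arange(n), in_bag)      # samples never drawn
        # Skip resamples in which either split lacks one of the two classes.
        if len(np.unique(y[oob])) < 2 or len(np.unique(y[in_bag])) < 2:
            continue
        coef = fit_linear(X[in_bag], y[in_bag])
        scores.append(auc(y[oob], predict_linear(coef, X[oob])))
    low, high = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.median(scores)), float(low), float(high)

# Synthetic two-class data: only the first of five features carries signal.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 5))
X[:, 0] += 1.5 * y

median_auc, ci_low, ci_high = oob_bootstrap_ci(X, y)
print(f"OOB AUC {median_auc:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

A wide interval on a small data set signals that apparent differences between algorithms may sit within the resampling noise, which is the kind of comparison the Results describe.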
Conclusion
The size of the data set, and choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.
Background
Metabolomics data, with its complex covariance structure, is typically modelled by projection-based machine learning (ML) methods such as partial least squares (PLS) regression, which project data into a latent structure. Biological data are often non-linear, so it is reasonable to hypothesize that metabolomics data may also have a non-linear latent structure, which in turn would be best modelled using non-linear equations. A non-linear ML method with a similar projection equation structure to PLS is artificial neural networks (ANNs). While ANNs were first applied to metabolic profiling data in the 1990s, the lack of community acceptance combined with limitations in computational capacity and the lack of volume of data for robust non-linear model optimisation inhibited their widespread use. Due to recent advances in computational power, modelling improvements, community acceptance, and the more demanding needs for data science, ANNs have made a recent resurgence in interest across research communities, including a small yet growing usage in metabolomics. As metabolomics experiments become more complex and start to be integrated with other omics data, there is potential for ANNs to become a viable alternative to linear projection methods.
Aim of review
We aim to first describe ANNs and their structural equivalence to linear projection-based methods, including PLS regression. We then review the historical, current, and future uses of ANNs in the field of metabolomics.
Key scientific concept of review
Is metabolomics ready for the return of artificial neural networks?
Background
A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive and intuitive for both computational novices and experts alike.
Aim of Review
To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.
Key Scientific Concepts of Review
This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.
Introduction
Metabolomics data is commonly modelled multivariately using partial least squares discriminant analysis (PLS-DA). Its success is primarily due to ease of interpretation, through projection to latent structures, and transparent assessment of feature importance using regression coefficients and Variable Importance in Projection scores. In recent years several non-linear machine learning (ML) methods have grown in popularity but with limited uptake essentially due to convoluted optimisation and interpretation. Artificial neural networks (ANNs) are a non-linear projection-based ML method that share a structural equivalence with PLS, and as such should be amenable to equivalent optimisation and interpretation methods.
Objectives
We hypothesise that standardised optimisation, visualisation, evaluation and statistical inference techniques commonly used by metabolomics researchers for PLS-DA can be migrated to a non-linear, single hidden layer, ANN.
Methods
We compared a standardised workflow of optimisation, visualisation, evaluation and statistical inference techniques for PLS with the proposed ANN workflow. Both workflows were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks on GitHub.
Results
The migration of the PLS workflow to a non-linear, single hidden layer, ANN was successful. The significant metabolites determined using PLS model coefficients were similar to those determined using the ANN Connection Weight Approach.
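For readers unfamiliar with the Connection Weight Approach, the sketch below shows the standard calculation for a single-hidden-layer network with one output: for each input, the input-to-hidden weight is multiplied by the corresponding hidden-to-output weight and the products are summed over the hidden neurons. The weight values and the function name `connection_weights` are hypothetical, chosen only for illustration, and NumPy is assumed. How the study then judged which connection weights were significant is not described in this abstract and is not reproduced here.

```python
import numpy as np

def connection_weights(w_input_hidden, w_hidden_output):
    """Connection weight per input: sum over hidden units of
    (input-to-hidden weight) * (hidden-to-output weight)."""
    w_input_hidden = np.asarray(w_input_hidden, dtype=float)    # shape (n_inputs, n_hidden)
    w_hidden_output = np.asarray(w_hidden_output, dtype=float)  # shape (n_hidden,)
    return w_input_hidden @ w_hidden_output

# Hypothetical weights for 3 metabolites and 2 hidden neurons.
W_in = [[1.0, -2.0],
        [0.5,  0.5],
        [0.0,  3.0]]
w_out = [2.0, 1.0]

cw = connection_weights(W_in, w_out)
ranking = np.argsort(-np.abs(cw))   # inputs ordered by absolute contribution
print(cw, ranking)
```

The sign of each value indicates the direction of an input's overall contribution, which is what makes it comparable in spirit to a signed PLS regression coefficient.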
Conclusion
We have shown that it is possible to migrate the standardised PLS-DA workflow to simple non-linear ANNs. This result opens the door for more widespread use and to the investigation of transparent interpretation of more complex ANN architectures.
The application of large-scale metabolomic profiling provides new opportunities for realizing the potential of omics-based precision medicine for asthma. By leveraging data from over 14,000 individuals in four distinct cohorts, this study identifies and independently replicates 17 steroid metabolites whose levels were significantly reduced in individuals with prevalent asthma. Although steroid levels were reduced among all asthma cases regardless of medication use, the largest reductions were associated with inhaled corticosteroid (ICS) treatment, as confirmed in a 4-year low-dose ICS clinical trial. Effects of ICS treatment on steroid levels were dose dependent; however, significant reductions also occurred with low-dose ICS treatment. Using information from electronic medical records, we found that cortisol levels were substantially reduced throughout the entire 24-hour daily period in patients with asthma who were treated with ICS compared to those who were untreated and to patients without asthma. Moreover, patients with asthma who were treated with ICS showed significant increases in fatigue and anemia as compared to those without ICS treatment. Adrenal suppression in patients with asthma treated with ICS might, therefore, represent a larger public health problem than previously recognized. Regular cortisol monitoring of patients with asthma treated with ICS is needed to provide the optimal balance between minimizing adverse effects of adrenal suppression while capitalizing on the established benefits of ICS treatment.
Current guidelines do not sufficiently capture the heterogeneous nature of asthma; a more detailed molecular classification is needed. Metabolomics represents a novel and compelling approach to derive asthma endotypes (i.e., subtypes defined by functional and/or pathobiological mechanisms).
To validate metabolomic-driven endotypes of asthma and explore their underlying biology.
In the Genetics of Asthma in Costa Rica Study (GACRS), untargeted metabolomic profiling, similarity network fusion, and spectral clustering were used to identify metabo-endotypes of asthma, and differences in asthma-relevant phenotypes across these metabo-endotypes were explored. The metabo-endotypes were recapitulated in the Childhood Asthma Management Program (CAMP), and clinical differences were determined. Metabolomic drivers of metabo-endotype membership were investigated by meta-analyzing findings from GACRS and CAMP.
Five metabo-endotypes were identified in GACRS with significant differences in asthma-relevant phenotypes, including prebronchodilator (p-ANOVA = 8.3 × 10^[exponent missing]) and postbronchodilator (p-ANOVA = 1.8 × 10^[exponent missing]) FEV1/FVC. These differences were validated in the recapitulated metabo-endotypes in CAMP. Cholesterol esters, triglycerides, and fatty acids were among the most important drivers of metabo-endotype membership. The findings suggest dysregulation of pulmonary surfactant homeostasis may play a role in asthma severity.
Clinically meaningful endotypes may be derived and validated using metabolomic data. Interrogating the drivers of these metabo-endotypes has the potential to help understand their pathophysiology.
Respiratory infections are a leading cause of morbidity and mortality in early life, and recurrent infections increase the risk of developing chronic diseases. The maternal environment during pregnancy can impact offspring health, but the factors leading to increased infection proneness have not been well characterized during this period. Steroids have been implicated in respiratory health outcomes and may similarly influence infection susceptibility. Our objective was to describe relationships between maternal steroid levels and offspring infection proneness. Using adjusted Poisson regression models, we evaluated associations between sixteen androgenic and corticosteroid metabolites during pregnancy and offspring respiratory infection incidence across two pre-birth cohorts (N = 774 in VDAART and N = 729 in COPSAC). Steroid metabolites were measured in plasma samples from pregnant mothers across all trimesters of pregnancy by ultrahigh-performance-liquid-chromatography/mass-spectrometry. We conducted further inquiry into associations of steroids with related respiratory outcomes: asthma and lung function spirometry. Higher plasma corticosteroid levels in the third trimester of pregnancy were associated with lower incidence of offspring respiratory infections (P = 4.45 × 10^[exponent missing] to 0.002) and improved lung function metrics (P = 0.020-0.036). Elevated maternal androgens were generally associated with increased offspring respiratory infections and worse lung function, with some associations demonstrating nominal significance at P < 0.05, but these trends were inconsistent across individual androgens. Increased maternal plasma corticosteroid levels in the late second and third trimesters were associated with lower infections and better lung function in offspring, which may represent a potential avenue for intervention through corticosteroid supplementation in late pregnancy to reduce offspring respiratory infection susceptibility in early life. Clinical Trial Registry information: VDAART and COPSAC were originally conducted as clinical trials; VDAART: ClinicalTrials.gov identifier NCT00920621; COPSAC: ClinicalTrials.gov identifier NCT00798226.
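To make the modelling step concrete, here is a minimal sketch of a covariate-adjusted Poisson regression for count outcomes such as infection episodes. It is not the study's code: the data are simulated, the variable names (`steroid`, `covariate`, `infections`) and the true coefficients are hypothetical, and the model is fitted with a small hand-written iteratively reweighted least squares routine in NumPy rather than whichever statistical package the authors used.

```python
import numpy as np

def fit_poisson(X, y, n_iter=25):
    """Poisson regression with a log link, fitted by iteratively
    reweighted least squares. Returns coefficients and standard errors."""
    A = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        eta = A @ beta
        mu = np.exp(eta)                        # fitted mean counts
        z = eta + (y - mu) / mu                 # working response
        WA = A * mu[:, None]                    # IRLS weights equal the fitted mean
        beta = np.linalg.solve(A.T @ WA, WA.T @ z)
    mu = np.exp(A @ beta)
    cov = np.linalg.inv(A.T @ (A * mu[:, None]))
    return beta, np.sqrt(np.diag(cov))

# Simulated example: a standardised steroid level plus one adjustment covariate.
rng = np.random.default_rng(1)
n = 2000
steroid = rng.normal(size=n)
covariate = rng.normal(size=n)
rate = np.exp(0.7 - 0.3 * steroid + 0.2 * covariate)   # assumed true model
infections = rng.poisson(rate)

beta, se = fit_poisson(np.column_stack([steroid, covariate]), infections)
rate_ratio = float(np.exp(beta[1]))   # incidence rate ratio per unit of steroid
print(f"steroid coefficient {beta[1]:.3f} (SE {se[1]:.3f}), rate ratio {rate_ratio:.2f}")
```

A rate ratio below 1 for the steroid term corresponds to the direction of association reported for corticosteroids: higher levels, fewer infections.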
Metabolomics holds great promise for uncovering insights around biological processes impacting disease in human epidemiological studies. Metabolites can be measured across biological samples, including plasma, serum, saliva, urine, stool, and whole organs and tissues, offering a means to characterize metabolic processes relevant to disease etiology and traits of interest. Metabolomic epidemiology studies face unique challenges, such as identifying metabolites from targeted and untargeted assays, defining standards for quality control, harmonizing results across platforms that often capture different metabolites, and developing statistical methods for high-dimensional and correlated metabolomic data. In this review, we introduce metabolomic epidemiology to the broader scientific community, discuss opportunities and challenges presented by these studies, and highlight emerging innovations that hold promise to uncover new biological insights.
The rapidly emerging field of metabolomic epidemiology presents unique opportunities to gain mechanistic insights into disease risk and identify biomarkers that may inform prevention and screening strategies. Challenges include dealing with batch effects and drift, the sensitivity of metabolites to environmental exposures and the handling and processing of samples, metabolite identification, harmonizing metabolites across different platforms, analyzing high-dimensional data, the complex correlation structure of metabolites, and integrating metabolomics with other 'omic data types. We provide a broad introduction to the field for scientists from cross-disciplinary backgrounds, discussing technological, study design, quality control, and statistical considerations, opportunities and challenges in the field, and emerging innovations that hold promise to uncover new biological insights.
Circulating metabolite levels may reflect the state of the human organism in health and disease; however, the genetic architecture of metabolites is not fully understood. We performed a whole-genome sequencing association analysis of both common and rare variants in up to 11,840 multi-ethnic participants from five studies with up to 1666 circulating metabolites. We discovered 1985 novel variant-metabolite associations and validated 761 locus-metabolite associations reported previously. Seventy-nine novel variant-metabolite associations were replicated, including three genetic loci located on the X chromosome, demonstrating its involvement in metabolic regulation. Gene-based analyses provided further support for seven metabolite-replicated loci pairs and their biologically plausible genes. Among the novel replicated variant-metabolite pairs, follow-up analyses revealed that 26 metabolites colocalized with 21 tissues, seven metabolite-disease outcome associations were putatively causal, and seven metabolites might be regulated by plasma protein levels. Our results depict the genetic contribution to circulating metabolite levels, providing additional insights into understanding human disease.
Changes in cell-type composition of tissues are associated with a wide range of diseases and environmental risk factors and may be causally implicated in disease development and progression. However, these shifts in cell-type fractions are often of a low magnitude, or involve similar cell subtypes, making their reliable identification challenging. DNA methylation profiling in a tissue like blood is a promising approach to discover shifts in cell-type abundance, yet studies have only been performed at a relatively low cellular resolution and in isolation, limiting their power to detect shifts in tissue composition.
Here we derive a DNA methylation reference matrix for 12 immune-cell types in human blood and extensively validate it with flow-cytometric count data and in whole-genome bisulfite sequencing data of sorted cells. Using this reference matrix, we perform a directional Stouffer and fixed effects meta-analysis comprising 23,053 blood samples from 22 different cohorts, to comprehensively map associations between the 12 immune-cell fractions and common phenotypes. In a separate cohort of 4386 blood samples, we assess associations between immune-cell fractions and health outcomes.
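The two meta-analysis schemes named above can be sketched in a few lines using only the Python standard library. This is an illustration under stated assumptions, not the study's pipeline: the per-cohort numbers are hypothetical, and the square-root-of-sample-size weights in the Stouffer combination are a common convention assumed here rather than a detail taken from the text. The directional Stouffer method turns each cohort's two-sided p-value and effect direction into a signed z-score before combining; the fixed effects method pools effect estimates by inverse-variance weighting.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def directional_stouffer(pvalues, signs, sample_sizes):
    """Combine two-sided p-values (each strictly between 0 and 1) and effect
    directions (+1 or -1) into one z-score and two-sided p-value.
    Weights are sqrt(sample size), an assumed convention."""
    zs = [s * _STD_NORMAL.inv_cdf(1 - p / 2) for p, s in zip(pvalues, signs)]
    ws = [sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(ws, zs)) / sqrt(sum(w * w for w in ws))
    p = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    return z, p

def fixed_effects(betas, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))

# Hypothetical cohorts: agreeing directions reinforce, opposing ones cancel.
z_same, p_same = directional_stouffer([0.05, 0.05], [+1, +1], [100, 100])
z_opp, p_opp = directional_stouffer([0.05, 0.05], [+1, -1], [100, 100])
pooled_beta, pooled_se = fixed_effects([1.0, 3.0], [1.0, 1.0])
print(z_same, p_same, z_opp, p_opp, pooled_beta, pooled_se)
```

Keeping the sign is what distinguishes the directional variant: two cohorts with equally small p-values but opposite effects combine to no evidence at all, rather than to strong evidence.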
Our meta-analysis reveals many associations of cell-type fractions with age, sex, smoking and obesity, many of which we validate with single-cell RNA sequencing. We discover that naïve and regulatory T-cell subsets are higher in women compared to men, while the reverse is true for monocyte, natural killer, basophil, and eosinophil fractions. Decreased natural killer counts were associated with smoking, obesity, and stress levels, while an increased count correlated with exercise and sleep. Analysis of health outcomes revealed that increased naïve CD4+ T-cell and N-cell fractions were associated with a reduced risk of all-cause mortality independently of all major epidemiological risk factors and baseline co-morbidity. A machine learning predictor built only with immune-cell fractions achieved a C-index value for all-cause mortality of 0.69 (95% CI 0.67-0.72), which increased to 0.83 (0.80-0.86) upon inclusion of epidemiological risk factors and baseline co-morbidity.
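For context on the C-index figures, the following minimal sketch computes Harrell's concordance index for right-censored survival data in plain Python. Which C-index variant the study used is not stated here, so Harrell's definition is an assumption, and the follow-up times, event indicators, and risk scores are hypothetical. A pair of individuals is comparable when the one with the shorter follow-up time died; it is concordant when that individual also has the higher predicted risk.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs in which the individual
    who died earlier was assigned the higher risk (ties count as half)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair is comparable only if i has the shorter time and i had an event.
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical data: follow-up in years, 1 = died, 0 = censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]

c_index = concordance_index(times, events, risk)
print(c_index)  # 4 of the 5 comparable pairs are concordant, so 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which places the reported 0.69 and 0.83 on a familiar scale.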
This work contributes an extensively validated high-resolution DNAm reference matrix for blood, which is made freely available, and uses it to generate a comprehensive map of associations between immune-cell fractions and common phenotypes, including health outcomes.