Personalized, precision, P4, or stratified medicine is understood as a medical approach in which patients are stratified based on their disease subtype, risk, prognosis, or treatment response using specialized diagnostic tests. The key idea is to base medical decisions on individual patient characteristics, including molecular and behavioral biomarkers, rather than on population averages. Personalized medicine is deeply connected to and dependent on data science, specifically machine learning (often named Artificial Intelligence in the mainstream media). While there has been a lot of enthusiasm in recent years about the potential of 'big data' and machine learning-based solutions, only a few examples exist that impact current clinical practice. The lack of impact on clinical practice can largely be attributed to insufficient performance of predictive models, difficulties in interpreting complex model predictions, and a lack of validation via prospective clinical trials that demonstrate a clear benefit compared to the standard of care. In this paper, we review the potential of state-of-the-art data science approaches for personalized medicine, discuss open challenges, and highlight directions that may help to overcome them in the future.
There is a need for an interdisciplinary effort, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. Partially unrealistic expectations and concerns about data science-based solutions need to be better managed. In parallel, computational methods must advance more to provide direct benefit to clinical practice.
Recent advances in highly multiplexed immunoassays have allowed systematic large-scale measurement of hundreds of plasma proteins in large cohort studies. In combination with genotyping, such studies offer the prospect to 1) identify mechanisms involved with regulation of protein expression in plasma, and 2) determine whether the plasma proteins are likely to be causally implicated in disease. We report here the results of genome-wide association (GWA) studies of 83 proteins considered relevant to cardiovascular disease (CVD), measured in 3,394 individuals with multiple CVD risk factors. We identified 79 genome-wide significant (p<5e-8) association signals, 55 of which replicated at P<0.0007 in separate validation studies (n = 2,639 individuals). Using automated text mining, manual curation, and network-based methods incorporating information on expression quantitative trait loci (eQTL), we propose plausible causal mechanisms for 25 trans-acting loci, including a potential post-translational regulation of stem cell factor by matrix metalloproteinase 9 and receptor-ligand pairs such as RANK-RANK ligand. Using public GWA study data, we further evaluate all 79 loci for their causal effect on coronary artery disease, and highlight several potentially causal associations. Overall, a majority of the plasma proteins studied showed evidence of regulation at the genetic level. Our results enable future studies of the causal architecture of human disease, which in turn should aid discovery of new drug targets.
Abstract
Quantitative trait locus (QTL) mapping of molecular phenotypes such as metabolites, lipids and proteins through genome-wide association studies represents a powerful means of highlighting molecular mechanisms relevant to human diseases. However, a major challenge of this approach is to identify the causal gene(s) at the observed QTLs. Here, we present a framework for the 'Prioritization of candidate causal Genes at Molecular QTLs' (ProGeM), which incorporates biological domain-specific annotation data alongside genome annotation data from multiple repositories. We assessed the performance of ProGeM using a reference set of 227 previously reported and extensively curated metabolite QTLs. For 98% of these loci, the expert-curated gene was one of the candidate causal genes prioritized by ProGeM. Benchmarking analyses revealed that 69% of the causal candidates were nearest to the sentinel variant at the investigated molecular QTLs, indicating that genomic proximity is the most reliable indicator of 'true positive' causal genes. In contrast, cis-gene expression QTL data led to three false positive candidate causal gene assignments for every one true positive assignment. We provide evidence that these conclusions also apply to other molecular phenotypes, suggesting that ProGeM is a powerful and versatile tool for annotating molecular QTLs. ProGeM is freely available via GitHub.
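The proximity criterion benchmarked above (the gene nearest to the sentinel variant) can be illustrated with a minimal sketch. This is not ProGeM's code: the `Gene` record, the `nearest_gene` function, and the gene names and coordinates are all hypothetical, and measuring distance to the nearest edge of the gene body (zero when the variant lies inside the gene) is an assumption about how "nearest" might be defined.

```python
from typing import NamedTuple, Optional, Sequence


class Gene(NamedTuple):
    """Hypothetical gene annotation: name, chromosome, start and end position."""
    name: str
    chrom: str
    start: int
    end: int


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance from a variant position to the gene body (0 if inside it)."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))


def nearest_gene(chrom: str, pos: int, genes: Sequence[Gene]) -> Optional[Gene]:
    """Return the gene on the same chromosome closest to the sentinel variant."""
    candidates = [g for g in genes if g.chrom == chrom]
    if not candidates:
        return None
    return min(candidates, key=lambda g: distance_to_gene(pos, g))


# Made-up annotation used only for illustration.
genes = [
    Gene("GENE_A", "1", 1000, 5000),
    Gene("GENE_B", "1", 9000, 12000),
    Gene("GENE_C", "2", 100, 900),
]

hit = nearest_gene("1", 8000, genes)
print(hit.name if hit else "no gene on this chromosome")
```

A real annotation pipeline would typically also decide how to break ties and whether to measure distance to the transcription start site instead; those choices are left open here.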
Stratification of patient subpopulations that respond favorably to treatment or experience an adverse reaction is an essential step toward development of new personalized therapies and diagnostics. It is currently feasible to generate omic-scale biological measurements for all patients in a study, providing an opportunity for machine learning models to identify molecular markers for disease diagnosis and progression. However, the high variability of genetic background in human populations hampers the reproducibility of omic-scale markers. In this paper, we develop a biological network-based regularized artificial neural network model for prediction of phenotype from transcriptomic measurements in clinical trials. To improve model sparsity and the overall reproducibility of the model, we incorporate regularization for simultaneous shrinkage of gene sets based on active upstream regulatory mechanisms into the model.
We benchmark our method against various regression, support vector machine and artificial neural network models and demonstrate the ability of our method to predict clinical outcomes using clinical trial data on acute rejection in kidney transplantation and response to Infliximab in ulcerative colitis. We show that integration of prior biological knowledge into the classification, as developed in this paper, significantly improves the robustness and generalizability of predictions to independent datasets. We provide Java code for our algorithm along with a parsed version of the STRING DB database.
In summary, we present a method for prediction of clinical phenotypes using baseline genome-wide expression data that makes use of prior biological knowledge on gene-regulatory interactions in order to increase robustness and reproducibility of omic-scale markers. The integrated group-wise regularization method increases the interpretability of biological signatures and gives stable performance estimates across independent test sets.
Discovery of robust diagnostic or prognostic biomarkers is a key to optimizing therapeutic benefit for select patient cohorts - an idea commonly referred to as precision medicine. Most discovery studies to derive such markers from high-dimensional transcriptomics datasets are weakly powered with sample sizes in the tens of patients. Therefore, highly regularized statistical approaches are essential to making generalizable predictions. At the same time, prior knowledge-driven approaches have been successfully applied to the manual interpretation of high-dimensional transcriptomics datasets. In this work, we assess the impact of combining two orthogonal approaches for the discovery of biomarker signatures, namely (1) well-known lasso-based regression approaches and their more recent derivative, the group lasso, and (2) the discovery of significant upstream regulators in literature-derived biological networks. Our method integrates both approaches in a weighted group-lasso model and differentially weights gene sets based on inferred active regulatory mechanisms. Using nested cross-validation as well as independent clinical datasets, we demonstrate that our approach leads to increased accuracy and generalizable results. We implement our approach in a computationally efficient, user-friendly R package called creNET. The package can be downloaded at https://github.com/kouroshz/creNet and is accompanied by a parsed version of the STRING DB database.
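To make the idea of a weighted group-lasso penalty concrete, here is a minimal sketch of the generic textbook form, lambda times the weighted sum of per-group Euclidean norms, together with its block soft-thresholding step. It is not the creNET implementation (which is an R package); the function names, the group labels, and the numeric weights are hypothetical, groups are assumed not to overlap, and the convention that a smaller weight means weaker shrinkage for a gene set with an inferred active regulator is an assumption made for illustration.

```python
import math
from typing import Dict, List


def group_norm(beta: List[float], idx: List[int]) -> float:
    """Euclidean norm of the coefficients belonging to one gene set."""
    return math.sqrt(sum(beta[i] ** 2 for i in idx))


def weighted_group_penalty(beta: List[float], groups: Dict[str, List[int]],
                           weights: Dict[str, float], lam: float) -> float:
    """lam * sum over groups of weight_g * ||beta_g||_2 (non-overlapping groups)."""
    return lam * sum(weights[g] * group_norm(beta, idx) for g, idx in groups.items())


def prox_weighted_group(beta: List[float], groups: Dict[str, List[int]],
                        weights: Dict[str, float], lam: float,
                        step: float = 1.0) -> List[float]:
    """Block soft-thresholding: shrink each group, zeroing it if its norm is small."""
    out = list(beta)
    for g, idx in groups.items():
        norm = group_norm(beta, idx)
        thresh = step * lam * weights[g]
        scale = 0.0 if norm <= thresh else 1.0 - thresh / norm
        for i in idx:
            out[i] = scale * beta[i]
    return out


# Hypothetical example: two gene sets, the first weighted to be penalized less.
beta = [3.0, 4.0, 0.3, 0.4]
groups = {"regulator_A": [0, 1], "regulator_B": [2, 3]}
weights = {"regulator_A": 0.5, "regulator_B": 2.0}

print(weighted_group_penalty(beta, groups, weights, lam=1.0))
print(prox_weighted_group(beta, groups, weights, lam=1.0))
```

In this toy example the lightly weighted group is only shrunk, while the heavily weighted group is removed as a whole, which is the group-level sparsity the abstract refers to.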
Abstract
Objectives
Advances in immunotherapy by blocking TNF have remarkably improved treatment outcomes for rheumatoid arthritis (RA) patients. Although treatment specifically targets TNF, the downstream mechanisms of immune suppression are not completely understood. The aim of this study was to detect biomarkers and expression signatures of treatment response to TNF inhibition.
Methods
Peripheral blood mononuclear cells (PBMCs) from 39 female patients were collected before anti-TNF treatment initiation (day 0) and after 3 months. The study cohort included patients previously treated with MTX who failed to respond adequately. Response to treatment was defined based on the EULAR criteria and classified 23 patients as responders and 16 as non-responders. We investigated differences in gene expression in PBMCs, the proportion of cell types and cell phenotypes in peripheral blood using flow cytometry and the level of proteins in plasma. Finally, we used machine learning models to predict non-response to anti-TNF treatment.
Results
The gene expression analysis in baseline samples revealed notably higher expression of the gene EPPK1 in future responders. We detected the suppression of genes and proteins following treatment, including suppressed expression of the T cell inhibitor gene CHI3L1 and its protein YKL-40. The gene expression results were replicated in an independent cohort. Finally, machine learning models mainly based on transcriptomic data showed high predictive utility in classifying non-response to anti-TNF treatment in RA.
Conclusions
Our integrative multi-omics analyses identified new biomarkers for the prediction of response, found pathways influenced by treatment and suggested new predictive models of anti-TNF treatment in RA patients.
The ability to confidently predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. Yet, the goal of developing actionable, robust, and reproducible predictive signatures of phenotypes such as clinical outcome has not been attained in almost any disease area. Here, we report a comprehensive analysis spanning prediction tasks from ulcerative colitis, atopic dermatitis, diabetes, to many cancer subtypes for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks. We systematically investigate the influence of gene subsets, normalization methods and prediction algorithms. Crucially, we also explore the novel use of deep representation learning methods on large transcriptomics compendia, such as GTEx and TCGA, to boost the performance of state-of-the-art methods. The resources and findings in this work should serve as both an up-to-date reference on attainable performance, and as a benchmarking resource for further research.
Approaches that combine large numbers of genes outperformed single gene methods consistently and with a significant margin, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that using l-regularized regression methods applied to centered log-ratio transformed transcript abundances provides the best predictive analyses overall.
Transcriptomics-based phenotype prediction benefits from proper normalization techniques and state-of-the-art regularized regression approaches. In our view, breakthrough performance is likely contingent on factors which are independent of normalization and general modeling techniques; these factors might include reduction of systematic errors in sequencing data, incorporation of other data types such as single-cell sequencing and proteomics, and improved use of prior knowledge.
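The centered log-ratio (CLR) transform mentioned above subtracts the mean log abundance of a sample from each log abundance, which makes the result invariant to the sample's overall scale (for example, sequencing depth). The following is a minimal sketch, not the authors' pipeline; the function name `clr` and the default pseudocount of 0.5, used here to avoid taking the log of zero counts, are assumptions.

```python
import math
from typing import List, Sequence


def clr(abundances: Sequence[float], pseudocount: float = 0.5) -> List[float]:
    """Centered log-ratio transform of one sample's transcript abundances.

    Each value becomes log(x_i + pseudocount) minus the mean of those logs,
    i.e. the log of x_i relative to the sample's geometric mean.
    """
    logs = [math.log(x + pseudocount) for x in abundances]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical counts for four transcripts in one sample.
sample = [10, 200, 0, 45]
transformed = clr(sample)
print(transformed)
print(sum(transformed))  # CLR values sum to (numerically) zero
```

With `pseudocount=0`, multiplying every abundance in a sample by the same constant leaves the CLR values unchanged; any positive pseudocount breaks that invariance slightly, which is the usual trade-off for handling zeros.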
Abstract
Genetic variation in the human leukocyte antigen (HLA) loci is associated with risk of immune-mediated diseases, but the molecular effects of HLA polymorphism are unclear. Here we examined the effects of HLA genetic variation on the expression of 2940 plasma proteins across 45,330 Europeans in the UK Biobank, with replication analyses across multiple ancestry groups. We detected 504 proteins affected by HLA variants (HLA-pQTL), including widespread trans effects by autoimmune disease risk alleles. More than 80% of the HLA-pQTL fine-mapped to amino acid positions in the peptide binding groove. HLA-I and II affected proteins expressed in similar cell types but in different pathways of both adaptive and innate immunity. Finally, we investigated potential HLA-pQTL effects on disease by integrating HLA-pQTL with fine-mapped HLA-disease signals in the UK Biobank. Our data reveal the diverse effects of HLA genetic variation and aid the interpretation of associations between HLA alleles and immune-mediated diseases.
Diabetes is the leading cause of ESRD. Despite evidence for a substantial heritability of diabetic kidney disease, efforts to identify genetic susceptibility variants have had limited success. We extended previous efforts in three dimensions, examining a more comprehensive set of genetic variants in larger numbers of subjects with type 1 diabetes characterized for a wider range of cross-sectional diabetic kidney disease phenotypes. In 2843 subjects, we estimated that the heritability of diabetic kidney disease was 35% (P=6.4×10). Genome-wide association analysis and replication in 12,540 individuals identified no single variants reaching stringent levels of significance and, despite excellent power, provided little independent confirmation of previously published associated variants. Whole-exome sequencing in 997 subjects failed to identify any large-effect coding alleles of lower frequency influencing the risk of diabetic kidney disease. However, sets of alleles increasing body mass index (P=2.2×10) and the risk of type 2 diabetes (P=6.1×10) associated with the risk of diabetic kidney disease. We also found genome-wide genetic correlation between diabetic kidney disease and failure at smoking cessation (P=1.1×10). Pathway analysis implicated ascorbate and aldarate metabolism (P=9.0×10), and pentose and glucuronate interconversions (P=3.0×10) in pathogenesis of diabetic kidney disease. These data provide further evidence for the role of genetic factors influencing diabetic kidney disease in those with type 1 diabetes and highlight some key pathways that may be responsible. Altogether these results reveal important biology behind the major cause of kidney disease.
Methotrexate (MTX) is a common first-line treatment for new-onset rheumatoid arthritis (RA). However, MTX is ineffective for 30-40% of patients and there is no way to know which patients might benefit. Here, we built statistical models based on serum lipid levels measured at two time-points (pre-treatment and following 4 weeks on-drug) to investigate if MTX response (by 6 months) could be predicted. Patients about to commence MTX treatment for the first time were selected from the Rheumatoid Arthritis Medication Study (RAMS). Patients were categorised as good or non-responders following 6 months on-drug using EULAR response criteria. Serum lipids were measured using ultra-performance liquid chromatography-mass spectrometry and supervised machine learning methods (including regularized regression, support vector machine and random forest) were used to predict EULAR response. Models including lipid levels were compared to models including clinical covariates alone. The best performing classifier including lipid levels (assessed at 4 weeks) was constructed using regularized regression (ROC AUC 0.61 ± 0.02). However, the clinical covariate based model outperformed the classifier including lipid levels when either pre- or on-treatment time-points were investigated (ROC AUC 0.68 ± 0.02). Pre- or early-treatment serum lipid profiles are unlikely to inform classification of MTX response by 6 months with performance adequate for use in RA clinical management.
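The ROC AUC values quoted above can be read as the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder, with ties counted as one half. The sketch below computes that quantity directly from pairwise comparisons. It is an illustration only, not the study's evaluation code; the function name `roc_auc` and the labels and scores are made up.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 (e.g. responder) or 0 (non-responder); ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute ROC AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predictions for six hypothetical patients.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
print(round(roc_auc(labels, scores), 3))
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values of 0.61 and 0.68 are regarded in the abstract as inadequate for clinical use.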