Abstract
Machine learning methods, and in particular random forests, are promising approaches for prediction based on high-dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE).
In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta.
In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings.
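The idea behind a permutation approach to variable selection can be sketched as follows. This is a minimal illustration, not the procedure evaluated in the study: it assumes the absolute Pearson correlation as a stand-in for random forest importance, and the data, function names and the 0.05 cut-off are hypothetical.

```python
import random

def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5

def permutation_pvalues(columns, y, n_perm=200, seed=1):
    """Empirical p-value per variable: how often does the importance under a
    permuted outcome reach the importance observed with the real outcome?"""
    rng = random.Random(seed)
    observed = [abs_corr(col, y) for col in columns]
    exceed = [0] * len(columns)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any association with the predictors
        for j, col in enumerate(columns):
            if abs_corr(col, y_perm) >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps p-values strictly positive.
    return [(e + 1) / (n_perm + 1) for e in exceed]

# Hypothetical data: one informative variable followed by four noise variables.
rng = random.Random(0)
y = [rng.gauss(0, 1) for _ in range(100)]
signal = [v + rng.gauss(0, 0.1) for v in y]
noise = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(4)]

pvals = permutation_pvalues([signal] + noise, y)
selected = [j for j, p in enumerate(pvals) if p < 0.05]
print(pvals, selected)
```

With a real random forest, the importance score would be recomputed from a forest refitted to each permuted outcome, which is what makes such approaches computationally demanding.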
Yield losses caused by fungal pathogens represent a major threat to global food production. One of the most devastating fungal wheat pathogens is Zymoseptoria tritici. Despite the importance of this fungus, the underlying mechanisms of plant-pathogen interactions are poorly understood. Here we present a conceptual framework based on coinfection assays, comparative metabolomics, and microbiome profiling to study the interaction of Z. tritici in susceptible and resistant wheat. We demonstrate that Z. tritici suppresses the production of immune-related metabolites in a susceptible cultivar. Remarkably, this fungus-induced immune suppression spreads within the leaf and even to other leaves, a phenomenon that we term "systemic induced susceptibility". Using a comparative metabolomics approach, we identify defense-related biosynthetic pathways that are suppressed and induced in susceptible and resistant cultivars, respectively. We show that these fungus-induced changes correlate with changes in the wheat leaf microbiome. Our findings suggest that immune suppression by this hemibiotrophic pathogen impacts specialized plant metabolism, alters its associated microbial communities, and renders wheat vulnerable to further infections.
Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF where the network information is summarized into a sampling probability of predictor variables which is further used in the construction of the RF.
Our simulation results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent from genes in the given network, spurious gene selection results can occur when using network information, especially on hub genes. Our empirical analysis on two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes.
Gene networks can provide additional information to aid gene expression analysis for disease module and pathway identification. However, they need to be used with caution, and the results need to be validated to guard against spurious gene selection. More robust approaches to incorporate such information into RF construction also warrant further study.
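How network information might be summarized into a sampling probability of predictor variables can be sketched as follows. The weighting scheme (node degree plus one), the toy network and all function names are assumptions made for illustration; the abstract does not specify how the probabilities are derived.

```python
import random

def degree_sampling_probs(genes, edges):
    """Sampling probability proportional to (node degree + 1).
    The '+ 1' keeps isolated genes selectable; this weighting is an assumption."""
    degree = {g: 0 for g in genes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    weights = {g: degree[g] + 1 for g in genes}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def draw_candidates(probs, mtry, rng):
    """Draw mtry distinct candidate predictors for one split, without
    replacement, favouring genes with a higher sampling probability."""
    remaining = dict(probs)
    chosen = []
    for _ in range(mtry):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[g] for g in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen

# Hypothetical network: G1 is a hub, G5 is isolated.
genes = ["G1", "G2", "G3", "G4", "G5"]
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4")]

probs = degree_sampling_probs(genes, edges)
candidates = draw_candidates(probs, mtry=2, rng=random.Random(0))
print(probs, candidates)
```

In this toy example the hub gene G1 receives the largest sampling probability regardless of its relation to the outcome, which illustrates why hub genes are prone to spurious selection when the network is uninformative.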
Abstract
In longitudinal studies, variables are measured repeatedly over time, leading to clustered and correlated observations. If the goal of the study is to develop prediction models, machine learning approaches such as the powerful random forest (RF) are often promising alternatives to standard statistical methods, especially in the context of high-dimensional data. In this paper, we review extensions of the standard RF method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate response longitudinal data and further categorize the repeated measurements according to whether the time effect is relevant. Even though most extensions are proposed for low-dimensional data, some can be applied to high-dimensional data. Information on available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions.
Case-only (CO) studies are a powerful means to uncover gene-environment (G × E) interactions for complex human diseases. Moreover, such studies may in principle also draw upon genotype imputation to increase statistical power even further. However, genotype imputation usually employs healthy controls such as the Haplotype Reference Consortium (HRC) data as an imputation base, which may systematically perturb CO studies in genomic regions with main effects upon disease risk. Using genotype data from 719 German Crohn Disease (CD) patients, we investigated the level of imputation accuracy achievable for single nucleotide polymorphisms (SNPs) with or without a genetic main effect, and with varying minor allele frequency (MAF). Genotypes were imputed from neighbouring SNPs at different levels of linkage disequilibrium (LD) to the target SNP using the HRC data as an imputation base. Comparison of the true and imputed genotypes revealed lower imputation accuracy for SNPs with strong main effects. We also simulated different levels of G × E interaction to evaluate the potential loss of statistical validity and power incurred by the use of imputed genotypes. Simulations under the null hypothesis revealed that genotype imputation does not inflate the type I error rate of CO studies of G × E. However, the statistical power was found to be reduced by imputation, particularly for SNPs with low MAF, and a gradual loss of statistical power resulted when the level of LD to the SNPs driving the imputation decreased. Our study thus highlights that genotype imputation should be employed with great care in CO studies of G × E interaction.
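The logic of a CO analysis can be illustrated with a small worked example. Under the usual assumptions of the design (genotype and exposure are independent in the source population, and the disease is rare), the odds ratio between genotype and exposure among cases estimates the multiplicative G × E interaction. The counts below are hypothetical, the function name is illustrative, and the Wald interval is one common choice rather than the method used in the study.

```python
import math

def case_only_interaction(a, b, c, d, z_crit=1.959964):
    """Case-only estimate of multiplicative G x E interaction.

    Counts among cases only:
      a: carriers, exposed       b: carriers, unexposed
      c: non-carriers, exposed   d: non-carriers, unexposed
    Returns the odds ratio with a Wald 95% confidence interval.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z_crit * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, not taken from the study.
odds_ratio, lower, upper = case_only_interaction(a=60, b=140, c=100, d=400)
print(f"interaction OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the estimate depends only on genotype counts among cases, any systematic error in imputed genotypes feeds directly into the interaction estimate.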
Variability of gene expression in humans may link gene sequence variability and phenotypes; however, non-genetic variations, alone or in combination with genetics, may also influence expression traits and have a critical role in physiological and disease processes.
To get better insight into the overall variability of gene expression, we assessed the transcriptome of circulating monocytes, a key cell involved in immunity-related diseases and atherosclerosis, in 1,490 unrelated individuals and investigated its association with >675,000 SNPs and 10 common cardiovascular risk factors. Out of 12,808 expressed genes, 2,745 expression quantitative trait loci were detected (P < 5.78 × 10⁻¹²), most of them (90%) being cis-modulated. Extensive analyses showed that associations identified by genome-wide association studies of lipids, body mass index or blood pressure were rarely compatible with a mediation by monocyte expression level at the locus. At a study-wide level (P < 3.9 × 10⁻⁷), 1,662 expression traits (13.0%) were significantly associated with at least one risk factor. Genome-wide interaction analyses suggested that genetic variability and risk factors mostly acted additively on gene expression. Because of the structure of correlation among expression traits, the variability of risk factors could be characterized by a limited set of independent gene expressions which may have biological and clinical relevance. For example, expression traits associated with cigarette smoking were more strongly associated with carotid atherosclerosis than smoking itself.
This study demonstrates that the monocyte transcriptome is a potent integrator of genetic and non-genetic influences of relevance for disease pathophysiology and risk assessment.
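The two significance thresholds reported above are numerically consistent with a Bonferroni correction at a family-wise error rate of 0.05. This is an inference from the numbers, not something the abstract states, and the SNP count is only given as a lower bound (">675,000"), so the arithmetic below is a sketch under those assumptions.

```python
alpha = 0.05          # assumed family-wise error rate
n_genes = 12_808      # expressed genes (from the text)
n_snps = 675_000      # text says ">675,000"; exact count not given
n_risk_factors = 10   # cardiovascular risk factors (from the text)

# Assumed Bonferroni correction over all gene-SNP pairs.
eqtl_threshold = alpha / (n_genes * n_snps)
# Assumed Bonferroni correction over all gene-risk-factor pairs.
risk_threshold = alpha / (n_genes * n_risk_factors)
# Share of expression traits associated with at least one risk factor.
share_associated = 100 * 1_662 / n_genes

print(f"eQTL threshold:        {eqtl_threshold:.2e}")
print(f"risk factor threshold: {risk_threshold:.1e}")
print(f"associated traits:     {share_associated:.1f}%")
```

The computed values round to 5.78 × 10⁻¹², 3.9 × 10⁻⁷ and 13.0%, matching the figures quoted in the abstract.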
Purpose
Evaluation of water material density images (wMDIm) of dual-energy CT (DECT) for earlier prediction of final infarct volume (fiV) in follow-up single-energy CT (SECT) and correlation with clinical outcome.
Methods
Fifty patients (69 years, ± 12.1, 40–90, 50% female) with middle cerebral artery (MCA) occlusions were included. Early infarct volumes were analyzed in monoenergetic images (MonoIm) and wMDIm at 60 keV and compared with the fiV in SECT 4.9 days (± 4) after thrombectomy. Association between infarct volume and functional outcome was tested by linear regression analysis.
Results
wMDIm shows an earlier visible infarct demarcation (60.7 ml, ± 74.9 ml) compared with the MonoIm (37.57 ml, ± 76.7 ml). Linear regression analysis, Bland–Altman plots and Pearson correlation coefficients show a close correlation of infarct volume in wMDIm to the fiV in SECT (r = 0.86; 95% CI 0.76–0.92), compared with MonoIm and SECT (r = 0.81; 95% CI 0.69–0.89). The agreement with SECT is substantially higher in patients with infarct volumes < 70 ml (n = 33; 66%). Coefficients were smaller with r = 0.59 (95% CI 0.31–0.78) for MonoIm and SECT compared with r = 0.77 (95% CI 0.57–0.88) for wMDIm and SECT. At admission, the mean NIHSS score and mRS were 17.02 (± 4.7) and 4.9 (± 0.2). mRS ≤ 2 was achieved in 56% at 90 days with a mean mRS of 2.5 (± 0.8) at discharge.
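The confidence intervals reported for the correlation coefficients are consistent with the standard Fisher z-transformation. The abstract does not name the method, so the following is a sketch under that assumption, using the sample size of 50 patients given in the Methods; the function name is illustrative.

```python
import math

def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson correlation
    via the Fisher z-transformation (assumes bivariate normality)."""
    z = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo_wmd, hi_wmd = pearson_ci(0.86, 50)    # wMDIm versus SECT
lo_mono, hi_mono = pearson_ci(0.81, 50)  # MonoIm versus SECT
print(f"wMDIm:  {lo_wmd:.2f} to {hi_wmd:.2f}")
print(f"MonoIm: {lo_mono:.2f} to {hi_mono:.2f}")
```

For r = 0.86 and r = 0.81 with n = 50, this gives intervals of roughly 0.76 to 0.92 and 0.69 to 0.89, in line with the reported values.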
Conclusion
Material decomposition allows earlier visibility of the final infarct volume. This promises an earlier evaluation of the dimension and severity of infarction and may lead to faster initiation of secondary stroke prophylaxis.
The diagnosis of inflammatory bowel disease (IBD) still remains a clinical challenge and the most accurate diagnostic procedure is a combination of clinical tests including invasive endoscopy. In this study we evaluated whether systematic miRNA expression profiling, in conjunction with machine learning techniques, is suitable as a non-invasive test for the major IBD phenotypes (Crohn's disease (CD) and ulcerative colitis (UC)). Based on microarray technology, expression levels of 863 miRNAs were determined for whole blood samples from 40 CD and 36 UC patients and compared to data from 38 healthy controls (HC). To further discriminate between disease-specific and general inflammation we included miRNA expression data from other inflammatory diseases (inflammation controls (IC): 24 chronic obstructive pulmonary disease (COPD), 23 multiple sclerosis, 38 pancreatitis and 45 sarcoidosis cases) as well as 70 healthy controls from previous studies. Classification problems considering 2, 3 or 4 groups were solved using different types of penalized support vector machines (SVMs). The resulting models were assessed regarding sparsity and performance and a subset was selected for further investigation. Measured by the area under the ROC curve (AUC) the corresponding median holdout-validated accuracy was estimated as ranging from 0.75 to 1.00 (including IC) and 0.89 to 0.98 (excluding IC), respectively. In combination, the corresponding models provide tools for the distinction of CD and UC as well as CD, UC and HC with expected classification error rates of 3.1 and 3.3%, respectively. These results were obtained by incorporating not more than 16 distinct miRNAs. Validated target genes of these miRNAs have been previously described as being related to IBD. For others we observed significant enrichment for IBD susceptibility loci identified in earlier GWAS. These results suggest that the proposed miRNA signature is of relevance for the etiology of IBD. Its diagnostic value, however, should be further evaluated in large, independent, clinically well characterized cohorts.
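For readers unfamiliar with the AUC used as the performance measure above, the following minimal sketch computes it for a two-group problem as the probability that a randomly chosen case receives a higher classifier score than a randomly chosen control. The scores are hypothetical and the function name is illustrative; no SVM is fitted here.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve as the fraction of case-control pairs
    in which the case scores higher; ties count as half a win."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical decision values on a holdout set.
cases = [0.9, 0.8, 0.7, 0.4]
controls = [0.6, 0.3, 0.2, 0.1]

holdout_auc = auc(cases, controls)
print(holdout_auc)  # 15 of 16 pairs are ordered correctly: 0.9375
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups on the holdout data.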
Purpose
Interleukin-6 (IL-6) production and signalling are increased in the inflamed mucosa in inflammatory bowel diseases (IBD). As published serum levels of IL-6 and its soluble receptors sIL-6R and sgp130 in IBD are from small cohorts and partly contradictory, we systematically evaluated IL-6, sIL-6R and sgp130 levels as markers of disease activity in Crohn’s disease (CD) and ulcerative colitis (UC).
Methods
Consecutive adult outpatients with confirmed CD or UC were included, and their disease activity and medication were monitored. Serum from 212 CD patients (815 measurements) and 166 UC patients (514 measurements) was analysed, and 100 age-matched healthy blood donors were used as controls.
Results
IL-6 serum levels were significantly elevated in active versus inactive CD and UC, also compared with healthy controls. However, only a fraction of IBD patients showed increased serum IL-6. IL-6 levels ranged up to 32.7 ng/mL in active CD (> 5000-fold higher than in controls), but also up to 6.9 ng/mL in inactive CD. Increases in active UC (up to 195 pg/mL) and inactive UC (up to 27 pg/mL) were less pronounced. Associations between IL-6 serum levels and C-reactive protein concentrations as well as leukocyte and thrombocyte counts were observed. Median sIL-6R and sgp130 levels were only increased by up to 15%, which was considered of no diagnostic significance.
Conclusions
Only a minority of IBD patients show elevated IL-6 serum levels. However, in these patients, IL-6 is strongly associated with disease activity. Its soluble receptors sIL-6R and sgp130 do not appear useful as biomarkers in IBD.
Dysbiosis of the gut microbiome is a hallmark of inflammatory bowel disease (IBD), and both IBD risk and microbiome composition have been found to be associated with genetic variation. Using data from families of IBD patients, we examined the association between genetic and microbiome similarity in a specific IBD context, followed by a genome-wide quantitative trait locus (QTL) linkage analysis of various microbiome traits using the same data. SNP genotypes as well as gut microbiome and phenotype data were obtained from the Kiel IBD family cohort (IBD-KC). The IBD-KC is an ongoing prospective study in Germany currently comprising 256 families with 455 IBD patients and 575 first- and second-degree relatives. Initially focusing upon known IBD risk loci, we noted a statistically significant (FDR < 0.05) association between genetic similarity at SNP rs11741861 and overall microbiome dissimilarity among pairs of relatives discordant for IBD. In a genome-wide QTL analysis, 12 chromosomal regions were found to be linked to the abundance of one of seven microbial genera, namely Barnesiella (chromosome 4, region spanning 10.34 cM), Clostridium_XIVa (chr4, 3.86 cM; chr14, two regions spanning 7.05 and 13.02 cM, respectively), Pseudoflavonifractor (chr7, 12.80 cM), Parasutterella (chr14, 8.26 cM), Ruminococcus (chr16, two overlapping regions spanning 8.01 and 16.87 cM, respectively), Roseburia (chr19, 7.99 cM), and Odoribacter (chr22, three regions spanning 0.89, 5.57 and 1.71 cM, respectively), as well as the Shannon index of α diversity (chr3, 1.47 cM). Our study thus shows that, in families of IBD patients, pairwise genetic similarity for at least one IBD risk locus is associated with overall microbiome dissimilarity among discordant pairs of relatives, and that hitherto unknown genetic modifiers of microbiome traits are located in at least 12 human genomic regions.
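The Shannon index of α diversity mentioned above is defined as H = −Σ pᵢ ln pᵢ, where pᵢ is the relative abundance of taxon i in a sample. The following minimal sketch computes it from raw read counts; the counts and the function name are hypothetical, and the natural logarithm is assumed.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical genus-level read counts for two samples.
even_sample = [250, 250, 250, 250]    # four equally abundant genera
skewed_sample = [970, 10, 10, 10]     # one dominant genus

h_even = shannon_index(even_sample)
h_skewed = shannon_index(skewed_sample)
print(h_even, h_skewed)
```

For equally abundant taxa the index reaches its maximum, the logarithm of the number of taxa, and it decreases as a single taxon comes to dominate the sample.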