The revival of the Gini importance? Nembrini, Stefano; König, Inke R; Wright, Marvin N
Bioinformatics, 11/2018, Volume 34, Issue 21
Journal Article
Peer reviewed
Open access
Abstract
Motivation
Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms is their variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.
Results
We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.
Availability and implementation
The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger.
Supplementary information
Supplementary data are available at Bioinformatics online.
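The split-point bias named in the Motivation paragraph above can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not the ranger implementation and not the debiasing procedure proposed in the paper; the function names (`gini`, `best_gini_decrease`, `mean_decrease`) and all parameter values are hypothetical choices made for this illustration. Both simulated predictors are pure noise, yet the one offering more candidate split points tends to achieve a larger best impurity decrease.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(x, y):
    """Largest impurity decrease over all threshold splits of the form x <= t."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    # The largest observed value cannot separate the node, so it is skipped.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


def mean_decrease(n_levels, n_samples=200, n_reps=200, seed=1):
    """Average best decrease for a noise predictor with n_levels distinct values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Outcome and predictor are drawn independently: no true association.
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        x = [rng.randrange(n_levels) for _ in range(n_samples)]
        total += best_gini_decrease(x, y)
    return total / n_reps


binary_noise = mean_decrease(n_levels=2)        # one possible split point
many_valued_noise = mean_decrease(n_levels=50)  # up to 49 possible split points

print(f"binary noise predictor:      {binary_noise:.4f}")
print(f"many-valued noise predictor: {many_valued_noise:.4f}")
```

Because the best split is a maximum over all candidate thresholds, a predictor with more thresholds has more chances to find a spuriously good split. Summing such decreases over a forest is what inflates the impurity importance of many-valued variables.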
What is precision medicine? König, Inke R; Fuchs, Oliver; Hansen, Gesine ...
The European Respiratory Journal, Volume 50, Issue 4
Journal Article
Peer reviewed
Open access
The term "precision medicine" has become very popular over recent years, fuelled by scientific as well as political perspectives. Despite its popularity, its exact meaning, and how it is different from other popular terms such as "stratified medicine", "targeted therapy" or "deep phenotyping", remain unclear. Commonly applied definitions focus on the stratification of patients, sometimes referred to as a novel taxonomy, and this is derived using large-scale data including clinical, lifestyle, genetic and further biomarker information, thus going beyond the classical "signs-and-symptoms" approach. While these aspects are relevant, this description leaves open a number of questions. For example, when does precision medicine begin? In which way does the stratification of patients translate into better healthcare? And can precision medicine be viewed as the end-point of a novel stratification of patients, as implied, or is it rather a greater whole? To clarify this, the aim of this paper is to provide a more comprehensive definition that focuses on precision medicine as a process. It will be shown that this proposed framework incorporates the derivation of novel taxonomies and their role in healthcare as part of the cycle, but also covers related terms.
Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. By capturing interactions, we mean the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such.
Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only.
Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most of the cases, interactions are masked by marginal effects and interactions cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.
One component of precision medicine is to construct prediction models with their predictive ability as high as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal and machine learning methods are appealing to make individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in the realm of genetic epidemiology. One big class of algorithms to detect interacting features is based on the multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction empowered individual prediction.
Using a comprehensive simulation study we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely the Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. As our new algorithm is not only applicable to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well, and we made our method available as an R package (https://github.com/imbs-hl/MBMDRClassifieR).
The explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.
After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis.
Tubal factor infertility (TFI) accounts for more than 30% of the cases of female infertility and mostly results from an inflammatory process triggered by an infection. Clinical appearances largely differ, and very often infections are not recognized or remain completely asymptomatic over time. Here, we characterized the microbial pattern in females diagnosed with infectious infertility (ININF) in comparison to females with non-infectious infertility (nININF), female sex workers (FSW) and healthy controls (fertile). Females diagnosed with infectious infertility differed significantly in the seroprevalence of IgG antibodies against the C. trachomatis proteins MOMP, OMP2, CPAF and HSP60 when compared to fertile females. Microbiota analysis using 16S amplicon sequencing of cervical swabs revealed significant differences between ININF and fertile controls in the relative read count of Gardnerella (10.08% vs. 5.43%). Alpha diversity varies among groups, which are characterized by community state types including Lactobacillus-dominated communities in fertile females, an increase in diversity in all the other groups and Gardnerella-dominated communities occurring more often in ININF. While no single parameter allowed infection to be predicted as the cause of infertility, combining C. trachomatis IgG/IgA status with 16S rRNA gene analysis of the ten most frequent taxa correctly classified a total of 93.8% of the females. Further studies are needed to unravel the impact of the cervical microbiota in the pathogenesis of infectious infertility and its potential for identifying females at risk earlier in life.
In this randomized trial involving patients with out-of-hospital cardiac arrest without ST-segment elevation on postresuscitation electrocardiography, no benefit was found for immediate cardiac catheterization as compared with delayed or selective catheterization.
Motivation
Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene–gene and gene–environment interactions when small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms are detected by permutation importance measures. To this day, the application to GWA data was highly cumbersome with existing implementations because of the high computational burden.
Results
Here, we present the new freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available. It includes new options such as the backward elimination method. We illustrate the application of RJ to a GWA of Crohn's disease. The most important single nucleotide polymorphisms (SNPs) validate recent findings in the literature and reveal potential interactions.
Availability
The RJ software package is freely available at http://www.randomjungle.org
Contact
inke.koenig@imbs.uni-luebeck.de; ziegler@imbs.uni-luebeck.de
Supplementary information
Supplementary data are available at Bioinformatics online.
There is increasing evidence for a role of inflammation in Parkinson's disease. Recent research in murine models suggests that parkin and PINK1 deficiency leads to impaired mitophagy, which causes the release of mitochondrial DNA (mtDNA), thereby triggering inflammation. Specifically, the CGAS (cyclic GMP-AMP synthase)-STING (stimulator of interferon genes) pathway mitigates activation of the innate immune system, quantifiable as increased interleukin-6 (IL6) levels. However, the role of IL6 and circulating cell-free mtDNA in unaffected and affected individuals harbouring mutations in PRKN/PINK1 and idiopathic Parkinson's disease patients remains elusive. We investigated IL6, C-reactive protein, and circulating cell-free mtDNA in serum of 245 participants in two cohorts from tertiary movement disorder centres. We performed a hypothesis-driven rank-based statistical approach adjusting for multiple testing. We detected (i) elevated IL6 levels in patients with biallelic PRKN/PINK1 mutations compared to healthy control subjects in a German cohort, supporting the concept of a role for inflammation in PRKN/PINK1-linked Parkinson's disease. In addition, the comparison of patients with biallelic and heterozygous mutations in PRKN/PINK1 suggests a gene dosage effect. The differences in IL6 levels were validated in a second independent Italian cohort; (ii) a correlation between IL6 levels and disease duration in carriers of PRKN/PINK1 mutations, while no such association was observed for idiopathic Parkinson's disease patients. These results highlight the potential of IL6 as a progression marker in Parkinson's disease due to PRKN/PINK1 mutations; (iii) increased circulating cell-free mtDNA serum levels in both patients with biallelic or with heterozygous PRKN/PINK1 mutations compared to idiopathic Parkinson's disease, which is in line with previous findings in murine models. By contrast, circulating cell-free mtDNA concentrations in unaffected heterozygous carriers of PRKN/PINK1 mutations were comparable to control levels; and (iv) that circulating cell-free mtDNA levels have good predictive potential to discriminate between idiopathic Parkinson's disease and Parkinson's disease linked to heterozygous PRKN/PINK1 mutations, providing functional evidence for a role of heterozygous mutations in PRKN or PINK1 as a Parkinson's disease risk factor. Taken together, our study further implicates inflammation due to impaired mitophagy and subsequent mtDNA release in the pathogenesis of PRKN/PINK1-linked Parkinson's disease. In individuals carrying mutations in PRKN/PINK1, IL6 and circulating cell-free mtDNA levels may serve as markers of Parkinson's disease state and progression, respectively. Finally, our study suggests that targeting the immune system with anti-inflammatory medication holds the potential to influence the disease course of Parkinson's disease, at least in this subset of patients.