The revival of the Gini importance? Nembrini, Stefano; König, Inke R; Wright, Marvin N
Bioinformatics, 11/2018, Volume: 34, Issue: 21
Journal Article
Peer reviewed
Open access
Abstract
Motivation
Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms is the availability of variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.
Results
We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.
Availability and implementation
The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger.
Supplementary information
Supplementary data are available at Bioinformatics online.
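The bias toward variables with many possible split points, described in the Motivation above, can be illustrated with a small simulation. The sketch below is not the debiasing procedure implemented in ranger; it only assumes pure-noise features and binary labels, and all function names are hypothetical. It compares the best Gini impurity decrease achievable by a two-level feature with that of a ten-level feature when neither is related to the outcome.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary 0/1 labels: 1 - p0^2 - p1^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)


def best_gini_decrease(x, y):
    """Largest impurity decrease over all candidate splits of the form x <= t."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


def mean_null_decrease(n_levels, n_samples=100, n_reps=200, seed=1):
    """Average best decrease for a noise feature with n_levels distinct values."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    total = 0.0
    for _ in range(n_reps):
        x = [rng.randrange(n_levels) for _ in range(n_samples)]
        y = [rng.randrange(2) for _ in range(n_samples)]  # labels independent of x
        total += best_gini_decrease(x, y)
    return total / n_reps


binary_feature = mean_null_decrease(2)
ten_level_feature = mean_null_decrease(10)
print(f"2 levels:  {binary_feature:.4f}")
print(f"10 levels: {ten_level_feature:.4f}")
```

Because the ten-level feature offers nine candidate thresholds instead of one, its best split looks better on average even though both features are noise. The dependence on minor allele frequency mentioned in the abstract is not covered by this sketch.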
What is precision medicine? König, Inke R; Fuchs, Oliver; Hansen, Gesine ...
The European respiratory journal, Volume: 50, Issue: 4
Journal Article
Peer reviewed
Open access
The term "precision medicine" has become very popular over recent years, fuelled by scientific as well as political perspectives. Despite its popularity, its exact meaning, and how it is different from other popular terms such as "stratified medicine", "targeted therapy" or "deep phenotyping" remains unclear. Commonly applied definitions focus on the stratification of patients, sometimes referred to as a novel taxonomy, and this is derived using large-scale data including clinical, lifestyle, genetic and further biomarker information, thus going beyond the classical "signs-and-symptoms" approach. While these aspects are relevant, this description leaves open a number of questions. For example, when does precision medicine begin? In which way does the stratification of patients translate into better healthcare? And can precision medicine be viewed as the end-point of a novel stratification of patients, as implied, or is it rather a greater whole? To clarify this, the aim of this paper is to provide a more comprehensive definition that focuses on precision medicine as a process. It will be shown that this proposed framework incorporates the derivation of novel taxonomies and their role in healthcare as part of the cycle, but also covers related terms.
Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. By capturing an interaction, we mean the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such.
Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only.
Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most of the cases, interactions are masked by marginal effects and interactions cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.
Motivation: Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene–gene and gene–environment interactions when small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms are detected by permutation importance measures. To this day, the application to GWA data was highly cumbersome with existing implementations because of the high computational burden. Results: Here, we present the new freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available. It includes new options such as the backward elimination method. We illustrate the application of RJ to a GWA of Crohn's disease. The most important single nucleotide polymorphisms (SNPs) validate recent findings in the literature and reveal potential interactions. Availability: The RJ software package is freely available at http://www.randomjungle.org Contact: inke.koenig@imbs.uni-luebeck.de; ziegler@imbs.uni-luebeck.de Supplementary information: Supplementary data are available at Bioinformatics online.
One component of precision medicine is to construct prediction models with as high a predictive ability as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal and machine learning methods are appealing to make individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in the realm of genetic epidemiology. One big class of algorithms to detect interacting features is based on the multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction empowered individual prediction.
Using a comprehensive simulation study we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely the Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. As our new algorithm is not only applicable to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well, and we made our method available as an R package ( https://github.com/imbs-hl/MBMDRClassifieR ).
The explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.
There is increasing evidence for a role of inflammation in Parkinson's disease. Recent research in murine models suggests that parkin and PINK1 deficiency leads to impaired mitophagy, which causes the release of mitochondrial DNA (mtDNA), thereby triggering inflammation. Specifically, the CGAS (cyclic GMP-AMP synthase)-STING (stimulator of interferon genes) pathway mitigates activation of the innate immune system, quantifiable as increased interleukin-6 (IL6) levels. However, the role of IL6 and circulating cell-free mtDNA in unaffected and affected individuals harbouring mutations in PRKN/PINK1 and idiopathic Parkinson's disease patients remains elusive. We investigated IL6, C-reactive protein, and circulating cell-free mtDNA in serum of 245 participants in two cohorts from tertiary movement disorder centres. We performed a hypothesis-driven rank-based statistical approach adjusting for multiple testing. We detected (i) elevated IL6 levels in patients with biallelic PRKN/PINK1 mutations compared to healthy control subjects in a German cohort, supporting the concept of a role for inflammation in PRKN/PINK1-linked Parkinson's disease. In addition, the comparison of patients with biallelic and heterozygous mutations in PRKN/PINK1 suggests a gene dosage effect. The differences in IL6 levels were validated in a second independent Italian cohort; (ii) a correlation between IL6 levels and disease duration in carriers of PRKN/PINK1 mutations, while no such association was observed for idiopathic Parkinson's disease patients. These results highlight the potential of IL6 as progression marker in Parkinson's disease due to PRKN/PINK1 mutations; (iii) increased circulating cell-free mtDNA serum levels in both patients with biallelic or with heterozygous PRKN/PINK1 mutations compared to idiopathic Parkinson's disease, which is in line with previous findings in murine models.
By contrast, circulating cell-free mtDNA concentrations in unaffected heterozygous carriers of PRKN/PINK1 mutations were comparable to control levels; and (iv) that circulating cell-free mtDNA levels have good predictive potential to discriminate between idiopathic Parkinson's disease and Parkinson's disease linked to heterozygous PRKN/PINK1 mutations, providing functional evidence for a role of heterozygous mutations in PRKN or PINK1 as Parkinson's disease risk factor. Taken together, our study further implicates inflammation due to impaired mitophagy and subsequent mtDNA release in the pathogenesis of PRKN/PINK1-linked Parkinson's disease. In individuals carrying mutations in PRKN/PINK1, IL6 and circulating cell-free mtDNA levels may serve as markers of Parkinson's disease state and progression, respectively. Finally, our study suggests that targeting the immune system with anti-inflammatory medication holds the potential to influence the disease course of Parkinson's disease, at least in this subset of patients.
Abstract
Aims
Transcatheter aortic valve implantation (TAVI) has emerged as an established treatment option in patients with symptomatic aortic stenosis. Technical developments in valve design have addressed previous limitations such as suboptimal deployment, conduction disturbances, and paravalvular leakage. However, there are only limited data available for the comparison of newer generation self-expandable valves (SEV) and balloon-expandable valves (BEV).
Methods and results
SOLVE-TAVI is a multicentre, open-label, 2 × 2 factorial, randomized trial of 447 patients with aortic stenosis undergoing transfemoral TAVI comparing SEV (Evolut R, Medtronic Inc., Minneapolis, MN, USA) with BEV (Sapien 3, Edwards Lifesciences, Irvine, CA, USA). The primary efficacy composite endpoint of all-cause mortality, stroke, moderate/severe prosthetic valve regurgitation, and permanent pacemaker implantation at 30 days was powered for equivalence (equivalence margin 10% with significance level 0.05). The primary composite endpoint occurred in 28.4% of SEV patients and 26.1% of BEV patients, meeting the prespecified criteria of equivalence [rate difference −2.39 (90% confidence interval, CI −9.45 to 4.66); Pequivalence = 0.04]. Event rates for the individual components in SEV vs. BEV patients were as follows: all-cause mortality 3.2% vs. 2.3% [rate difference −0.93 (90% CI −4.78 to 2.92); Pequivalence < 0.001], stroke 0.5% vs. 4.7% [rate difference 4.20 (90% CI 0.12 to 8.27); Pequivalence = 0.003], moderate/severe paravalvular leak 3.4% vs. 1.5% [rate difference −1.89 (90% CI −5.86 to 2.08); Pequivalence = 0.0001], and permanent pacemaker implantation 23.0% vs. 19.2% [rate difference −3.85 (90% CI −10.41 to 2.72); Pequivalence = 0.06].
Conclusion
In patients with aortic stenosis undergoing transfemoral TAVI, newer generation SEV and BEV are equivalent for the primary valve-related efficacy endpoint. These findings support the safe application of these newer generation percutaneous valves in the majority of patients with some specific preferences based on individual valve anatomy.
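The equivalence decisions reported above can be read off the confidence intervals. With a significance level of 0.05, equivalence is declared when the 90% confidence interval for the rate difference lies entirely inside the equivalence margin. The following is a minimal sketch using only the intervals quoted in the abstract; applying the 10% margin to the individual components as well as to the primary endpoint is an assumption made here for illustration, and the function name is hypothetical.

```python
def within_margin(ci_lower, ci_upper, margin):
    """True if the whole interval lies strictly inside (-margin, +margin)."""
    return -margin < ci_lower and ci_upper < margin


MARGIN = 10.0  # equivalence margin in percentage points, as stated for the primary endpoint

# 90% confidence intervals for the rate differences, copied from the abstract.
reported_intervals = {
    "primary composite": (-9.45, 4.66),
    "all-cause mortality": (-4.78, 2.92),
    "stroke": (0.12, 8.27),
    "paravalvular leak": (-5.86, 2.08),
    "permanent pacemaker": (-10.41, 2.72),
}

equivalent = {
    name: within_margin(lower, upper, MARGIN)
    for name, (lower, upper) in reported_intervals.items()
}

for name, flag in equivalent.items():
    print(f"{name}: {'within margin' if flag else 'not within margin'}")
```

Only the pacemaker interval extends beyond the margin, which matches the one reported Pequivalence value above 0.05.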
Tubal factor infertility (TFI) accounts for more than 30% of the cases of female infertility and mostly results from an inflammatory process triggered by an infection. Clinical appearances largely differ, and very often infections are not recognized or remain completely asymptomatic over time. Here, we characterized the microbial pattern in females diagnosed with infectious infertility (ININF) in comparison to females with non-infectious infertility (nININF), female sex workers (FSW) and healthy controls (fertile). Females diagnosed with infectious infertility differed significantly in the seroprevalence of IgG antibodies against the C. trachomatis proteins MOMP, OMP2, CPAF and HSP60 when compared to fertile females. Microbiota analysis using 16S amplicon sequencing of cervical swabs revealed significant differences between ININF and fertile controls in the relative read count of Gardnerella (10.08% vs. 5.43%). Alpha diversity varies among groups, which are characterized by community state types including Lactobacillus-dominated communities in fertile females, an increase in diversity in all the other groups and Gardnerella-dominated communities occurring more often in ININF. While no single parameter allowed infections to be predicted as the cause of infertility, combining C. trachomatis IgG/IgA status with 16S rRNA gene analysis of the ten most frequent taxa correctly classified a total of 93.8% of the females. Further studies are needed to unravel the impact of the cervical microbiota in the pathogenesis of infectious infertility and its potential for identifying females at risk earlier in life.
Coronavirus disease 2019 (COVID‐19) caused by infection with severe acute respiratory syndrome coronavirus 2 was first detected in Wuhan, China, in late 2019 and continues to spread worldwide. Persistent questions remain about the relationship between the severity of COVID‐19 and comorbid diseases, as well as other chronic pulmonary conditions. In this systematic review and meta‐analysis, we aimed to examine in detail whether the underlying chronic obstructive pulmonary diseases (COPD), asthma and chronic respiratory diseases (CRDs) were associated with an increased risk of more severe COVID‐19. A comprehensive literature search was performed using five international search engines. In the initial search, 722 articles were identified. After eliminating duplicate records and further consideration of eligibility criteria, 53 studies with 658,073 patients were included in the final analysis. COPD was present in 5.2% (2191/42,373) of patients with severe COVID‐19 and in 1.4% (4203/306,151) of patients with non‐severe COVID‐19 (random‐effects model; OR = 2.58, 95% CI = 1.99–3.34, Z = 7.15, p < 0.001). CRD was present in 8.6% (3780/44,041) of patients with severe COVID‐19 and in 5.7% (16,057/280,447) of patients with non‐severe COVID‐19 (random‐effects model; OR = 2.14, 95% CI = 1.74–2.64, Z = 7.1, p < 0.001). Asthma was present in 2.3% (1873/81,319) of patients with severe COVID‐19 and in 2.2% (11,796/538,737) of patients with non‐severe COVID‐19 (random‐effects model; OR = 1.13, 95% CI = 0.79–1.60, Z = 0.66, p = 0.50). In conclusion, comorbid COPD and CRD were clearly associated with a higher severity of COVID‐19; however, no association between asthma and severe COVID‐19 was identified.
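The reported Z statistics and p-values can be checked against the pooled odds ratios and their confidence intervals. The sketch below assumes the usual Wald construction on the log odds ratio scale, in which the standard error is recovered from the width of the 95% interval; the abstract does not state that this is how the figures were derived, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def z_and_p_from_ci(odds_ratio, ci_lower, ci_upper, level=0.95):
    """Recover the Wald Z statistic and two-sided p-value from an OR and its CI."""
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    # On the log scale the interval is log(OR) +/- z_crit * SE.
    se_log_or = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se_log_or
    p = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    return z, p


# (OR, lower, upper) as quoted in the abstract.
reported = {
    "COPD": (2.58, 1.99, 3.34),
    "CRD": (2.14, 1.74, 2.64),
    "asthma": (1.13, 0.79, 1.60),
}

recomputed = {name: z_and_p_from_ci(*values) for name, values in reported.items()}

for name, (z, p) in recomputed.items():
    print(f"{name}: Z = {z:.2f}, p = {p:.3g}")
```

The recomputed values land close to the reported ones; small differences are expected because the odds ratios and interval limits are rounded to two decimals.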