We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high-dimensional data. Ensembles of classification, regression and survival trees are supported. We describe the implementation, provide examples, validate the package with a reference implementation, and compare runtime and memory usage with other implementations. The new software proves to scale best with the number of features, samples, trees, and features tried for splitting. Finally, we show that ranger is the fastest and most memory-efficient implementation of random forests to analyze data on the scale of a genome-wide association study.
In this article, we propose a novel entropy- and confidence-based undersampling boosting (ECUBoost) framework to solve imbalanced problems. The boosting-based ensemble is combined with a new undersampling method to improve generalization performance. To avoid losing informative samples during the data preprocessing of the boosting-based ensemble, both confidence and entropy are used in ECUBoost as benchmarks to ensure the validity and structural distribution of the majority samples during the undersampling. Furthermore, unlike other iterative dynamic resampling methods, ECUBoost based on confidence can be applied to algorithms without iterations, such as decision trees. Meanwhile, random forests are used as base classifiers in ECUBoost. Finally, experimental results on both artificial data sets and KEEL data sets prove the effectiveness of the proposed method.
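The confidence-and-entropy scoring at the heart of ECUBoost can be illustrated with a short sketch. The function below is a hypothetical simplification, not the paper's exact selection rule: given the class probabilities a base classifier (e.g. a random forest) assigns to majority-class samples, it keeps the fraction the model classifies most confidently, using entropy as a tie-breaker.

```python
import numpy as np

def score_majority_samples(proba, keep_ratio=0.5):
    """Rank majority-class samples by confidence and entropy, then keep
    a fraction of them (hypothetical selection rule for illustration)."""
    proba = np.asarray(proba, dtype=float)
    confidence = proba.max(axis=1)                             # how sure the model is
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)   # prediction uncertainty
    # Example combination: prefer samples the model is confident about,
    # breaking ties toward lower entropy.
    order = np.lexsort((entropy, -confidence))
    k = max(1, int(len(proba) * keep_ratio))
    return np.sort(order[:k])

# Toy probabilities for five majority samples in a binary problem.
proba = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.55, 0.45], [0.99, 0.01]]
kept = score_majority_samples(proba, keep_ratio=0.4)
print(kept)  # indices of the two most confidently classified samples
```

In a full ECUBoost-style pipeline this scoring would be repeated on the majority class before each boosting round, so only valid, well-structured majority samples enter the training set.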
As an ensemble model that consists of many independent decision trees, random forests generate predictions by feeding the input to internal trees and summarizing their outputs. The ensemble nature of the model helps random forests outperform any individual decision tree. However, it also leads to poor model interpretability, which significantly hinders the model from being used in fields that require transparent and explainable predictions, such as medical diagnosis and financial fraud detection. The interpretation challenges stem from the variety and complexity of the contained decision trees. Each decision tree has its own unique structure and properties, such as the features used in the tree and the feature threshold in each tree node. Thus, a data input may lead to a variety of decision paths. To understand how a final prediction is reached, it is necessary to understand and compare all decision paths in the context of all tree structures, which is a huge challenge for any user. In this paper, we propose a visual analytics system aimed at interpreting random forest models and predictions. In addition to providing users with all the tree information, we summarize the decision paths in random forests, which reflects the working mechanism of the model and reduces users' mental burden of interpretation. To demonstrate the effectiveness of our system, two usage scenarios and a qualitative user study are conducted.
Over the past decades, classification models have proven to be essential machine learning tools given their potential and applicability in various domains. For years, the primary goal of most researchers was to improve quantitative metrics, notwithstanding how little information about models' decisions such metrics convey. This paradigm has recently shifted, and strategies beyond tables and numbers to assist in interpreting models' decisions are increasing in importance. As part of this trend, visualization techniques have been extensively used to support classification models' interpretability, with a significant focus on rule-based models. Despite the advances, the existing approaches present limitations in terms of visual scalability, and the visualization of large and complex models, such as the ones produced by the Random Forest (RF) technique, remains a challenge. In this paper, we propose Explainable Matrix (ExMatrix), a novel visualization method for RF interpretability that can handle models with massive quantities of rules. It employs a simple yet powerful matrix-like visual metaphor, where rows are rules, columns are features, and cells are rule predicates, enabling the analysis of entire models and the auditing of classification results. ExMatrix's applicability is confirmed via different examples, showing how it can be used in practice to promote the interpretability of RF models.
The understanding of many physical and engineering problems involves running complex computational models. Such models take as input a large number of numerical and physical explanatory variables. The information on these underlying input parameters is often limited or uncertain. It is therefore important, based on the relationships between the input variables and the output, to identify and prioritize the most influential inputs. One may use global sensitivity analysis (GSA) methods, which aim at ranking input random variables according to their importance in the output uncertainty, or even at quantifying the global influence of a particular input on the output. Using sensitivity metrics to ignore less important parameters is a form of dimension reduction in the model's input parameter space. This suggests the use of meta-modeling as a quantitative approach for nonparametric GSA, where the original input/output relation is first approximated using various statistical regression techniques. The main goal of our work is thus to provide a comprehensive review in the domain of sensitivity analysis, focusing on some interesting connections between random forests and GSA. The idea is to use the random forests methodology as an efficient non-parametric approach for building meta-models that allow an efficient sensitivity analysis. Apart from its easy applicability to regression problems, the random forests approach presents further strong advantages: its ability to implicitly deal with correlation and high-dimensional data, to handle interactions between variables, and to identify informative inputs using a permutation-based RF variable importance index that is easy and fast to compute. We further review an adequate set of tools for quantifying variable importance, which are then exploited to reduce the model's dimension, enabling otherwise infeasible sensitivity analysis studies.
Numerical results from several simulations and a data exploration on a real dataset are presented to illustrate the effectiveness of such an approach.
•Global Sensitivity Analysis ranks inputs according to their importance on output.
•Random Forests is an efficient non-parametric approach for building meta-models.
•Random forests variable importance measure is used to define sensitivity measures.
•Provide a comprehensive review paper in sensitivity analysis using random forests.
•Focus on connections between random forests and Global Sensitivity Analysis.
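The permutation-based importance index underlying this approach can be sketched in a few lines: permute one input column, breaking its link to the output, and measure how much prediction error grows. In the sketch below, a stand-in predictor (the true input/output relation) replaces the fitted random forest meta-model; that substitution is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic model: the output depends strongly on x1, only weakly on x2.
X = rng.normal(size=(2000, 2))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

# Stand-in for a fitted meta-model (a random forest in the paper);
# here the true function itself plays that role.
predict = lambda X: 3.0 * X[:, 0] + 0.1 * X[:, 1]

def permutation_importance(X, y, predict, j):
    """Increase in MSE when column j is permuted, i.e. decoupled from y."""
    base = np.mean((predict(X) - y) ** 2)
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((predict(Xp) - y) ** 2) - base

imp = [permutation_importance(X, y, predict, j) for j in range(2)]
print(imp)  # importance of x1 is far larger than that of x2
```

Ranking inputs by this index, then dropping those with negligible importance, is the dimension-reduction step that makes the subsequent sensitivity analysis tractable.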
Inflation forecasting is an important but difficult task. Here, we explore advances in machine learning (ML) methods and the availability of new datasets to forecast U.S. inflation. Despite the skepticism in the previous literature, we show that ML models with a large number of covariates are systematically more accurate than the benchmarks. The ML method that deserves the most attention is the random forest model, which dominates all other models. Its good performance is due not only to its specific method of variable selection but also to the potential nonlinearities between past key macroeconomic variables and inflation.
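A generic way to set up such a forecasting exercise is to build, for each date, a design row from lagged inflation and the current values of the covariates, with inflation several steps ahead as the target. The function below is a hedged sketch of this direct-forecast construction, not the paper's exact specification; the names and the toy data are illustrative.

```python
import numpy as np

def make_direct_forecast_design(series, covariates, n_lags, horizon):
    """Build (X, y) for a direct h-step-ahead forecast: each row stacks
    the last n_lags values of the series and the covariates observed at
    that date; the target is the series value `horizon` steps later."""
    T = len(series)
    rows, targets = [], []
    for t in range(n_lags - 1, T - horizon):
        lags = series[t - n_lags + 1 : t + 1]
        rows.append(np.concatenate([lags, covariates[t]]))
        targets.append(series[t + horizon])
    return np.array(rows), np.array(targets)

# Toy monthly inflation series plus two macro covariates per month.
infl = np.arange(24, dtype=float) * 0.1
covs = np.ones((24, 2))
X, y = make_direct_forecast_design(infl, covs, n_lags=4, horizon=1)
print(X.shape, y.shape)  # (20, 6) (20,)
```

A random forest regressor (or any other ML model) would then be trained on (X, y), one model per forecast horizon.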
Supplementary materials for this article are available online.
Machine learning algorithms have very high predictive ability. However, no study has used machine learning to estimate historical concentrations of PM2.5 (particulate matter with aerodynamic diameter ≤ 2.5 μm) at a daily time scale in China at a national level.
To estimate daily concentrations of PM2.5 across China during 2005–2016.
Daily ground-level PM2.5 data were obtained from 1479 stations across China during 2014–2016. Data on aerosol optical depth (AOD), meteorological conditions and other predictors were downloaded. A random forests model (a non-parametric machine learning algorithm) and two traditional regression models were developed to estimate ground-level PM2.5 concentrations. The best-fit model was then utilized to estimate the daily concentrations of PM2.5 across China with a resolution of 0.1° (≈10 km) during 2005–2016.
The daily random forests model showed much higher predictive accuracy than the two traditional regression models, explaining the majority of spatial variability in daily PM2.5 (10-fold cross-validation (CV) R2 = 83%, root mean squared prediction error (RMSE) = 28.1 μg/m3). At the monthly and annual time scales, the explained variability of average PM2.5 increased to 86% (RMSE = 10.7 μg/m3 and 6.9 μg/m3, respectively).
Taking advantage of a novel modeling framework and the most recent ground-level PM2.5 observations, the machine learning method showed higher predictive ability than previous studies.
The random forests approach can be used to estimate historical exposure to PM2.5 in China with high accuracy.
•Historical exposure to PM2.5 across China during 2005–2016 was estimated using AOD.
•The random forests model explained 83% of variability of ground measured PM2.5.
•The machine learning method showed higher predictive ability than previous studies.
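The 10-fold cross-validation metrics reported in this study (CV R2 and RMSE) can be computed with a short generic routine. In the sketch below, a simple least-squares predictor stands in for the random forests model, and the data are synthetic; both are assumptions made purely to illustrate how out-of-fold R2 and RMSE are obtained.

```python
import numpy as np

rng = np.random.default_rng(42)

def kfold_cv_metrics(X, y, fit_predict, k=10):
    """Out-of-fold R2 and RMSE. `fit_predict(X_tr, y_tr, X_te)` stands in
    for training a model on k-1 folds and predicting the held-out fold."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    pred = np.empty_like(y)
    for te in folds:
        tr = np.setdiff1d(idx, te)
        pred[te] = fit_predict(X[tr], y[tr], X[te])
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    return r2, rmse

# Toy data: a PM2.5-like target driven by one AOD-like predictor plus noise.
X = rng.normal(size=(500, 1))
y = 50 + 20 * X[:, 0] + rng.normal(scale=5, size=500)

# Simple least-squares stand-in for the random forests model.
def fit_predict(X_tr, y_tr, X_te):
    A = np.c_[np.ones(len(X_tr)), X_tr]
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.c_[np.ones(len(X_te)), X_te] @ coef

r2, rmse = kfold_cv_metrics(X, y, fit_predict, k=10)
print(round(r2, 2), round(rmse, 1))
```

Replacing `fit_predict` with a random forest trained on AOD, meteorological and land-use predictors would reproduce the study's evaluation protocol.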
Brain tumor detection is an active area of research in brain image processing. In this work, a methodology is proposed to segment and classify brain tumors using magnetic resonance images (MRI). A Deep Neural Network (DNN) based architecture is employed for tumor segmentation. In the proposed model, seven layers are used for classification, consisting of three convolutional layers, three ReLU layers, and a softmax layer. First, the input MR image is divided into multiple patches, and then the center pixel value of each patch is supplied to the DNN. The DNN assigns labels according to the center pixels and performs the segmentation. Extensive experiments are performed using eight large-scale benchmark datasets, including BRATS 2012 (image dataset and synthetic dataset), 2013 (image dataset and synthetic dataset), 2014, 2015 and ISLES (Ischemic Stroke Lesion Segmentation) 2015 and 2017. The results are validated on accuracy (ACC), sensitivity (SE), specificity (SP), Dice Similarity Coefficient (DSC), precision, false positive rate (FPR), true positive rate (TPR) and Jaccard similarity index (JSI).
•A new light-weight Deep Neural Networks approach for brain tumor segmentation.
•Extensive evaluation of proposed model on eight challenging big datasets.
•Proposed work achieves state-of-the-art accuracy on these benchmark datasets.
•Comparison of presented work with sixteen existing techniques in the same domain.
•Better results by proposed method without incurring heavy computational burden.
Random Forests (RFs) and Gradient Boosting Machines (GBMs) are popular approaches for habitat suitability modelling in environmental flow assessment. However, both present some limitations theoretically solved by alternative tree-based ensemble techniques (e.g. conditional RFs or oblique RFs). Among them, eXtreme Gradient Boosting machines (XGBoost) have proven to be another promising technique that mixes subroutines developed for RFs and GBMs. To inspect the capabilities of these alternative techniques, RFs and GBMs were compared with conditional RFs, oblique RFs and XGBoost by modelling, at the micro-scale, the habitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.). XGBoost outperformed the other approaches, particularly conditional and oblique RFs, although there were no statistical differences with standard RFs and GBMs. The partial dependence plots highlighted the lacustrine origins of pumpkinseed and the preference of bleak for lentic habitats. However, the latter showed a larger tolerance for rapid microhabitats found in run-type river segments, which is likely to hinder the management of flow regimes to control its invasion. The difference in computational burden and, especially, the characteristics of datasets on microhabitat use (low data prevalence and high overlap between categories) led us to conclude that, in the short term, XGBoost is not destined to replace properly optimised RFs and GBMs in the process of habitat suitability modelling at the micro-scale.
•Five tree-based ensemble techniques are compared.
•XGBoost outperforms Random Forests (RFs) & Gradient Boosting Machines (GBMs).
•Oblique RFs over-fit the training data.
•Habitat preferences impede the development of counteracting e-flow regimes.
•There is no conclusive advantage of XGBoost to replace other tree-based techniques.
Few studies have estimated historical exposures to PM10 at a national scale in China using satellite-based aerosol optical depth (AOD). Also, long-term trends have not been investigated.
In this study, daily concentrations of PM10 over China during the past 12 years were estimated with the most recent ground monitoring data, AOD, land use information, weather data and a machine learning approach.
Daily measurements of PM10 during 2014–2016 were collected from 1479 sites in China. Two types of Moderate Resolution Imaging Spectroradiometer (MODIS) AOD data, land use information, and weather data were downloaded and merged. A random forests model (a non-parametric machine learning algorithm) and two traditional regression models were developed and their predictive abilities were compared. The best model was applied to estimate daily concentrations of PM10 across China during 2005–2016 at a 0.1° (≈10 km) resolution.
Cross-validation showed our random forests model explained 78% of daily variability of PM10 (root mean squared prediction error (RMSE) = 31.5 μg/m3). When aggregated into monthly and annual averages, the model captured 82% (RMSE = 19.3 μg/m3) and 81% (RMSE = 14.4 μg/m3) of the variability, respectively. The random forests model showed much higher predictive ability and lower bias than the other two regression models. Based on the predictions of the random forests model, around one-third of China experienced PM10 pollution exceeding the Grade Ⅱ National Ambient Air Quality Standard (>70 μg/m3) during the past 12 years. The highest levels of estimated PM10 were present in the Taklamakan Desert of Xinjiang and the Beijing-Tianjin metropolitan region, while the lowest were observed in Tibet, Yunnan and Hainan. Overall, the PM10 level in China peaked in 2006 and 2007, and has declined since 2008.
This is the first study to estimate historical PM10 pollution using satellite-based AOD data in China with a random forests model. The results can be applied to investigate the long-term health effects of PM10 in China.
•Random forests can be successfully used to predict levels of PM10 with AOD.
•The random forests model explained 78% of daily variability of PM10.
•One-third of China experienced high PM10 pollution during the past 12 years.
•The highest levels of estimated PM10 were present in the Taklamakan Desert.