• We present an exhaustive evaluation of Guided Regularized Random Forest (GRRF), a feature selection method based on Random Forest.
• GRRF does not require fixing a priori the number of features to be selected or setting a threshold on the feature importance.
• GRRF features provide results similar to (or slightly better than) those obtained with all the features.
• Comparing GRRF and RF features, the mean overall accuracy increases by almost 6% in classification and the RMSE decreases by almost 2% in regression.
New Earth observation missions and technologies are delivering large amounts of data. Processing these data requires developing and evaluating novel dimensionality reduction approaches to identify the most informative features for classification and regression tasks. Here we present an exhaustive evaluation of Guided Regularized Random Forest (GRRF), a feature selection method based on Random Forest. GRRF does not require fixing a priori the number of features to be selected or setting a threshold on the feature importance. Moreover, the use of regularization ensures that the features selected by GRRF are non-redundant and representative. Our experiments, based on various kinds of remote sensing images, show that GRRF-selected features provide results similar to those obtained when using all the available features. However, the comparison between GRRF and standard Random Forest features shows substantial differences: in classification, the mean overall accuracy increases by almost 6% and, in regression, the RMSE decreases by almost 2%. These results demonstrate the potential of GRRF for remote sensing image classification and regression, especially in the context of increasingly large geodatabases that challenge the application of traditional methods.
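The abstract does not spell out how the regularization works, so the following is only a minimal sketch of the guided penalty as it is commonly described for GRRF: a feature that has not yet been selected has its split gain multiplied by a coefficient derived from its importance in a preliminary Random Forest, whereas an already selected feature keeps its full gain. The function names, the `gamma` and `base_lambda` parameters, and the toy numbers are illustrative assumptions, not the configuration evaluated in the paper.

```python
def guided_penalties(importances, gamma=0.5, base_lambda=1.0):
    """Per-feature penalty coefficients from preliminary importance scores.

    Assumed form: lambda_i = (1 - gamma) * base_lambda + gamma * imp_i / max(imp).
    """
    max_imp = max(importances)
    if max_imp <= 0:
        raise ValueError("at least one importance score must be positive")
    return [(1.0 - gamma) * base_lambda + gamma * (imp / max_imp)
            for imp in importances]


def regularized_gain(gain, feature, selected, penalties):
    """Penalize the gain only for features not selected at earlier nodes."""
    return gain if feature in selected else penalties[feature] * gain


def pick_split_feature(gains, selected, penalties):
    """Choose the feature with the highest regularized gain and record it."""
    best = max(range(len(gains)),
               key=lambda j: regularized_gain(gains[j], j, selected, penalties))
    selected.add(best)
    return best


# Toy example (hypothetical numbers): three features, two consecutive tree nodes.
importances = [10.0, 5.0, 1.0]
penalties = guided_penalties(importances, gamma=1.0)
selected = set()
first = pick_split_feature([0.30, 0.50, 0.90], selected, penalties)
second = pick_split_feature([0.20, 0.50, 0.10], selected, penalties)
print(penalties, first, second, sorted(selected))
```

Because the selected set only grows when a new feature beats the already selected ones despite its penalty, the number of features is an outcome of the procedure rather than an input, which is the property the abstract highlights.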
Random cross-validation (CV) is often used to evaluate geospatial machine learning models, particularly when only a limited amount of sample data is available and collecting an extra test set is unfeasible. However, the prediction locations can be substantially different from the available sample, leading to over-optimistic evaluation results. This has fostered the development of spatial CV methods. Yet these methods focus only on spatial autocorrelation and cannot sufficiently guarantee that the validation subset is a good proxy for a test set that differs significantly from the sample. In this paper, we propose the spatial+ cross-validation (SP-CV) method. This method, which considers both the geographic and feature spaces, is composed of two stages. The first stage addresses spatial autocorrelation issues by using agglomerative hierarchical clustering to divide the available sample into blocks. The second stage deals with multiple sources of differences. It uses cluster ensembles to split the blocks into training and validation folds based on the locations of the sample data and the values of the covariates and target variable. The proposed method is compared against random and block CV methods in a series of experiments with Amazon basin aboveground biomass and California house price datasets. Our results show that SP-CV yielded the smallest error differences with respect to the reference error. This means that SP-CV produced more representative splits and led to more reliable model evaluations. It suggests that reliable model evaluation requires considering both the geographic and the feature spaces in a comprehensive manner.
• We propose a novel cross-validation method, SP-CV, for evaluating models.
• SP-CV splits samples by considering both the geographic and feature spaces.
• SP-CV can produce more reasonable fold splits.
• SP-CV can provide more accurate evaluation results.
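The first stage described above (grouping samples into spatial blocks by agglomerative hierarchical clustering) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses naive average-linkage clustering on coordinates, and it assigns blocks to folds round-robin, which amounts to plain block CV. It does not implement the cluster-ensemble second stage of SP-CV, which also uses the covariates and the target variable. All function names and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def agglomerative_blocks(coords, n_blocks):
    """Merge the two closest clusters (average linkage) until n_blocks remain."""
    clusters = [[i] for i in range(len(coords))]

    def avg_dist(a, b):
        total = sum(math.dist(coords[i], coords[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_blocks:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]

    labels = [0] * len(coords)
    for block_id, members in enumerate(clusters):
        for i in members:
            labels[i] = block_id
    return labels


def block_folds(labels, n_folds):
    """Yield (train_idx, val_idx) pairs; each block lies entirely on one side."""
    for fold in range(n_folds):
        val = [i for i, b in enumerate(labels) if b % n_folds == fold]
        train = [i for i, b in enumerate(labels) if b % n_folds != fold]
        yield train, val


# Toy sample locations (hypothetical): three well-separated groups of points.
coords = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
labels = agglomerative_blocks(coords, n_blocks=3)
folds = list(block_folds(labels, n_folds=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Keeping whole blocks on one side of each split is what limits leakage through spatial autocorrelation. Deciding which blocks go to which fold is where SP-CV departs from this sketch.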
Tick populations and tick-borne infections have steadily increased since the mid-1990s, posing an ever-increasing risk to public health. Yet, modelling tick dynamics remains challenging because of the lack of data and knowledge on this complex phenomenon. Here we present an approach to model and map tick dynamics using volunteered data. This approach is illustrated with 9 years of data collected by a group of trained volunteers who sampled active questing ticks (AQT) on a monthly basis at 15 locations in the Netherlands. We aimed to find the main environmental drivers of AQT at multiple time-scales, and to devise daily AQT maps at the national level for 2014.
Tick dynamics is a complex ecological problem driven by biotic (e.g. pathogens, wildlife, humans) and abiotic (e.g. weather, landscape) factors. We enriched the volunteered AQT collection with six types of weather variables (aggregated at 11 temporal scales), three types of satellite-derived vegetation indices, land cover, and mast years. Then, we applied a feature engineering process to derive a set of 101 features characterizing the conditions that yielded a particular count of AQT on a given date and location. To devise models predicting the AQT, we used a time-aware Random Forest regression method, which is suitable for finding non-linear relationships in complex ecological problems and provides an estimate of the most important features for predicting the AQT.
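To illustrate what aggregating a weather variable at several temporal scales can look like, the sketch below computes trailing means over windows of different lengths ending on the sampling day. It is an assumption-laden example: the window lengths, the use of the mean as the aggregate, the inclusion of the sampling day in the window, and the variable names are all hypothetical, since the abstract does not specify the 11 scales or the aggregation functions.

```python
def trailing_mean(series, day, window):
    """Mean of the last `window` daily values up to and including `day`."""
    start = max(0, day - window + 1)  # shorter window at the start of the record
    values = series[start:day + 1]
    return sum(values) / len(values)


def build_features(weather, day, windows=(1, 3, 7)):
    """One feature per (variable, window) pair, e.g. 'temp_mean_7d'."""
    return {
        f"{name}_mean_{w}d": trailing_mean(series, day, w)
        for name, series in weather.items()
        for w in windows
    }


# Toy daily records (hypothetical values) for two weather variables.
weather = {
    "temp": [10.0, 12.0, 14.0, 16.0],
    "rh": [80.0, 70.0, 60.0, 50.0],
}
features = build_features(weather, day=3, windows=(1, 2, 4))
for key, value in features.items():
    print(key, value)
```

Each sampling date and location then becomes one row of such features, which a regression model can relate to the observed AQT count.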
We trained a model capable of fitting AQT with low error metrics. The multi-temporal study of the feature importance indicates that variables linked to water levels in the atmosphere (i.e. evapotranspiration, relative humidity) consistently showed a higher explanatory power than the temperature variables used in previous works. As a product of this study, we are able to map daily tick dynamics at the national level.
This study paves the way towards the design of new applications in the fields of environmental research, nature management, and public health. It also illustrates how Citizen Science initiatives produce geospatial data collections that can support scientific analysis, thus enabling the monitoring of complex environmental phenomena.