Peer reviewed
  • Neglecting spatial autocorr...
    Ferraciolli, Matheus A.; Bocca, Felipe F.; Rodrigues, Luiz Henrique A.

    Computers and Electronics in Agriculture, June 2019, Volume: 161
    Journal Article

    • We evaluated how auto-correlation affects machine learning sugarcane yield models.
    • We adapted the feature selection RReliefF algorithm for use with auto-correlated data.
    • Naive assumption of data independence leads to underestimated generalization error.
    • Proposed protocol improves estimates of generalization error.
    • Model performance slightly improved without changing the machine learning techniques.

    With the increased application of information technology in agriculture, data is being produced and used on an unprecedented scale. While these advances, combined with machine learning techniques, have benefited yield modeling, most of the current literature on data-driven yield modeling has not yet accounted for potential sources of correlation in the data and instead assumes independence between samples. Under that assumption, random sampling can place correlated samples in different sets, so the sets used for model evaluation are not independent of the data used for training. We implemented a spatially-aware protocol and compared it with the naive approach of assuming independence between samples. The protocols were applied throughout the model development pipeline: data splitting for hold-out sets, feature selection, cross-validation for model adjustment, and model evaluation. Three different machine learning techniques were used to create models in each protocol. The resulting models were evaluated both on the validation set created by each protocol and on a manually created independent set. This independent set ensured there was no auto-correlation between the samples used for modeling and those used for testing. We showed that assuming independence when modeling yield leads to underestimated model errors and to overfitting during model adjustment. Despite better error tracking, the model with the smallest error on the test set was not the model with the smallest validation error, suggesting overfitting during model selection. While this effect was small for the spatially-aware protocol, it was much stronger in the naive protocol.
Future efforts in yield modeling should address the effect of spatial auto-correlation and other potential sources of correlation to improve correctness and robustness of the results.
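
The difference between the two splitting strategies can be illustrated with a minimal sketch. The abstract does not say how the spatially-aware protocol groups samples, so the version below swaps in a simple assumption: samples are assigned to square grid blocks and whole blocks go to either the training or the test set. The function names, the `x`/`y` sample fields, and the block size are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def block_id(x, y, block_size):
    """Map a coordinate to the square grid cell (spatial block) containing it."""
    return (int(x // block_size), int(y // block_size))


def naive_split(samples, test_fraction, seed=0):
    """Random split that treats every sample as independent (the naive protocol)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def spatial_block_split(samples, test_fraction, block_size, seed=0):
    """Split by whole spatial blocks so neighbouring samples stay together.

    Assumption for illustration only: grid blocking stands in for whatever
    grouping the paper's spatially-aware protocol actually uses.
    """
    blocks = defaultdict(list)
    for s in samples:
        blocks[block_id(s["x"], s["y"], block_size)].append(s)
    keys = sorted(blocks)
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test_blocks = max(1, int(round(len(keys) * test_fraction)))
    test_keys = set(keys[:n_test_blocks])
    train = [s for k in keys if k not in test_keys for s in blocks[k]]
    test = [s for k in keys if k in test_keys for s in blocks[k]]
    return train, test


def shared_blocks(train, test, block_size):
    """Return the blocks that contribute samples to both sets (possible leakage)."""
    in_train = {block_id(s["x"], s["y"], block_size) for s in train}
    in_test = {block_id(s["x"], s["y"], block_size) for s in test}
    return in_train & in_test


if __name__ == "__main__":
    # Synthetic field: a 10 x 10 lattice of sample points, grouped into 5-unit blocks.
    samples = [{"x": float(i), "y": float(j)} for i in range(10) for j in range(10)]
    for name, (train, test) in {
        "naive": naive_split(samples, 0.25),
        "spatial": spatial_block_split(samples, 0.25, block_size=5.0),
    }.items():
        print(name, "blocks shared by train and test:",
              len(shared_blocks(train, test, 5.0)))
```

With the block split, no block contributes to both sets, whereas a random split will usually scatter each block's samples across training and test data. That mixing is the mechanism by which the naive protocol can underestimate generalization error.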