Unconditional Quantile Regressions Firpo, Sergio; Fortin, Nicole M.; Lemieux, Thomas
Econometrica,
20/May, Volume:
77, Issue:
3
Journal Article
Peer reviewed
Open access
We propose a new regression method to evaluate the impact of changes in the distribution of the explanatory variables on quantiles of the unconditional (marginal) distribution of an outcome variable. The proposed method consists of running a regression of the (recentered) influence function (RIF) of the unconditional quantile on the explanatory variables. The influence function, a widely used tool in robust estimation, is easily computed for quantiles, as well as for other distributional statistics. Our approach, thus, can be readily generalized to other distributional statistics.
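The RIF regression described in this abstract can be illustrated with a minimal sketch. It assumes the standard recentered influence function of a quantile, RIF(y; q_tau) = q_tau + (tau - 1{y <= q_tau}) / f_Y(q_tau), estimates the density f_Y with a Gaussian kernel, and regresses the RIF on a single covariate by OLS. The helper names, the bandwidth rule, and the simulated data are illustrative assumptions, not the authors' implementation.

```python
import math
import random
import statistics


def sample_quantile(values, tau):
    """Empirical tau-quantile using a simple order-statistic definition."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(tau * len(ordered)) - 1))
    return ordered[index]


def kernel_density_at(values, point):
    """Gaussian kernel density estimate at `point`.

    The bandwidth is a normal-reference rule of thumb (an assumption of this sketch).
    """
    n = len(values)
    bandwidth = 1.06 * statistics.stdev(values) * n ** (-0.2)
    total = sum(math.exp(-0.5 * ((v - point) / bandwidth) ** 2) for v in values)
    return total / (n * bandwidth * math.sqrt(2.0 * math.pi))


def rif_quantile(values, tau):
    """Recentered influence function of the tau-quantile, one value per observation."""
    q = sample_quantile(values, tau)
    f_q = kernel_density_at(values, q)
    return [q + (tau - (1.0 if v <= q else 0.0)) / f_q for v in values]


def ols_slope_intercept(x, y):
    """Closed-form OLS of y on a constant and one regressor x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Simulated data (hypothetical): a pure location shift, y = 1 + 2x + noise.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

rif = rif_quantile(y, tau=0.5)
slope, intercept = ols_slope_intercept(x, rif)
print(f"RIF-OLS slope at the median: {slope:.3f}")
```

In this location-shift design a change in x moves every quantile of y by the same amount, so the estimated slope should land near 2, up to sampling noise and the smoothing error of the density estimate.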
Statistical regression analysis is a powerful and reliable method to determine the impact of one or several independent variable(s) on a dependent variable. It is the most widely used of all statistical methods and has broad applicability to numerous practical problems. However, various problems can arise, for instance when the sample size is too small, distributional assumptions are not fulfilled, the relationship between independent and dependent variables is vague, or there is an ambiguity of events. Moreover, the complexity of real-life problems often makes the underlying models inadequate, since information is frequently imprecise in many ways. To relax these rigidities, numerous researchers have modified and extended concepts of statistical regression analysis by means of concepts of fuzzy set theory. By now, there is a large number of papers on the topic of fuzzy regression analysis, especially concerning possibilistic, fuzzy least squares or machine learning approaches. Additionally, the variety of approaches includes probabilistic, logistic, type-2 and clusterwise fuzzy regression methods, among many others. Besides papers mainly devoted to advances in methodology, there are also several papers presenting case studies in various research fields. To structure this diversity of papers, proposals and applications, we give in this paper a comprehensive systematic review and provide a bibliography on the topic of fuzzy regression analysis. Thus, the paper intends to consolidate the topic in order to aid new researchers in this area, focus the field’s attention on key open questions, and highlight possible directions for future research.
• Comprehensive systematic review on the topic of fuzzy regression analysis.
• Structuring and categorizing the diversity of papers, proposals and practical applications.
• Extensive bibliography of 455 relevant articles.
• Critical discussion of the presented methods and approaches.
• Several directions for fruitful future research.
Fast Calibrated Additive Quantile Regression Fasiolo, Matteo; Wood, Simon N.; Zaffran, Margaux ...
Journal of the American Statistical Association,
07/2021, Volume:
116, Issue:
535
Journal Article
Peer reviewed
Open access
We propose a novel framework for fitting additive quantile regression models, which provides well-calibrated inference about the conditional quantiles and fast automatic estimation of the smoothing parameters, for model structures as diverse as those usable with distributional generalized additive models, while maintaining equivalent numerical efficiency and stability. The proposed methods are at once statistically rigorous and computationally efficient, because they are based on the general belief updating framework of Bissiri, Holmes, and Walker to loss based inference, but compute by adapting the stable fitting methods of Wood, Pya, and Säfken. We show how the pinball loss is statistically suboptimal relative to a novel smooth generalization, which also gives access to fast estimation methods. Further, we provide a novel calibration method for efficiently selecting the "learning rate" balancing the loss with the smoothing priors during inference, thereby obtaining reliable quantile uncertainty estimates. Our work was motivated by a probabilistic electricity load forecasting application, used here to demonstrate the proposed approach. The methods described here are implemented by the qgam R package, available on the Comprehensive R Archive Network (CRAN).
Supplementary materials for this article are available online.
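For readers unfamiliar with the pinball loss mentioned in this abstract, the following minimal sketch shows the standard (non-smooth) version and checks on a toy sample that the constant minimizing its average is the corresponding sample quantile. It does not reproduce the smooth generalization proposed in the article or the qgam package; the function names and data are illustrative assumptions.

```python
def pinball_loss(y, prediction, tau):
    """Standard pinball (check) loss for quantile level tau."""
    residual = y - prediction
    # Under-predictions are weighted by tau, over-predictions by (1 - tau).
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def mean_pinball_loss(values, prediction, tau):
    """Average pinball loss of a constant prediction over a sample."""
    return sum(pinball_loss(v, prediction, tau) for v in values) / len(values)


# Toy sample (hypothetical). Search over the observed values for the best constant.
values = [1.0, 2.0, 3.0, 4.0, 10.0]
tau = 0.7
best = min(values, key=lambda candidate: mean_pinball_loss(values, candidate, tau))
print(best)
```

Because the loss is piecewise linear, a minimizer can always be found among the observed values, which is why the search above is restricted to them.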
When the outcome is binary, psychologists often use nonlinear modeling strategies such as logit or probit. These strategies are often neither optimal nor justified when the objective is to estimate causal effects of experimental treatments. Researchers need to take extra steps to convert logit and probit coefficients into interpretable quantities, and when they do, these quantities often remain difficult to understand. Odds ratios, for instance, are described as obscure in many textbooks (e.g., Gelman & Hill, 2006, p. 83). I draw on econometric theory and established statistical findings to demonstrate that linear regression is generally the best strategy to estimate causal effects of treatments on binary outcomes. Linear regression coefficients are directly interpretable in terms of probabilities and, when interaction terms or fixed effects are included, linear regression is safer. I review the Neyman-Rubin causal model, which I use to prove analytically that linear regression yields unbiased estimates of treatment effects on binary outcomes. Then, I run simulations and analyze existing data on 24,191 students from 56 middle schools (Paluck, Shepherd, & Aronow, 2013) to illustrate the effectiveness of linear regression. On these grounds, I recommend that psychologists use linear regression to estimate treatment effects on binary outcomes.
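The claim that linear regression coefficients are directly interpretable as probabilities can be checked with a small sketch: with a single binary treatment indicator, the OLS slope equals the difference in outcome proportions between treated and control units. The simulated data and assumed response probabilities (0.5 and 0.3) are hypothetical; this is neither the author's simulation nor the Paluck et al. data.

```python
import random
import statistics


def ols_slope(x, y):
    """Closed-form OLS slope of y on a constant and one regressor x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical randomized experiment with a binary outcome.
rng = random.Random(1)
treated = [rng.randint(0, 1) for _ in range(1000)]
outcome = [1 if rng.random() < (0.5 if t == 1 else 0.3) else 0 for t in treated]

slope = ols_slope(treated, outcome)

# Difference in outcome proportions between the two arms.
treated_outcomes = [o for o, t in zip(outcome, treated) if t == 1]
control_outcomes = [o for o, t in zip(outcome, treated) if t == 0]
difference = statistics.fmean(treated_outcomes) - statistics.fmean(control_outcomes)

print(f"OLS slope: {slope:.4f}, difference in proportions: {difference:.4f}")
```

The two printed numbers agree up to floating-point error, so the coefficient reads as a change in the probability of the outcome, with no conversion from log-odds required.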
Machine learning (ML) techniques have been used for crop monitoring and yield estimation/prediction from remotely sensed data. However, these methods have been investigated less for yield prediction of some crops, such as silage maize, which can be cultivated at various times in different fields of an area. Inconsistency between fields in satellite-derived normalized difference vegetation index (NDVI) temporal profiles can cause difficulties for yield prediction methods that use time series of remotely sensed data. Therefore, this research investigated silage maize yield prediction based on time series of NDVI data derived from Landsat 8 OLI. This paper employed advanced ML techniques, including boosted regression tree (BRT), random forest regression (RFR), support vector regression, and Gaussian process regression (GPR), and compared their performance with some proposed conventional regression methods. For this purpose, the NDVI values of all silage maize fields were averaged and integrated to produce a two-dimensional dataset for each year. The ML techniques were run 100 times, and their evaluation metrics were used to assess their performance and analyze their stability. Finally, all the results of each ML technique were averaged to produce silage maize yields. Comparison of the results indicates that the BRT technique, with an average R value higher than 0.87, outperforms the others for all years. It was followed by RFR, which performed almost the same as the GPR technique. This research demonstrated that some advanced ML approaches can predict silage maize yield and are less sensitive to inconsistency of NDVI time series. The results also showed that RFR was the most stable method for predicting the maize yield in 2015 when it was trained using the 2013-2014 dataset.
To estimate the dynamic effects of an absorbing treatment, researchers often use two-way fixed effects regressions that include leads and lags of the treatment. We show that in settings with variation in treatment timing across units, the coefficient on a given lead or lag can be contaminated by effects from other periods, and apparent pretrends can arise solely from treatment effects heterogeneity. We propose an alternative estimator that is free of contamination, and illustrate the relative shortcomings of two-way fixed effects regressions with leads and lags through an empirical application.
Linear regressions with period and group fixed effects are widely used to estimate treatment effects. We show that they estimate weighted sums of the average treatment effects (ATE) in each group and period, with weights that may be negative. Due to the negative weights, the linear regression coefficient may for instance be negative while all the ATEs are positive. We propose another estimator that solves this issue. In the two applications we revisit, it is significantly different from the linear regression estimator.
We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Pólya–Gamma distributions, which are constructed in detail. A variety of examples are presented to show the versatility of the method, including logistic regression, negative binomial regression, nonlinear mixed-effect models, and spatial models for count data. In each case, our data-augmentation strategy leads to simple, effective methods for posterior inference that (1) circumvent the need for analytic approximations, numerical integration, or Metropolis–Hastings; and (2) outperform other known data-augmentation strategies, both in ease of use and in computational efficiency. All methods, including an efficient sampler for the Pólya–Gamma distribution, are implemented in the R package BayesLogit. Supplementary materials for this article are available online.
Seasonal variations (SVs) affect the population density (PD), fate, and fitness of pathogens in environmental water resources, as well as their public health impacts. Therefore, this study aimed to apply machine learning intelligence (MLI) to predict the impacts of SVs on P. shigelloides population density (PDP) in the aquatic milieu. Physicochemical events (PEs) and PDP from three rivers, acquired via standard microbiological and instrumental techniques across seasons, were fitted to MLI algorithms (linear regression (LR), multiple linear regression (MR), random forest (RF), gradient boosted machine (GBM), neural network (NN), K-nearest neighbour (KNN), boosted regression tree (BRT), extreme gradient boosting (XGB) regression, support vector regression (SVR), decision tree regression (DTR), M5 pruned regression (M5P), artificial neural network (ANN) regression (with one 10-node hidden layer (ANN10), two 6- and 4-node hidden layers (ANN64), and two 5- and 5-node hidden layers (ANN55)), and elastic net regression (ENR)) to assess the implications of the SVs of PEs on aquatic PDP. The results showed that SVs significantly influenced PDP and PEs in the water (p < 0.0001), exhibiting a site-specific pattern. While MLI algorithms predicted PDP with differing absolute flux magnitudes for the contributing variables, DTR predicted the highest PDP value of 1.707 log unit, followed by XGB (1.637 log unit), but XGB (mean-squared-error (MSE) = 0.0025; root-mean-squared-error (RMSE) = 0.0501; R² = 0.998; medium absolute deviation (MAD) = 0.0275) outperformed other models in terms of regression metrics. Temperature and total suspended solids (TSS) ranked first and second as significant factors in predicting PDP in 53.3% (8/15) and 40% (6/15), respectively, of the models, based on the RMSE loss after permutations.
Additionally, season ranked third among the 7 models, and turbidity (TBS) ranked fourth at 26.7% (4/15), as the primary significant factor for predicting PDP in the aquatic milieu. The results of this investigation demonstrated that MLI predictive modelling techniques can promisingly be exploited to complement the repetitive laboratory-based monitoring of PDP and other pathogens, especially in low-resource settings, in response to seasonal fluxes and can provide insights into the potential public health risks of emerging pathogens and TSS pollution (e.g., nanoparticles and micro- and nanoplastics) in the aquatic milieu. The model outputs provide low-cost and effective early warning information to assist watershed managers and fish farmers in making appropriate decisions about water resource protection, aquaculture management, and sustainable public health protection.
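The ranking of predictors "based on the RMSE loss after permutations" refers to permutation feature importance. A minimal sketch of the idea follows, assuming a hypothetical already-fitted model and simulated data: each feature is shuffled in turn, and the resulting increase in RMSE measures how much the model relies on it. The `predict` function, feature names, and coefficients are illustrative assumptions and are unrelated to the study's fitted models or measurements.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def predict(row):
    """Hypothetical fitted model: uses temperature and TSS, ignores turbidity."""
    return 2.0 * row["temperature"] + 0.5 * row["tss"]


def permutation_importance(rows, y_true, feature, rng, repeats=20):
    """Mean increase in RMSE after shuffling one feature column."""
    baseline = rmse(y_true, [predict(r) for r in rows])
    increases = []
    for _ in range(repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen feature replaced.
        permuted_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        permuted_rmse = rmse(y_true, [predict(r) for r in permuted_rows])
        increases.append(permuted_rmse - baseline)
    return sum(increases) / repeats


# Simulated data (hypothetical units and ranges).
rng = random.Random(42)
rows = [
    {
        "temperature": rng.uniform(20.0, 30.0),
        "tss": rng.uniform(0.0, 50.0),
        "turbidity": rng.uniform(0.0, 10.0),
    }
    for _ in range(300)
]
y = [2.0 * r["temperature"] + 0.5 * r["tss"] + rng.gauss(0.0, 0.5) for r in rows]

importance = {
    name: permutation_importance(rows, y, name, rng)
    for name in ("temperature", "tss", "turbidity")
}
for name, value in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{name}: RMSE increase {value:.3f}")
```

Features the model does not use, such as turbidity in this toy setup, show no RMSE increase, while features it depends on show a clear loss in accuracy when shuffled.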
• Machine learning (ML) models were built for predicting Plesiomonas density (PDP).
• ML regression models predicted PDP with different abilities.
• The XGB & RF models displayed good performance/regression metrics in PDP forecasting.
• Temperature, season, and total suspended solids (TSS) had a great influence on PDP.
• ML models are promising for watershed and aquaculture management.