The probability density function (PDF) of wind speed is important in numerous wind energy applications. A large number of studies have been published in the scientific literature on renewable energies that propose the use of a variety of PDFs to describe wind speed frequency distributions. In this paper a review of these PDFs is carried out. The flexibility and usefulness of the PDFs in the description of different wind regimes (high frequencies of null winds, unimodal, bimodal and bitangential regimes, etc.) are analysed for a wide collection of models. Likewise, the methods that have been used to estimate the parameters on which these models depend are reviewed, and the degree of complexity of the estimation is analysed as a function of the model selected: these are the method of moments (MM), the maximum likelihood method (MLM) and the least squares method (LSM). In addition, a review is conducted of the statistical tests employed to determine whether a sample of wind data comes from a population with a particular probability distribution. With the purpose of cataloguing the various PDFs, a comparison is made between them and the two-parameter Weibull distribution (W.pdf), which has been the most widely used and accepted distribution in the specialised literature on wind energy and other renewable energy sources. This comparison is based on: (a) an analysis of the degree of fit of the continuous cumulative distribution functions (CDFs) for wind speed to the cumulative relative frequency histograms of hourly mean wind speeds recorded at weather stations located in the Canarian Archipelago; (b) an analysis of the degree of fit of the CDFs for wind power density to the cumulative relative frequency histograms of the cube of hourly mean wind speeds recorded at the aforementioned weather stations. The suitability of the distributions is judged from the coefficient of determination R2.
Amongst the various conclusions obtained, it can be stated that the W.pdf presents a series of advantages with respect to the other PDFs analysed. However, the W.pdf cannot represent all the wind regimes encountered in nature such as, for example, those with high percentages of null wind speeds, bimodal distributions, etc. Therefore, its generalised use is not justified and it will be necessary to select the appropriate PDF for each wind regime in order to minimise errors in the estimation of the energy produced by a WECS (wind energy conversion system). In this sense, the extensive collection of PDFs proposed in this paper comprises a valuable catalogue.
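The comparison procedure described above can be sketched in a few lines: fit the two-parameter W.pdf to a wind speed sample and judge the fit of its CDF to the cumulative relative frequency histogram with R2. This is a minimal illustration on synthetic data (the Canarian station records are not reproduced here); scipy and numpy are assumed.

```python
# Sketch: fit a two-parameter Weibull distribution (W.pdf) to wind speeds
# and judge the fit with R^2, as in the comparison described above.
# Synthetic data, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
v = rng.weibull(2.0, 5000) * 8.0          # hourly mean wind speeds (m/s), synthetic

# Fit shape k and scale c with location fixed at 0 (two-parameter Weibull)
k, loc, c = stats.weibull_min.fit(v, floc=0)

# Compare the fitted CDF with the cumulative relative frequency histogram
bins = np.linspace(0, v.max(), 30)
hist, edges = np.histogram(v, bins=bins)
ecdf = np.cumsum(hist) / v.size           # empirical cumulative frequencies
mcdf = stats.weibull_min.cdf(edges[1:], k, loc=0, scale=c)

ss_res = np.sum((ecdf - mcdf) ** 2)
ss_tot = np.sum((ecdf - ecdf.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                  # close to 1 when the W.pdf is adequate
```

The same loop over a catalogue of candidate PDFs, ranked by R2, reproduces the paper's selection logic in miniature.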
Abstract
Many researchers want to report an $R^{2}$ to measure the variance explained by a model. When the model includes correlation among data, such as phylogenetic models and mixed models, defining an $R^{2}$ faces two conceptual problems. (i) It is unclear how to measure the variance explained by predictor (independent) variables when the model contains covariances. (ii) Researchers may want the $R^{2}$ to include the variance explained by the covariances by asking questions such as “How much of the data is explained by phylogeny?” Here, I investigated three $R^{2}$s for phylogenetic and mixed models. $R^{2}_{resid}$ is an extension of the ordinary least-squares $R^{2}$ that weights residuals by the variances and covariances estimated by the model; it is closely related to the $R^{2}_{glmm}$ presented by Nakagawa and Schielzeth (2013, A general and simple method for obtaining R2 from generalized linear mixed-effects models, Methods Ecol. Evol. 4:133–142). $R^{2}_{pred}$ is based on predicting each residual from the fitted model and computing the variance between observed and predicted values. $R^{2}_{lik}$ is based on the likelihood of fitted models and therefore reflects the amount of information that the models contain. These three $R^{2}$s are formulated as partial $R^{2}$s, making it possible to compare the contributions of predictor variables and variance components (phylogenetic signal and random effects) to the fit of models. Because partial $R^{2}$s compare a full model with a reduced model lacking components of the full model, they are distinct from marginal $R^{2}$s that partition additive components of the variance. I assessed the properties of the $R^{2}$s for phylogenetic models using simulations for continuous and binary response data (phylogenetic generalized least squares and phylogenetic logistic regression).
Because the $R^{2}$s are designed broadly for any model for correlated data, I also compared $R^{2}$s for linear mixed models and generalized linear mixed models. $R^{2}_{resid}$, $R^{2}_{pred}$, and $R^{2}_{lik}$ all have similar performance in describing the variance explained by different components of models. However, $R^{2}_{pred}$ gives the most direct answer to the question of how much variance in the data is explained by a model. $R^{2}_{resid}$ is most appropriate for comparing models fit to different data sets, because it does not depend on sample sizes. And $R^{2}_{lik}$ is most appropriate to assess the importance of different components within the same model applied to the same data, because it is most closely associated with statistical significance tests.
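Of the three measures, a likelihood-based partial $R^{2}$ has a particularly compact Cox–Snell-type form, $R^{2}_{lik} = 1 - \exp\!\big(-\tfrac{2}{n}(\log L_{full} - \log L_{reduced})\big)$, and for ordinary least squares it reduces to the familiar $R^{2}$, which makes a convenient sanity check. The sketch below uses simulated data and closed-form Gaussian likelihoods; it is illustrative, not the paper's code.

```python
# Likelihood-based partial R^2 for nested Gaussian models:
# R2_lik = 1 - exp(-(2/n) * (logL_full - logL_reduced)).
# For plain OLS this equals the ordinary R^2, verified below.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def gaussian_loglik(resid):
    """Profile log-likelihood of a Gaussian model with MLE variance."""
    s2 = np.mean(resid ** 2)
    return -0.5 * len(resid) * (np.log(2 * np.pi * s2) + 1)

# Full model: y ~ 1 + x (closed-form OLS); reduced model: y ~ 1
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
ll_full = gaussian_loglik(y - X @ beta)
ll_red = gaussian_loglik(y - y.mean())

r2_lik = 1 - np.exp(-(2 / n) * (ll_full - ll_red))
r2_ols = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

Replacing the reduced model with any nested sub-model (e.g. one without the phylogenetic variance component) turns the same formula into the partial $R^{2}$ discussed above.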
Regression analysis makes up a large part of supervised machine learning, and consists of the prediction of a continuous target variable from a set of predictor variables. The difference between binary classification and regression lies in the target range: in binary classification the target can take only two values (usually encoded as 0 and 1), while in regression the target can take multiple values. Even though regression analysis has been employed in a huge number of machine learning studies, no consensus has been reached on a single, unified, standard metric to assess the results of the regression itself. Many studies employ the mean square error (MSE) and its rooted variant (RMSE), or the mean absolute error (MAE) and its percentage variant (MAPE). Although useful, these rates share a common drawback: since their values can range between zero and +infinity, a single value of them does not say much about the performance of the regression with respect to the distribution of the ground truth elements. In this study, we focus on two rates that actually generate a high score only if the majority of the elements of a ground truth group has been correctly predicted: the coefficient of determination (also known as R-squared or R2) and the symmetric mean absolute percentage error (SMAPE). After showing their mathematical properties, we report a comparison between R2 and SMAPE in several use cases and in two real medical scenarios. Our results demonstrate that the coefficient of determination (R-squared) is more informative and truthful than SMAPE, and does not have the interpretability limitations of MSE, RMSE, MAE and MAPE. We therefore suggest the usage of R-squared as the standard metric to evaluate regression analyses in any scientific domain.
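The two rates compared above are easy to state directly. A minimal sketch (note that SMAPE definitions vary in the literature; the variant below divides by the mean of the two absolute values and is not scaled by 100):

```python
# Coefficient of determination R^2 and symmetric mean absolute
# percentage error (SMAPE), the two rates compared in the study above.
import numpy as np

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot          # 1 = perfect, 0 = mean-only baseline

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_pred - y_true) / denom)   # 0 = perfect
```

A perfect prediction yields R2 = 1 and SMAPE = 0, while predicting the mean of the ground truth yields R2 = 0, which is the interpretability anchor the abstract argues for.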
Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained to total variance can be summarized by the coefficient of determination or in-sample R2, which is easy to interpret and to compare across different outcome variables. As opposed to in-sample R2, out-of-sample R2 has not been well defined and the variability of the out-of-sample R̂2 has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of the predictability of different outcome variables. Here we explicitly define out-of-sample R2 as a comparison of two predictive models, provide an unbiased estimator, and exploit recent theoretical advances on the uncertainty of data splitting estimates to provide a standard error for R̂2. The performance of the estimators for R2 and its standard error is investigated in a simulation study. We demonstrate our new method by constructing confidence intervals and comparing models for prediction of quantitative Brassica napus and Zea mays phenotypes based on gene expression data. Our method is available in the R-package oosse.
Summary
Intra‐class correlations (ICC) and repeatabilities (R) are fundamental statistics for quantifying the reproducibility of measurements and for understanding the structure of biological variation. Linear mixed effects models offer a versatile framework for estimating ICC and R. However, while point estimation and significance testing by likelihood ratio tests are straightforward, the quantification of uncertainty is not as easily achieved.
A further complication arises when the analysis is conducted on data with non‐Gaussian distributions because the separation of the mean and the variance is less clear‐cut for non‐Gaussian than for Gaussian models. Nonetheless, there are solutions to approximate repeatability for the most widely used families of generalized linear mixed models (GLMMs).
Here, we introduce the R package rptR for the estimation of ICC and R for Gaussian, binomial and Poisson‐distributed data. Uncertainty in estimators is quantified by parametric bootstrapping and significance testing is implemented by likelihood ratio tests and through permutation of residuals. The package allows control for fixed effects and thus the estimation of adjusted repeatabilities (that remove fixed effect variance from the estimate) and enhanced agreement repeatabilities (that add fixed effect variance to the denominator). Furthermore, repeatability can be estimated from random‐slope models. The package features convenient summary and plotting functions.
Besides repeatabilities, the package also allows the quantification of coefficients of determination R2 as well as of raw variance components. We present an example analysis to demonstrate the core features and discuss some of the limitations of rptR.
Summary
The use of both linear and generalized linear mixed‐effects models (LMMs and GLMMs) has become popular not only in social and medical sciences, but also in biological sciences, especially in the field of ecology and evolution. Information criteria, such as the Akaike Information Criterion (AIC), are usually presented as model comparison tools for mixed‐effects models.
The presentation of ‘variance explained’ (R2) as a relevant summarizing statistic of mixed‐effects models, however, is rare, even though R2 is routinely reported for linear models (LMs) and also generalized linear models (GLMs). R2 has the extremely useful property of providing an absolute value for the goodness‐of‐fit of a model, which cannot be given by the information criteria. As a summary statistic that describes the amount of variance explained, R2 can also be a quantity of biological interest.
One reason for the under‐appreciation of R2 for mixed‐effects models lies in the fact that R2 can be defined in a number of ways. Furthermore, most definitions of R2 for mixed‐effects have theoretical problems (e.g. decreased or negative R2 values in larger models) and/or their use is hindered by practical difficulties (e.g. implementation).
Here, we make a case for the importance of reporting R2 for mixed‐effects models. We first provide the common definitions of R2 for LMs and GLMs and discuss the key problems associated with calculating R2 for mixed‐effects models. We then recommend a general and simple method for calculating two types of R2 (marginal and conditional R2) for both LMMs and GLMMs, which are less susceptible to common problems.
This method is illustrated by examples and can be widely employed by researchers in any fields of research, regardless of software packages used for fitting mixed‐effects models. The proposed method has the potential to facilitate the presentation of R2 for a wide range of circumstances.
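For the Gaussian LMM case, the two recommended quantities have simple closed forms in terms of the model's variance components: the marginal R2 uses the fixed-effect variance alone in the numerator, and the conditional R2 adds the random-effect variances. A sketch with assumed (illustrative) component values rather than a fitted model:

```python
# Marginal and conditional R^2 for a Gaussian LMM, from variance components.
# The component values below are illustrative assumptions, not fitted estimates.
var_fixed = 1.2    # variance of the fixed-effect predictions
var_random = 0.8   # summed random-effect variances
var_resid = 2.0    # residual variance

total = var_fixed + var_random + var_resid

# Marginal R^2: variance explained by fixed effects alone
r2_marginal = var_fixed / total
# Conditional R^2: variance explained by fixed and random effects together
r2_conditional = (var_fixed + var_random) / total
```

For GLMMs the residual term is replaced by a distribution-specific observation-level variance, which is the extension the abstract describes.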
Statistical inference, which relies on bootstrapping in partial least squares structural equation modeling (PLS-SEM), lies at the heart of developing practically relevant and academically rigorous theory. Inspection of PLS-SEM applications in European management research reveals that there is still much to be gained in terms of bootstrapping. This paper suggests several bootstrapping best practices and demonstrates how to conduct them for frequently encountered, yet often ignored, PLS-SEM situations such as the assessment of (non)direct effects, the comparison of effects, and the evaluation of the coefficient of determination.
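Bootstrapping the coefficient of determination, as suggested above for PLS-SEM, can be sketched in miniature with a plain OLS model in place of the full structural model: resample cases with replacement, refit, collect R2, and form a percentile confidence interval.

```python
# Percentile bootstrap confidence interval for R^2 (OLS stand-in for the
# structural model; illustrative only).
import numpy as np

rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + rng.normal(size=n)

def ols_r2(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

boot = []
for _ in range(1000):
    i = rng.integers(0, n, n)              # resample cases with replacement
    boot.append(ols_r2(x[i], y[i]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```

Reporting the interval rather than the point estimate alone is the practice the paper advocates; the same resample-and-refit loop applies unchanged when the inner model is a PLS-SEM estimation.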
Abstract
Generalized linear mixed models (GLMMs) have been widely used in contemporary ecology studies. However, determination of the relative importance of collinear predictors (i.e. fixed effects) to response variables is one of the challenges in GLMMs. Here, we developed a novel R package, glmm.hp, to decompose the marginal R2 explained by fixed effects in GLMMs. The algorithm of glmm.hp is based on the recently proposed 'average shared variance' approach used in multivariate analysis. We explain the principle and demonstrate the use of the package with a simulated dataset. The output of glmm.hp gives individual marginal R2s that can be used to evaluate the relative importance of the predictors and that sum to the overall marginal R2. Overall, we believe the glmm.hp package will be helpful in the interpretation of GLMM outcomes.
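The decomposition idea can be shown in miniature for a plain linear model: average each predictor's R2 increment over all subsets of the other predictors (hierarchical partitioning), so that the individual contributions sum to the full-model R2 even when the predictors are collinear. This is an illustrative Python sketch of the principle, not the package's exact GLMM algorithm.

```python
# Hierarchical partitioning of R^2 among collinear predictors (toy version).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)         # deliberately collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
X = {0: x1, 1: x2}

def r2(cols):
    M = np.column_stack([np.ones(n)] + [X[c] for c in cols])
    beta = np.linalg.lstsq(M, y, rcond=None)[0]
    return 1 - np.sum((y - M @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

def independent_effect(j, others):
    # Average increment in R^2 from adding predictor j, over all
    # subsets of the remaining predictors
    incs = []
    for k in range(len(others) + 1):
        for s in combinations(others, k):
            incs.append(r2(set(s) | {j}) - r2(s))
    return float(np.mean(incs))

ie1 = independent_effect(0, [1])
ie2 = independent_effect(1, [0])
total = r2({0, 1})                         # ie1 + ie2 equals total (two predictors)
```

For two predictors the averaged increments sum exactly to the overall R2, which is the additivity property the package's output relies on.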
In Mendelian randomization, two single-SNP, trait-correlation-based methods have been developed to infer the causal direction between an exposure (e.g., a gene) and an outcome (e.g., a trait): MR Steiger's method and its recent extension, Causal Direction-Ratio (CD-Ratio). Here we propose an approach based on R2, the coefficient of determination, to combine information from multiple (possibly correlated) SNPs to simultaneously infer the presence and direction of a causal relationship between an exposure and an outcome. Our proposed method generalizes Steiger's method from using a single SNP to multiple SNPs as instrumental variables (IVs). It is especially useful in transcriptome-wide association studies (TWASs) (and similar applications) with typically small sample sizes for gene expression (or other molecular trait) data, providing a more flexible and powerful approach to inferring causal directions. It can be applied to GWAS summary data with a reference panel. We also discuss the influence of invalid IVs and introduce a new approach, called R2S, to select and remove invalid IVs (if any) to enhance robustness. We compared the performance of the proposed method with existing methods in simulations to demonstrate its advantages. We applied the methods to identify causal genes for high/low-density lipoprotein cholesterol (HDL/LDL) using individual-level GTEx gene expression data and UK Biobank GWAS data. The proposed method was able to confirm some well-known causal genes while identifying some novel ones. Additionally, we illustrated an application of the proposed method to GWAS summary data to infer causal relationships between HDL/LDL and stroke/coronary artery disease (CAD).
To infer the causal direction between an exposure and an outcome, we propose a method based on R2, the coefficient of determination, generalizing Steiger’s method from using a single SNP to multiple (correlated) SNPs. The new method is especially suitable for transcriptome-wide association studies (TWAS) (and similar applications).
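The core of Steiger-type direction inference can be sketched as follows: if the instruments explain markedly more variance in the exposure than in the outcome, the direction exposure → outcome is supported. The toy below builds one causal direction into simulated data; the actual method adds formal tests, handles correlated SNPs, and works from summary statistics.

```python
# Direction inference by comparing R^2(IVs -> exposure) with
# R^2(IVs -> outcome); illustrative simulation only.
import numpy as np

rng = np.random.default_rng(6)
n = 2000
z = rng.normal(size=(n, 3))                # instruments (e.g., SNPs)
x = z @ np.array([0.4, 0.3, 0.2]) + rng.normal(size=n)   # exposure
y = 0.5 * x + rng.normal(size=n)           # outcome, caused by x

def r2(Z, v):
    M = np.column_stack([np.ones(n), Z])
    beta = np.linalg.lstsq(M, v, rcond=None)[0]
    return 1 - np.sum((v - M @ beta) ** 2) / np.sum((v - v.mean()) ** 2)

r2_x, r2_y = r2(z, x), r2(z, y)
# The instruments act on y only through x, so they explain less of y
direction = "x -> y" if r2_x > r2_y else "y -> x"
```

Because the instruments' effect on the outcome is attenuated by the causal coefficient, R2 for the outcome is systematically smaller in the true direction, which is the signal the method exploits.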
Spatially dependent data arises in many applications, and Gaussian processes are a popular modeling choice for these scenarios. While Bayesian analyses of these problems have proven to be successful, selecting prior distributions for these complex models remains a difficult task. In this work, we propose a principled approach for setting prior distributions on model variance components by placing a prior distribution on a measure of model fit. In particular, we derive the distribution of the prior coefficient of determination. Placing a beta prior distribution on this measure induces a generalized beta prime prior distribution on the global variance of the linear predictor in the model. This method can also be thought of as shrinking the fit towards the intercept‐only (null) model. We derive an efficient Gibbs sampler for the majority of the parameters and use Metropolis–Hasting updates for the others. Finally, the method is applied to a marine protection area dataset. We estimate the effect of marine policies on biodiversity and conclude that no‐take restrictions lead to a slight increase in biodiversity and that the majority of the variance in the linear predictor comes from the spatial effect.
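The prior construction described above can be illustrated in miniature: placing a Beta(a, b) prior on the prior R2 and transforming through R2 / (1 − R2) induces a beta prime prior on the variance of the linear predictor relative to the noise variance. A Monte Carlo sketch (assumed hyperparameter values, simplest non-generalized case):

```python
# Beta prior on R^2 -> beta prime prior on the variance ratio
# R^2 / (1 - R^2); Monte Carlo check of the induced prior mean.
import numpy as np

rng = np.random.default_rng(7)
a, b = 2.0, 4.0                            # assumed Beta hyperparameters
r2_draws = rng.beta(a, b, size=100_000)    # prior draws of model-fit R^2
var_ratio = r2_draws / (1 - r2_draws)      # induced beta prime(a, b) draws

# Beta prime(a, b) has mean a / (b - 1) for b > 1; here 2 / 3
prior_mean = var_ratio.mean()
```

Choosing a and b thus amounts to stating how strongly the fit is shrunk toward the intercept-only model before seeing the data, which is the interpretability the paper argues for.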