Choosing a suitable sample size in qualitative research is an area of conceptual debate and practical uncertainty. That sample size principles, guidelines and tools have been developed to enable researchers to set, and justify the acceptability of, their sample size is an indication that the issue constitutes an important marker of the quality of qualitative research. Nevertheless, research shows that sample size sufficiency reporting is often poor, if not absent, across a range of disciplinary fields.
A systematic analysis of single-interview-per-participant designs within three health-related journals from the disciplines of psychology, sociology and medicine, over a 15-year period, was conducted to examine whether and how sample sizes were justified and how sample size was characterised and discussed by authors. Data pertinent to sample size were extracted and analysed using qualitative and quantitative analytic techniques.
Our findings demonstrate that provision of sample size justifications in qualitative health research is limited; is not contingent on the number of interviews; and relates to the journal of publication. Defence of sample size was most frequently supported across all three journals with reference to the principle of saturation and to pragmatic considerations. Qualitative sample sizes were predominantly - and often without justification - characterised as insufficient (i.e., 'small') and discussed in the context of study limitations. Sample size insufficiency was seen to threaten the validity and generalisability of studies' results, with the latter being frequently conceived in nomothetic terms.
We recommend, firstly, that qualitative health researchers be more transparent about evaluations of their sample size sufficiency, situating these within broader and more encompassing assessments of data adequacy. Secondly, we invite researchers to consider critically how saturation parameters found in prior methodological studies and sample size community norms might best inform, and apply to, their own project, and we suggest that data adequacy is best appraised with reference to features intrinsic to the study at hand. Finally, those reviewing papers have a vital role in supporting and encouraging transparent study-specific reporting.
Abstract
The estimation of power in two-level models used to analyze data that are hierarchically structured is particularly complex because the outcome contains variance at two levels that is regressed on predictors at two levels. Methods for the estimation of power in two-level models have been based on formulas and Monte Carlo simulation. We provide a hands-on tutorial illustrating how a priori and post hoc power analyses for the most frequently used two-level models are conducted. We describe how a population model for the power analysis can be specified by using standardized input parameters and how the power analysis is implemented in SIMR, a very flexible power estimation method based on Monte Carlo simulation. Finally, we provide case-sensitive rules of thumb for deriving sufficient sample sizes as well as minimum detectable effect sizes that yield a power ≥ .80 for the effects and input parameters most frequently analyzed by psychologists. For medium variance components, the results indicate that with lower level (L1) sample sizes up to 30 and higher level (L2) sample sizes up to 200, medium and large fixed effects can be detected. However, small L2 direct- or cross-level interaction effects cannot be detected with up to 200 clusters. The tutorial and guidelines should be of help to researchers dealing with multilevel study designs such as individuals clustered within groups or repeated measurements clustered within individuals.
Translational Abstract
In psychological research, two-level models are used to analyze data that are hierarchically structured. Such hierarchies in data can occur when participants are clustered within groups or repeated measurements are made for the same participants. Hierarchically structured data lead to quite complex dependencies among variances: (a) the outcome variable contains variance at two different levels, (b) predictor variables at both levels relate to outcome variance at the respective level (direct effects), (c) the size of the effect of a predictor variable on the lower level can vary between clusters at the higher level, and (d) this variation can be explained by predictor variables of the higher level (so-called cross-level interaction effects). All these variances and their dependencies must be specified to estimate the likelihood of obtaining statistically significant effects in a two-level model, known as the statistical power. We provide a hands-on tutorial illustrating the specification of these parameters and the implementation of a power analysis in the statistical environment R. We also provide rules of thumb for the sample sizes necessary to detect an effect of a certain size with sufficient power.
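Since simr is an R package, a compact Python stand-in can make the tutorial's Monte Carlo logic concrete: simulate data from a standardized two-level population model, refit the model, and take the proportion of significant tests as the power estimate. The sketch below uses statsmodels' MixedLM for a random-intercept model with one L1 predictor; all parameter values (50 clusters of 10 observations, a standardized effect of 0.30, ICC = .20) are illustrative, not the tutorial's.

```python
import warnings
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

warnings.filterwarnings("ignore")  # silence routine convergence chatter
rng = np.random.default_rng(1)

def simulate_power(n_clusters=50, n_per_cluster=10, beta_x=0.30,
                   icc=0.20, n_sims=200, alpha=0.05):
    """Monte Carlo power for the L1 fixed effect beta_x in a random-intercept
    model; inputs are standardized so the random-intercept and residual
    variances sum to 1 (the ICC is the intercept share)."""
    tau, sigma = np.sqrt(icc), np.sqrt(1 - icc)
    hits = 0
    for _ in range(n_sims):
        cluster = np.repeat(np.arange(n_clusters), n_per_cluster)
        u = rng.normal(0, tau, n_clusters)[cluster]   # random intercepts
        x = rng.normal(size=cluster.size)             # L1 predictor
        y = beta_x * x + u + rng.normal(0, sigma, cluster.size)
        data = pd.DataFrame({"y": y, "x": x, "cluster": cluster})
        fit = smf.mixedlm("y ~ x", data, groups=data["cluster"]).fit()
        hits += fit.pvalues["x"] < alpha              # Wald test of beta_x
    return hits / n_sims

print(simulate_power())
```

Raising n_sims (e.g., to 1,000 or more) tightens the Monte Carlo error of the power estimate at the cost of run time.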
An important step when designing an empirical study is to justify the sample size that will be collected. The key aim of a sample size justification for such studies is to explain how the collected data is expected to provide valuable information given the inferential goals of the researcher. In this overview article six approaches are discussed to justify the sample size in a quantitative empirical study: 1) collecting data from (almost) the entire population, 2) choosing a sample size based on resource constraints, 3) performing an a-priori power analysis, 4) planning for a desired accuracy, 5) using heuristics, or 6) explicitly acknowledging the absence of a justification. An important question to consider when justifying sample sizes is which effect sizes are deemed interesting, and the extent to which the data that is collected informs inferences about these effect sizes. Depending on the sample size justification chosen, researchers could consider 1) what the smallest effect size of interest is, 2) which minimal effect size will be statistically significant, 3) which effect sizes they expect (and what they base these expectations on), 4) which effect sizes would be rejected based on a confidence interval around the effect size, 5) which ranges of effects a study has sufficient power to detect based on a sensitivity power analysis, and 6) which effect sizes are expected in a specific research area. Researchers can use the guidelines presented in this article, for example by using the interactive form in the accompanying online Shiny app, to improve their sample size justification, and hopefully, align the informational value of a study with their inferential goals.
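As a concrete instance of the third approach (an a-priori power analysis) and of the sensitivity power analysis mentioned in point 5, the short sketch below uses statsmodels for an independent-samples t-test; the smallest effect size of interest of d = 0.5 is a placeholder, not a recommendation from the article, which is tool-agnostic and offers its own Shiny app.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: n per group needed to detect d = 0.5 with power .80 at alpha = .05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n per group: {n_per_group:.1f}")    # about 64

# Sensitivity: smallest d detectable with power .80 given n = 50 per group.
detectable_d = analysis.solve_power(nobs1=50, alpha=0.05, power=0.80)
print(f"detectable d: {detectable_d:.2f}")  # about 0.57
```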
Cronbach’s alpha is one of the most widely used measures of reliability in the social and organizational sciences. Current practice is to report the sample value of Cronbach’s alpha reliability, but a confidence interval for the population reliability value also should be reported. The traditional confidence interval for the population value of Cronbach’s alpha makes an unnecessarily restrictive assumption that the multiple measurements have equal variances and equal covariances. We propose a confidence interval that does not require equal variances or equal covariances. The results of a simulation study demonstrated that the proposed method performed better than alternative methods. We also present some sample size formulas that approximate the sample size requirements for desired power or desired confidence interval precision. R functions are provided that can be used to implement the proposed confidence interval and sample size methods.
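The paper's proposed interval and R functions are not reproduced here; as a generic stand-in that likewise avoids the equal-variances and equal-covariances assumption, the Python sketch below computes the sample Cronbach's alpha together with a nonparametric bootstrap percentile confidence interval (the simulated five-item data are hypothetical).

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, k_items) array; alpha = k/(k-1) * (1 - sum of
    item variances / variance of the total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def alpha_ci(items, n_boot=5000, level=0.95, seed=0):
    """Bootstrap percentile CI: resample respondents with replacement."""
    rng = np.random.default_rng(seed)
    n = items.shape[0]
    boots = [cronbach_alpha(items[rng.integers(0, n, n)]) for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return cronbach_alpha(items), (lo, hi)

# Illustrative data: 200 respondents, 5 items sharing a common true score.
rng = np.random.default_rng(42)
true_score = rng.normal(size=(200, 1))
items = true_score + rng.normal(scale=1.0, size=(200, 5))
print(alpha_ci(items))
```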
Structural equation modeling (SEM) is a widespread approach to test substantive hypotheses in psychology and other social sciences. However, most studies involving structural equation models neither report statistical power analysis as a criterion for sample size planning nor evaluate the achieved power of the performed tests. In this tutorial, we provide a step-by-step illustration of how a priori, post hoc, and compromise power analyses can be conducted for a range of different SEM applications. Using illustrative examples and the R package semPower, we demonstrate power analyses for hypotheses regarding overall model fit, global model comparisons, particular individual model parameters, and differences in multigroup contexts (such as in tests of measurement invariance). We encourage researchers to yield reliable, and thus more replicable, results based on thoughtful sample size planning, especially if small or medium-sized effects are expected.
Translational Abstract
Structural equation modeling (SEM) is a widespread approach to test substantive hypotheses in psychology and other social sciences. Whenever hypothesis tests are performed, researchers should ensure that the sample size is sufficiently large to detect the hypothesized effect. Power analyses can be used to determine the required sample size to identify the effect of interest with a desired level of statistical power (i.e., the probability of rejecting an incorrect null hypothesis). Vice versa, power analyses can also be used to determine the achieved power of a test, given an effect and a particular sample size. However, most studies involving SEM neither conduct a power analysis to inform sample size planning nor evaluate the achieved power of the performed tests. In this tutorial, we show and illustrate how power analyses can be used to identify the required sample size to detect a certain effect of interest or to determine the probability of a conducted test to detect a certain effect. These analyses are exemplified for the overall model as well as for individual model parameters, considering both models referring to a single group and models assessing differences between multiple groups.
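For the overall-model case, the underlying calculation can be shown compactly. semPower is an R package, so the following Python sketch is a stand-in; it implements the familiar noncentral chi-square approach in which an assumed misfit, expressed here through the RMSEA as in MacCallum and colleagues' framework, fixes the noncentrality parameter. The df, RMSEA, and N values are illustrative.

```python
from scipy.stats import chi2, ncx2

def sem_fit_power(n, df, rmsea, alpha=0.05):
    """Power of the chi-square test of exact fit (H0: RMSEA = 0)
    against an assumed population misfit of the given RMSEA."""
    crit = chi2.ppf(1 - alpha, df)       # rejection threshold under H0
    ncp = (n - 1) * df * rmsea ** 2      # noncentrality under the misfit
    return 1 - ncx2.cdf(crit, df, ncp)   # P(reject | misfit)

# Post hoc: achieved power with N = 200, df = 40, RMSEA = 0.08.
print(f"power: {sem_fit_power(200, 40, 0.08):.3f}")

# A priori: smallest N giving power >= .80 (simple search).
n = 50
while sem_fit_power(n, 40, 0.08) < 0.80:
    n += 1
print(f"required N: {n}")
```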
This study estimates empirically derived guidelines for effect size interpretation for research in social psychology overall and sub-disciplines within social psychology, based on analysis of the true distributions of the two types of effect size measures widely used in social psychology (correlation coefficients and standardized mean differences). Analysis of empirically derived distributions of 12,170 correlation coefficients and 6,447 Cohen's d statistics extracted from studies included in 134 published meta-analyses revealed that the 25th, 50th, and 75th percentiles corresponded to correlation coefficient values of 0.12, 0.24, and 0.41 and to Cohen's d values of 0.15, 0.36, and 0.65, respectively. The analysis suggests that the widely used Cohen's guidelines tend to overestimate medium and large effect sizes. Empirically derived effect size distributions in social psychology overall and its sub-disciplines can be used both for effect size interpretation and for sample size planning when other information about effect size is not available.
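As a worked example of the sample size planning use mentioned in the closing sentence (not a calculation from the paper itself), the sketch below plugs the empirical correlation percentiles into the standard Fisher-z approximation for the N needed to detect a correlation with 80% power.

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Fisher-z approximation: n = ((z_alpha + z_beta) / atanh(r))^2 + 3."""
    z_r = math.atanh(r)                        # Fisher z of the target r
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return math.ceil(((z_a + z_b) / z_r) ** 2 + 3)

for label, r in [("25th pct", 0.12), ("median", 0.24), ("75th pct", 0.41)]:
    print(f"{label} r = {r}: N = {n_for_correlation(r)}")
# Roughly N = 543, 134, and 45; Cohen's "medium" r = .30 needs about 85.
```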
In prediction model research, external validation is needed to examine an existing model's performance using data independent of that used for model development. Current external validation studies often suffer from small sample sizes and consequently imprecise predictive performance estimates. To address this, we propose how to determine the minimum sample size needed for a new external validation study of a prediction model for a binary outcome. Our calculations aim to precisely estimate calibration (Observed/Expected and calibration slope), discrimination (C-statistic), and clinical utility (net benefit). For each measure, we propose closed-form and iterative solutions for calculating the minimum sample size required. These require specifying: (i) target SEs (confidence interval widths) for each estimate of interest, (ii) the anticipated outcome event proportion in the validation population, (iii) the prediction model's anticipated (mis)calibration and variance of linear predictor values in the validation population, and (iv) potential risk thresholds for clinical decision-making. The calculations can also be used to inform whether the sample size of an existing (already collected) dataset is adequate for external validation. We illustrate our proposal for external validation of a prediction model for mechanical heart valve failure with an expected outcome event proportion of 0.018. Calculations suggest at least 9835 participants (177 events) are required to precisely estimate the calibration and discrimination measures, with this number driven by the calibration slope criterion, which we anticipate will often be the case. Also, 6443 participants (116 events) are required to precisely estimate net benefit at a risk threshold of 8%. Software code is provided.
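One of the closed-form criteria can be illustrated briefly. The Python sketch below implements the sample size calculation for the Observed/Expected ratio, using SE(ln(O/E)) = sqrt((1 − φ)/(nφ)) under an anticipated O/E of 1; the target SE is an illustrative choice rather than the value used in the paper's worked example.

```python
import math

def n_for_oe(phi, se_ln_oe):
    """Minimum validation sample size so that SE(ln(O/E)) <= se_ln_oe,
    assuming the anticipated O/E is 1 (good calibration-in-the-large).
    phi: anticipated outcome event proportion in the validation data."""
    n = (1 - phi) / (phi * se_ln_oe ** 2)
    return math.ceil(n), math.ceil(n * phi)   # (participants, expected events)

# phi = 0.018 matches the heart-valve example; the target SE of 0.10
# (a 95% CI for O/E of roughly 0.82 to 1.22) is an illustrative choice.
n, events = n_for_oe(phi=0.018, se_ln_oe=0.10)
print(f"n = {n}, expected events = {events}")
```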