We consider a multiple‐hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block H1,…,Hk of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point k. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection by using recent results on p‐values in sequential model selection settings.
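As an illustration of such a stopping rule, the sketch below implements a ForwardStop-style cutoff in Python: reject H1,…,Hk for the largest k at which the running mean of the transformed p-values −log(1 − p_i) stays below the target level α. The abstract above does not name its two procedures, so treat this as a generic example of the ordered-rejection idea rather than the authors' method.

```python
import numpy as np

def forward_stop(p, alpha=0.1):
    """ForwardStop-style cutoff for an ordered list of p-values p_1, ..., p_n:
    reject H_1, ..., H_khat for the largest khat whose running mean of
    -log(1 - p_i) is at most alpha (returns 0 if nothing is rejected)."""
    p = np.asarray(p, dtype=float)
    running_mean = np.cumsum(-np.log1p(-p)) / np.arange(1, len(p) + 1)
    below = np.flatnonzero(running_mean <= alpha)
    return 0 if below.size == 0 else int(below[-1]) + 1
```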
Multiple testing problems arising in modern scientific applications can involve simultaneously testing thousands or even millions of hypotheses, with relatively few true signals. In this article, we consider the multiple testing problem where prior information is available (for instance, from an earlier study under different experimental conditions) that can allow us to test the hypotheses as a ranked list to increase the number of discoveries. Given an ordered list of n hypotheses, the aim is to select a data-dependent cutoff k and declare the first k hypotheses to be statistically significant while bounding the false discovery rate (FDR). Generalizing several existing methods, we develop a family of "accumulation tests" to choose a cutoff k that adapts to the amount of signal at the top of the ranked list. We introduce a new method in this family, the HingeExp method, which offers higher power to detect true signals compared to existing techniques. Our theoretical results prove that these methods control a modified FDR in finite samples, and characterize the power of the methods in the family. We apply the tests to simulated data, including a high-dimensional model selection problem for linear regression. We also compare accumulation tests to existing methods for multiple testing on a real data problem of identifying differential gene expression over a dosage gradient. Supplementary materials for this article are available online.
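The accumulation-test idea can be sketched in the same style with a pluggable accumulation function h satisfying ∫ h(p) dp = 1 over [0, 1]. In the sketch below, the ForwardStop and SeqStep choices of h are standard members of this family; the HingeExp form is written from memory as an assumption, and its exact constants should be checked against the paper before use.

```python
import numpy as np

def forward_stop_h(p):
    return -np.log1p(-p)                       # h(p) = log(1 / (1 - p))

def seq_step_h(p, C=2.0):
    return C * (p > 1 - 1 / C)                 # h(p) = C * 1{p > 1 - 1/C}

def hinge_exp_h(p, C=2.0):
    # Assumed form of HingeExp: the exponential/log transform applied only
    # above the hinge at 1 - 1/C; verify the constants against the paper.
    with np.errstate(divide="ignore"):
        return np.where(p > 1 - 1 / C, C * np.log(1 / (C * (1 - p))), 0.0)

def accumulation_test(p, h, alpha=0.1):
    """Largest k with (1/k) * sum_{i<=k} h(p_i) <= alpha (0 if none)."""
    p = np.asarray(p, dtype=float)
    running_mean = np.cumsum(h(p)) / np.arange(1, len(p) + 1)
    below = np.flatnonzero(running_mean <= alpha)
    return 0 if below.size == 0 else int(below[-1]) + 1

# e.g. k_hat = accumulation_test(p_ordered, hinge_exp_h, alpha=0.1)
```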
Multiple hypothesis testing in experimental economics
List, John A.; Shaikh, Azeem M.; Xu, Yang
Experimental Economics: A Journal of the Economic Science Association, Volume 22, Issue 4, December 2019
The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
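As a rough sketch of the kind of bootstrap procedure described here, the code below implements a single-step max-t bootstrap over several outcomes under simple random assignment: the (1 − α) quantile of the bootstrapped maximal |t| statistic serves as a common critical value, so dependence across outcomes is taken into account. The paper's actual procedure is a stepdown refinement in the spirit of Romano and Wolf (2010); the names and design below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def maxt_bootstrap_fwer(y, treat, alpha=0.05, n_boot=2000, seed=0):
    """Single-step max-t bootstrap: test the treatment effect on each of m
    outcomes while controlling the familywise error rate, using the
    bootstrap distribution of the maximal |t| statistic to account for
    dependence across outcomes.

    y:     (n, m) array of m outcomes for n units
    treat: (n,) 0/1 indicator of treatment assigned by simple random sampling
    Returns a boolean array with one rejection decision per outcome."""
    rng = np.random.default_rng(seed)
    y1, y0 = y[treat == 1], y[treat == 0]

    def t_stats(a, b):
        diff = a.mean(axis=0) - b.mean(axis=0)
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        return diff / se

    t_obs = t_stats(y1, y0)

    # Bootstrap the max-|t| statistic with each arm re-centered at its
    # observed mean, i.e. under a null of no effect for every outcome.
    max_t = np.empty(n_boot)
    for b in range(n_boot):
        a_star = y1[rng.integers(0, len(y1), len(y1))] - y1.mean(axis=0)
        b_star = y0[rng.integers(0, len(y0), len(y0))] - y0.mean(axis=0)
        max_t[b] = np.max(np.abs(t_stats(a_star, b_star)))

    critical_value = np.quantile(max_t, 1 - alpha)
    return np.abs(t_obs) > critical_value
```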
In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)—the expected fraction of false discoveries among all discoveries—is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap—their construction does not require any new data—and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high.
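The selection step of the knockoff filter can be illustrated compactly, assuming the feature statistics W_j (for instance, Lasso-based comparisons of each variable with its knockoff) have already been computed; constructing the knockoff variables themselves is the substantive part of the method and is omitted here. The sketch uses the "+1" (knockoff+) variant of the data-dependent threshold.

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Selection step of a knockoff-filter-style procedure.

    W: one feature statistic per variable; large positive W_j suggests the
       original variable beats its knockoff, and null W_j are sign-symmetric
       by construction of the knockoffs (not shown here).
    q: target FDR level.
    Returns the indices of the selected variables."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    threshold = np.inf
    for t in candidates:
        # knockoff+ estimate of the false discovery proportion at threshold t
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            threshold = t
            break
    return np.flatnonzero(W >= threshold)
```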
In this article, we present a statistical significance test for necessary conditions. This is an elaboration of necessary condition analysis (NCA), which is a data analysis approach that estimates the necessity effect size of a condition X for an outcome Y. NCA puts a ceiling on the data, representing the level of X that is necessary (but not sufficient) for a given level of Y. The empty space above the ceiling relative to the total empirical space characterizes the necessity effect size. We propose a statistical significance test that evaluates the evidence against the null hypothesis of an effect being due to chance. Such a randomness test helps protect researchers from making Type 1 errors and drawing false positive conclusions. The test is an “approximate permutation test.” The test is available in NCA software for R. We provide suggestions for further statistical development of NCA.
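A hedged sketch of an approximate permutation test of this kind is shown below. The effect-size function is a deliberately simplified stand-in (the empty fraction of the scope above a running-maximum step ceiling), not the ceiling estimators implemented in the NCA software, and the function names are illustrative.

```python
import numpy as np

def ceiling_zone_fraction(x, y):
    """Crude stand-in for a necessity effect size: the fraction of the
    empirical scope lying above a non-decreasing step ceiling (the running
    maximum of y as x increases). The NCA software uses more refined
    ceiling estimators; this is only for illustration."""
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    scope = (xs[-1] - xs[0]) * (ys.max() - ys.min())
    if scope == 0:
        return 0.0
    ceiling = np.maximum.accumulate(ys)
    empty = np.sum((ys.max() - ceiling[:-1]) * np.diff(xs))
    return empty / scope

def nca_permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Approximate permutation test: break the X-Y pairing, recompute the
    effect size, and report how often a permuted effect size is at least
    as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = ceiling_zone_fraction(x, y)
    count = sum(
        ceiling_zone_fraction(x, rng.permutation(y)) >= observed
        for _ in range(n_perm)
    )
    return (1 + count) / (n_perm + 1)
```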
Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers' intuition to collect more data to push an indecisive result into a decisive region. In this contribution, we investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. In this procedure, which we call Sequential Bayes Factors (SBFs), Bayes factors are computed until an a priori defined level of evidence is reached. This allows flexible sampling plans and is not dependent upon correct effect size guesses in an a priori power analysis. We investigated the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups. Compared with optimal NHST, the SBF design typically needs 50% to 70% smaller samples to reach a conclusion about the presence of an effect, while having the same or lower long-term rate of wrong inference.
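A minimal sketch of a sequential Bayes-factor stopping rule for a two-group mean difference is given below. It uses the BIC approximation to the Bayes factor as a convenient stand-in for the default Bayes factors studied in the paper; the thresholds, the minimum sample size, and the helper names are illustrative assumptions.

```python
import numpy as np

def bic_gaussian(rss, n, k):
    # BIC of a Gaussian model with MLE variance rss / n and k parameters
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return k * np.log(n) - 2 * loglik

def bf10_two_groups(a, b):
    # BIC approximation to the Bayes factor for "two means" vs "one mean"
    y = np.concatenate([a, b])
    n = len(y)
    rss0 = np.sum((y - y.mean()) ** 2)
    rss1 = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)
    bic0 = bic_gaussian(rss0, n, k=2)   # common mean + variance
    bic1 = bic_gaussian(rss1, n, k=3)   # two means + variance
    return np.exp((bic0 - bic1) / 2)

def sequential_bayes_factor(stream_a, stream_b, bound=10.0, n_min=10):
    """Collect one observation per group at a time and stop as soon as the
    Bayes factor crosses bound (evidence for H1) or 1/bound (for H0)."""
    a, b, bf = [], [], np.nan
    for xa, xb in zip(stream_a, stream_b):
        a.append(xa)
        b.append(xb)
        if len(a) >= n_min:
            bf = bf10_two_groups(np.array(a), np.array(b))
            if bf >= bound:
                return "H1", len(a), bf
            if bf <= 1.0 / bound:
                return "H0", len(a), bf
    return "undecided", len(a), bf
```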
This paper proposes general methods for the problem of multiple testing of a single hypothesis, with a standard goal of combining a number of $p$-values without making any assumptions about their dependence structure. A result by Rüschendorf (1982) and, independently, Meng (1993) implies that the $p$-values can be combined by scaling up their arithmetic mean by a factor of 2, and no smaller factor is sufficient in general. A similar result by Mattner about the geometric mean replaces 2 by e. Based on more recent developments in mathematical finance, specifically, robust risk aggregation techniques, we extend these results to generalized means; in particular, we show that $K$ $p$-values can be combined by scaling up their harmonic mean by a factor of $\log K$ asymptotically as $K$ tends to infinity. This leads to a generalized version of the Bonferroni–Holm procedure. We also explore methods using weighted averages of $p$-values. Finally, we discuss the efficiency of various methods of combining $p$-values and how to choose a suitable method in light of data and prior information.
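These combination rules translate directly into code. In the sketch below, the factor 2 for the arithmetic mean and the factor e for the geometric mean are the exact bounds cited in the abstract, while the log K scaling of the harmonic mean is only asymptotically valid (and only sensible for large K), as the abstract notes.

```python
import numpy as np

def combine_pvalues(p):
    """Combine K arbitrarily dependent p-values by scaled means."""
    p = np.asarray(p, dtype=float)
    K = len(p)
    arithmetic = min(1.0, 2.0 * p.mean())                    # exact factor 2
    geometric = min(1.0, np.e * np.exp(np.log(p).mean()))    # exact factor e
    # log K scaling of the harmonic mean is asymptotic (large K) only
    harmonic = min(1.0, np.log(K) * K / np.sum(1.0 / p))
    return {"arithmetic": arithmetic, "geometric": geometric, "harmonic": harmonic}
```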
• We show that “Keeping it maximal” comes with a cost for LMMs.
• Maximal models may lose power if their complexity is not supported by the data.
• Model selection can balance Type-I error rates with power.
Linear mixed-effects models have increasingly replaced mixed-model analyses of variance for statistical inference in factorial psycholinguistic experiments. Although LMMs have many advantages over ANOVA, they, like ANOVAs, require some care to set up for data analysis. One simple option, when numerically possible, is to fit the full variance-covariance structure of random effects (the maximal model; Barr, Levy, Scheepers & Tily, 2013), presumably to keep Type I error down to the nominal α in the presence of random effects. Although it is true that fitting a model with only random intercepts may lead to higher Type I error, fitting a maximal model also has a cost: it can lead to a significant loss of power. We demonstrate this with simulations and suggest that for typical psychological and psycholinguistic data, higher power is achieved without inflating the Type I error rate if a model selection criterion is used to select a random-effect structure that is supported by the data.
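A hedged sketch of this recommendation, using Python and statsmodels rather than the lme4-style tooling typical of psycholinguistics: simulate data whose true random-effect structure is intercepts-only, then compare a random-intercept model against a "maximal" model with a by-subject random slope via a likelihood-ratio test. The simulation design, variable names, and the chi-square reference (which is conservative at the boundary) are illustrative choices, not the authors' simulation setup.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated within-subject data whose true random-effect structure is
# intercepts-only (no random slope for x).
rng = np.random.default_rng(1)
n_subj, n_obs = 30, 20
subj = np.repeat(np.arange(n_subj), n_obs)
x = np.tile(np.linspace(-1, 1, n_obs), n_subj)
y = 2.0 + 0.3 * x + rng.normal(0, 1.0, n_subj)[subj] + rng.normal(0, 1.0, len(subj))
data = pd.DataFrame({"y": y, "x": x, "subj": subj})

# Random-intercept-only model vs. a "maximal" model that also has a
# by-subject random slope for x (ML fits so log-likelihoods are comparable).
m_simple = smf.mixedlm("y ~ x", data, groups=data["subj"]).fit(reml=False)
m_maximal = smf.mixedlm("y ~ x", data, groups=data["subj"], re_formula="~x").fit(reml=False)

# Likelihood-ratio test for the extra slope variance and covariance terms;
# the chi-square reference with 2 df is conservative because the slope
# variance sits on the boundary of the parameter space under the null.
lr = 2 * (m_maximal.llf - m_simple.llf)
print(f"LR = {lr:.2f}, p = {stats.chi2.sf(lr, df=2):.3f}")
```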