A/B tests are typically analyzed via frequentist
p
-values and confidence intervals, but these inferences are wholly unreliable if users endogenously choose samples sizes by
continuously monitoring
...their tests. We define
always valid p
-values and confidence intervals that let users try to take advantage of data as fast as it becomes available, providing valid statistical inference whenever they make their decision. Always valid inference can be interpreted as a natural interface for a sequential hypothesis test, which empowers users to implement a modified test tailored to them. In particular, we show in an appropriate sense that the measures we develop trade off sample size and power efficiently, despite a lack of prior knowledge of the user’s relative preference between these two goals. We also use always valid
p
-values to obtain multiple hypothesis testing control in the sequential context. Our methodology has been implemented in a large-scale commercial A/B testing platform to analyze hundreds of thousands of experiments to date.
Multiple hypothesis testing is a fundamental problem in high-dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests ...are performed simultaneously to find if any single-nucleotide polymorphisms (SNPs) are associated with some traits and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In this article, we propose a novel method—based on principal factor approximation—that successfully subtracts the common dependence and weakens significantly the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used and provide a consistent estimate of realized FDP. This result has important applications in controlling false discovery rate and FDP. Our estimate of realized FDP compares favorably with Efron's approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications. We also propose a dependence-adjusted procedure that is more powerful than the fixed-threshold procedure. Supplementary material for this article is available online.
Full text
Available for:
BFBNIB, GIS, IJS, INZLJ, KISLJ, NMLJ, NUK, PNG, SAZU, UL, UM, UPUK, ZRSKP
We study a hypothesis testing problem in which data are compressed distributively and sent to a detector that seeks to decide between two possible distributions for the data. The aim is to ...characterize all achievable encoding rates and exponents of the type 2 error probability when the type 1 error probability is at most a fixed value. For related problems in distributed source coding, schemes based on random binning perform well and are often optimal. For distributed hypothesis testing, however, the use of binning is hindered by the fact that the overall error probability may be dominated by errors in the binning process. We show that despite this complication, binning is optimal for a class of problems in which the goal is to "test against conditional independence." We then use this optimality result to give an outer bound for a more general class of instances of the problem.
Complex multivariate testing problems are frequently encountered in many scientific disciplines, such as engineering, medicine and the social sciences. As a result, modern statistics needs ...permutation testing for complex data with low sample size and many variables, especially in observational studies.
Gaussian differential privacy Dong, Jinshuo; Roth, Aaron; Su, Weijie J.
Journal of the Royal Statistical Society. Series B, Statistical methodology,
February 2022, 2022-02-01, 20220201, Volume:
84, Issue:
1
Journal Article
Peer reviewed
Open access
In the past decade, differential privacy has seen remarkable success as a rigorous and practical formalization of data privacy. This privacy definition and its divergence based relaxations, however, ...have several acknowledged weaknesses, either in handling composition of private algorithms or in analysing important primitives like privacy amplification by subsampling. Inspired by the hypothesis testing formulation of privacy, this paper proposes a new relaxation of differential privacy, which we term ‘f‐differential privacy’ (f‐DP). This notion of privacy has a number of appealing properties and, in particular, avoids difficulties associated with divergence based relaxations. First, f‐DP faithfully preserves the hypothesis testing interpretation of differential privacy, thereby making the privacy guarantees easily interpretable. In addition, f‐DP allows for lossless reasoning about composition in an algebraic fashion. Moreover, we provide a powerful technique to import existing results proven for the original differential privacy definition to f‐DP and, as an application of this technique, obtain a simple and easy‐to‐interpret theorem of privacy amplification by subsampling for f‐DP. In addition to the above findings, we introduce a canonical single‐parameter family of privacy notions within the f‐DP class that is referred to as ‘Gaussian differential privacy’ (GDP), defined based on hypothesis testing of two shifted Gaussian distributions. GDP is the focal privacy definition among the family of f‐DP guarantees due to a central limit theorem for differential privacy that we prove. More precisely, the privacy guarantees of any hypothesis testing based definition of privacy (including the original differential privacy definition) converges to GDP in the limit under composition. We also prove a Berry–Esseen style version of the central limit theorem, which gives a computationally inexpensive tool for tractably analysing the exact composition of private algorithms. Taken together, this collection of attractive properties render f‐DP a mathematically coherent, analytically tractable and versatile framework for private data analysis. Finally, we demonstrate the use of the tools we develop by giving an improved analysis of the privacy guarantees of noisy stochastic gradient descent.
Full text
Available for:
BFBNIB, FZAB, GIS, IJS, IZUM, KILJ, NLZOH, NUK, OILJ, PILJ, SAZU, SBCE, SBMB, UL, UM, UPUK
Abstract While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, ...which uses the capacity of machine learning algorithms to notice patterns people might not. We illustrate the procedure with a concrete application: judge decisions about whom to jail. We begin with a striking fact: the defendant’s face alone matters greatly for the judge’s jailing decision. In fact, an algorithm given only the pixels in the defendant’s mug shot accounts for up to half of the predictable variation. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: they are not explained by demographics (e.g., race) or existing psychology research, nor are they already known (even if tacitly) to people or experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional data set (e.g., cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our article is that hypothesis generation is a valuable activity, and we hope this encourages future work in this largely “prescientific” stage of science.
Methods Matter Brodeur, Abel; Cook, Nikolai; Heyes, Anthony
The American economic review,
11/2020, Volume:
110, Issue:
11
Journal Article
Peer reviewed
The credibility revolution in economics has promoted causal identification using randomized control trials (RCT), difference-in-differences (DID), instrumental variables (IV) and regression ...discontinuity design (RDD). Applying multiple approaches to over 21,000 hypothesis tests published in 25 leading economics journals, we find that the extent of p-hacking and publication bias varies greatly by method. IV (and to a lesser extent DID) are particularly problematic. We find no evidence that (i) papers published in the Top 5 journals are different to others; (ii) the journal “revise and resubmit” process mitigates the problem; (iii) things are improving through time.
Full text
Available for:
BFBNIB, CEKLJ, INZLJ, IZUM, KILJ, NMLJ, NUK, PILJ, PNG, SAZU, UL, UM, UPUK, ZRSKP