There is a growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user’s context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem as it learns based on each user’s historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an “optimized” intervention for real-world deployment, we must assess the data evidence indicating that the RL algorithm is actually personalizing the treatments to its users. Due to the stochasticity in the RL algorithm, one may get a false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the RL algorithm’s stochasticity. We illustrate our methodology with a case study by analyzing the data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization both across all users and within specific users in the study.
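To make the resampling idea concrete, here is a minimal Python sketch (not the paper's exact procedure) of a Monte Carlo check that observed personalization exceeds what algorithm stochasticity alone would produce; `fit_algorithm` and `simulate_user` are hypothetical user-supplied callables.

```python
import numpy as np

def personalization_stat(policy_probs):
    """Spread of learned treatment probabilities across contexts:
    large values suggest the algorithm treats contexts differently."""
    return policy_probs.max() - policy_probs.min()

def resampling_check(fit_algorithm, simulate_user, observed_data,
                     n_resamples=500, seed=None):
    """Monte Carlo comparison of observed personalization against the
    distribution induced by algorithm stochasticity alone.

    fit_algorithm(data) -> array of per-context treatment probabilities
    simulate_user(rng)  -> synthetic trajectory with NO context effect
    (both callables are hypothetical placeholders)
    """
    rng = np.random.default_rng(seed)
    observed = personalization_stat(fit_algorithm(observed_data))
    null_stats = np.array([
        personalization_stat(fit_algorithm(simulate_user(rng)))
        for _ in range(n_resamples)
    ])
    # one-sided Monte Carlo p-value: small values indicate the apparent
    # personalization is unlikely to be a stochastic artifact
    return (1 + np.sum(null_stats >= observed)) / (1 + n_resamples)
```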
Consider an infinite sequence of independent, uniformly chosen points from \([0,1]^d\). After looking at each point in the sequence, an overseer is allowed to either keep it or reject it, and this choice may depend on the locations of all previously kept points. However, the overseer must keep at least one of every two consecutive points. We call a sequence generated in this fashion a two-thinning sequence. Here, the purpose of the overseer is to control the discrepancy of the empirical distribution of points, that is, after selecting \(n\) points, to reduce the maximal deviation of the number of points inside any axis-parallel hyper-rectangle of volume \(A\) from \(nA\). Our main result is an explicit low-complexity two-thinning strategy which guarantees discrepancy of \(O(\log^{2d+1} n)\) for all \(n\) with high probability, compared with \(\Theta(\sqrt{n \log\log n})\) without thinning. The case \(d=1\) of this result answers a question of Benjamini. We also extend the construction to achieve the same asymptotic bound for \((1+\beta)\)-thinning, a set-up in which rejecting is only allowed with probability \(\beta\) independently for each point. In addition, we suggest an improved and simplified strategy which we conjecture to guarantee discrepancy of \(O(\log^{d+1} n)\), compared with \(\Theta(\log^d n)\), the best known construction of a low-discrepancy sequence. Finally, we provide theoretical and empirical evidence for our conjecture, and provide simulations supporting the viability of our construction for applications.
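For intuition, here is a toy Python sketch of the \(d=1\) set-up under a deliberately simple "choose one of two candidates" rule (not the strategy analyzed in the paper): the overseer keeps whichever candidate better balances the two halves of \([0,1]\), and an anchored-interval discrepancy is computed for comparison against plain i.i.d. sampling.

```python
import numpy as np

def two_thinning(n, seed=None):
    """Toy d=1 thinning rule (NOT the paper's strategy): of each pair
    of i.i.d. Uniform[0,1] candidates, keep the one that better
    balances the number of kept points in [0,1/2) vs [1/2,1)."""
    rng = np.random.default_rng(seed)
    kept = np.empty(n)
    left = 0  # number of kept points falling in [0, 1/2)
    for i in range(n):
        a, b = rng.random(2)          # the two candidate points
        want_left = left <= i - left  # left half currently underfilled
        kept[i] = a if (a < 0.5) == want_left else b
        left += kept[i] < 0.5
    return kept

def interval_discrepancy(points):
    """sup over t of |#{x_i < t} - n*t| for anchored intervals [0,t)."""
    pts = np.sort(points)
    n = len(pts)
    i = np.arange(1, n + 1)
    return max(np.max(i - n * pts), np.max(n * pts - (i - 1)))

iid = np.random.default_rng(1).random(4096)
print(interval_discrepancy(iid), interval_discrepancy(two_thinning(4096)))
```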
The growth in the number of algorithms to identify patterns in modern large-scale datasets has introduced a new dilemma for practitioners: How does one choose between the numerous methods? In supervised machine learning, accuracy on a hold-out dataset is the flagship measure for making this choice. This dissertation presents research that can provide principled guidance for making choices in three popular settings where such a flagship measure is not readily available. (I) Convergence of Markov chain Monte Carlo sampling algorithms, used commonly in Bayesian inference, Monte Carlo integration, and stochastic simulation: We provide explicit non-asymptotic guarantees for state-of-the-art sampling algorithms in high dimensions that can help the user pick a sampling method and the number of iterations based on the computational budget at hand. (II) Statistical-computational challenges with mixture model estimation, used commonly with heterogeneous data: We provide non-asymptotic guarantees for parameter estimation with Expectation-Maximization when the number of components is not known, and characterize the number of samples and iterations needed for the desired accuracy, which can inform the user of the potential double-edged price of dealing with noisy data in high dimensions. (III) Reliable estimation of heterogeneous treatment effects (HTE) in causal inference, crucial for decision making in medicine and public policy: We introduce a data-driven methodology, StaDISC, that is useful for validating commonly used models for estimating HTE and for discovering interpretable and stable subgroups with HTE using calibration. While we illustrate its usefulness in precision medicine, we believe the methodology to be of general interest in randomized experiments.
Summary
Building on Yu and Kumbier's predictability, computability and stability (PCS) framework, and focusing on randomised experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC) with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999–2000 VIGOR study, an 8076-patient randomised controlled trial that compared the risk of adverse events from a then newly approved drug, rofecoxib (Vioxx), with that from an older drug, naproxen. Vioxx was found to, on average and in comparison with naproxen, reduce the risk of gastrointestinal events but increase the risk of thrombotic cardiovascular events. Applying StaDISC, we fit 18 popular conditional average treatment effect (CATE) estimators for both outcomes and use calibration to demonstrate their poor global performance. However, they are locally well-calibrated and stable, enabling the identification of patient groups with larger than (estimated) average treatment effects. In fact, StaDISC discovers three clinically interpretable subgroups each for the gastrointestinal outcome (totalling 29.4% of the study size) and the thrombotic cardiovascular outcome (totalling 11.0%). Complementary analyses of the discovered subgroups using the 2001–2004 APPROVe study, a separate independently conducted randomised controlled trial with 2587 patients, provide further supporting evidence for the promise of StaDISC.
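A minimal sketch of the calibration step, under simplifying assumptions (quantile buckets of the estimated CATE, randomized treatment assignment, and every bucket containing both arms); the helper name `cate_calibration` is ours, not from the paper's code.

```python
import numpy as np

def cate_calibration(cate_hat, y, t, n_bins=10):
    """Quantile-bucket calibration check for a CATE estimator in a
    randomized experiment (a simplified sketch in the spirit of
    StaDISC).

    cate_hat: estimated CATEs, y: outcomes, t: 0/1 treatment indicator.
    Assumes every bucket contains both treated and control units.
    """
    order = np.argsort(cate_hat)
    rows = []
    for b in np.array_split(order, n_bins):
        estimated = cate_hat[b].mean()
        # treated-minus-control mean is unbiased under randomization
        observed = y[b][t[b] == 1].mean() - y[b][t[b] == 0].mean()
        rows.append((estimated, observed))
    # a well-calibrated estimator has the two columns tracking closely
    return np.array(rows)
```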
Several estimation techniques assume the validity of Gaussian approximations. Interestingly, these methods have proven to work very well for high-dimensional data even when the underlying distributions are not Gaussian. We attempt to bridge the gap between this oft-used computational assumption and the theoretical understanding of why it works, by employing recent results on random projections onto low-dimensional subspaces together with concentration inequalities.
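As one hedged illustration of the random-projection phenomenon alluded to above (a generic Johnson-Lindenstrauss-style demonstration, not the specific results employed in the work), the snippet below shows how a Gaussian projection preserves norms up to a small multiplicative error; the dimensions chosen are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10_000, 200, 50
X = rng.standard_normal((n, d))
# Gaussian projection onto k dimensions, scaled so norms are preserved
# in expectation; concentration keeps every individual ratio near 1.
P = rng.standard_normal((d, k)) / np.sqrt(k)
ratios = np.linalg.norm(X @ P, axis=1) / np.linalg.norm(X, axis=1)
print(ratios.min(), ratios.max())  # both close to 1.0
```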
We introduce kernel thinning, a new procedure for compressing a distribution \(\mathbb{P}\) more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel \(\mathbf{k}_{\star}\) and \(O(n^2)\) time, kernel thinning compresses an \(n\)-point approximation to \(\mathbb{P}\) into a \(\sqrt{n}\)-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. The maximum discrepancy in integration error is \(O_d(n^{-1/2}\sqrt{\log n})\) in probability for compactly supported \(\mathbb{P}\) and \(O_d(n^{-\frac{1}{2}} (\log n)^{(d+1)/2}\sqrt{\log\log n})\) for sub-exponential \(\mathbb{P}\) on \(\mathbb{R}^d\). In contrast, an equal-sized i.i.d. sample from \(\mathbb{P}\) suffers \(\Omega(n^{-1/4})\) integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform \(\mathbb{P}\) on \([0,1]^d\) but apply to general distributions on \(\mathbb{R}^d\) and a wide range of common kernels. Moreover, the same construction delivers near-optimal \(L^\infty\) coresets in \(O(n^2)\) time. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning, in dimensions \(d=2\) through \(100\).
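The integration-error comparison can be reproduced in miniature: the snippet below measures the MMD of an equal-sized i.i.d. subset as the baseline (a kernel-thinning coreset of the same size should score markedly lower; the authors' released implementation lives in the `goodpoints` package, whose API we do not reproduce here).

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def mmd(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of the kernel MMD between X and Y."""
    return np.sqrt(max(
        gaussian_kernel(X, X, sigma).mean()
        + gaussian_kernel(Y, Y, sigma).mean()
        - 2 * gaussian_kernel(X, Y, sigma).mean(), 0.0))

rng = np.random.default_rng(0)
n, d = 1024, 2
X = rng.normal(size=(n, d))                 # n-point approximation of P
iid = X[rng.choice(n, int(np.sqrt(n)), replace=False)]
print("MMD of sqrt(n)-point i.i.d. subset:", mmd(X, iid))
```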
Accurate evaluation of a language model's differential treatment of specific groups is critical to ensuring a positive and safe user experience. An ideal evaluation should be robust, extendable to new groups or attributes, and able to capture biases that appear in typical usage (rather than just extreme, rare cases). Relatedly, bias evaluation should surface not only egregious biases but also ones that are subtle and commonplace, such as a tendency to talk about appearances with regard to women. We present FairPair, an evaluation framework for assessing differential treatment that occurs during ordinary usage. FairPair operates through counterfactual pairs, but crucially, the paired continuations are grounded in the same demographic group, which ensures an equivalent comparison. Additionally, unlike prior work, our method factors in the inherent variability that comes from the generation process itself by measuring the sampling variability. We present an evaluation of several commonly used generative models and a qualitative analysis that indicates a preference for discussing family and hobbies with regard to women.
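A hedged sketch of a FairPair-style comparison (details differ from the paper): continuations of the counterfactual prompt are re-grounded to the original group before comparison, and repeated sampling exposes generation variability. Here `generate` is a hypothetical model-sampling callable, and a unigram total-variation gap stands in for the paper's metrics.

```python
from collections import Counter

def fairpair_gap(generate, prompt_a, prompt_b, name_a, name_b,
                 n_samples=50):
    """FairPair-style comparison sketch. generate(prompt) -> str samples
    a continuation from the model under test (hypothetical callable).
    Counterfactual continuations are re-grounded to the original name so
    both sets describe the same group; repeated sampling captures
    generation variability."""
    cont_a = [generate(prompt_a) for _ in range(n_samples)]
    cont_b = [generate(prompt_b).replace(name_b, name_a)
              for _ in range(n_samples)]
    freq_a = Counter(w for c in cont_a for w in c.lower().split())
    freq_b = Counter(w for c in cont_b for w in c.lower().split())
    na = sum(freq_a.values()) or 1
    nb = sum(freq_b.values()) or 1
    # total-variation gap between unigram distributions (a stand-in
    # for the paper's metrics)
    return 0.5 * sum(abs(freq_a[w] / na - freq_b[w] / nb)
                     for w in freq_a.keys() | freq_b.keys())
```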
Modern compression methods can summarize a target distribution \(\mathbb{P}\) more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to \(\mathbb{P}\). We introduce a new suite of compression methods suitable for compression with biased input sequences. Given \(n\) points targeting the wrong distribution and quadratic time, Stein kernel thinning (SKT) returns \(\sqrt{n}\) equal-weighted points with \(\widetilde{O}(n^{-1/2})\) maximum mean discrepancy (MMD) to \(\mathbb{P}\). For larger-scale compression tasks, low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as \(\text{poly-log}(n)\) weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering.
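Central to these methods is a Stein kernel, which can be evaluated from sample points and the score function alone; below is a sketch of the standard Langevin Stein kernel built from a Gaussian base kernel (the usual kernel Stein discrepancy construction, not the paper's code).

```python
import numpy as np

def stein_kernel(X, score, sigma=1.0):
    """Langevin Stein kernel from a Gaussian base kernel.
    X: (n, d) points; score: (n, d) values of grad log p at X."""
    d = X.shape[1]
    U = X[:, None, :] - X[None, :, :]           # U[i, j] = x_i - x_j
    sq = (U ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))            # base kernel matrix
    sx_u = np.einsum('id,ijd->ij', score, U)    # s(x_i) . (x_i - x_j)
    sy_u = np.einsum('jd,ijd->ij', score, U)    # s(x_j) . (x_i - x_j)
    return K * (d / sigma**2 - sq / sigma**4
                + (sx_u - sy_u) / sigma**2 + score @ score.T)

# Example: target P = N(0, I) has score(x) = -x, so the discrepancy to
# P is computable from the sample alone -- no samples from P needed.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
ksd = np.sqrt(max(stein_kernel(X, -X).mean(), 0.0))
```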
Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on \(n\) sample points. However, existing kernel tests either run in \(n^2\) time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each \(n\)-point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
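A miniature of the compress-then-test recipe, with plain subsampling standing in for the paper's high-fidelity coresets, so it illustrates the interface rather than the paper's power guarantees.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased squared-MMD estimate with a Gaussian kernel."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def compress_then_test(X, Y, n_perm=200, sigma=1.0, seed=None):
    """CTT-style sketch: shrink each (n, d) sample to ~sqrt(n) points
    (plain subsampling stands in for the paper's coresets), then run a
    permutation MMD test on the compressed samples. Returns a Monte
    Carlo p-value."""
    rng = np.random.default_rng(seed)
    m = int(np.sqrt(min(len(X), len(Y))))
    Xc = X[rng.choice(len(X), m, replace=False)]
    Yc = Y[rng.choice(len(Y), m, replace=False)]
    stat = mmd2(Xc, Yc, sigma)
    Z = np.vstack([Xc, Yc])
    null = []
    for _ in range(n_perm):
        p = rng.permutation(2 * m)
        null.append(mmd2(Z[p[:m]], Z[p[m:]], sigma))
    return (1 + np.sum(np.array(null) >= stat)) / (1 + n_perm)
```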