Misapplication of statistical data analysis is a common cause of spurious discoveries in scientific research. Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. In common practice, however, data analysis is an intrinsically adaptive process, with new analyses generated on the basis of data exploration, as well as the results of previous analyses on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from privacy-preserving data analysis. As an application, we show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses.
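In the published version of this work, the holdout-reuse mechanism is the Thresholdout algorithm. The following is a minimal illustrative sketch, not the paper's exact procedure: the threshold, noise scale, and omission of the overfitting budget are simplifying assumptions.

```python
import numpy as np

def reusable_holdout(train, holdout, phi, threshold=0.04, sigma=0.01, rng=None):
    """Thresholdout-style reusable holdout (illustrative sketch).

    phi: a function mapping one data point to a value in [0, 1].
    Answers a mean query so that the holdout can be consulted for
    many adaptively chosen queries without overfitting to it.
    """
    rng = rng or np.random.default_rng()
    t = np.mean([phi(x) for x in train])    # training-set estimate
    h = np.mean([phi(x) for x in holdout])  # holdout-set estimate
    # Consult the holdout only when the two estimates disagree by more
    # than a noisy threshold; otherwise answer from the training set,
    # which leaks nothing about the holdout.
    if abs(t - h) > threshold + rng.laplace(0.0, sigma):
        return h + rng.laplace(0.0, sigma)  # noisy holdout answer
    return t
```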
Exposed! A Survey of Attacks on Private Data
Dwork, Cynthia; Smith, Adam; Steinke, Thomas; ...
Annual Review of Statistics and Its Application, 03/2017, Volume 4, Issue 1
Journal Article
Peer reviewed
Privacy-preserving statistical data analysis addresses the general question of protecting privacy when publicly releasing information about a sensitive dataset. A privacy attack takes seemingly innocuous released information and uses it to discern the private details of individuals, thus demonstrating that such information compromises privacy. For example, re-identification attacks have shown that it is easy to link supposedly de-identified records to the identity of the individual concerned. This survey focuses on attacking aggregate data, such as statistics about how many individuals have a certain disease, genetic trait, or combination thereof. We consider two types of attacks: reconstruction attacks, which approximately determine a sensitive feature of all the individuals covered by the dataset, and tracing attacks, which determine whether or not a target individual's data are included in the dataset. We also discuss techniques from the differential privacy literature for releasing approximate aggregate statistics while provably thwarting any privacy attack.
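To make the tracing-attack idea concrete, here is a hedged sketch of the classic inner-product test statistic used in this literature; the function name and the thresholding procedure in the comment are our own illustration, not the survey's notation.

```python
import numpy as np

def tracing_statistic(target, released_means, reference_means):
    """Inner-product tracing test statistic (illustrative sketch).

    target:          the candidate individual's 0/1 attribute vector.
    released_means:  per-attribute means released about the dataset.
    reference_means: per-attribute means of the general population.
    Large positive values are evidence that target is in the dataset.
    """
    y = np.asarray(target, dtype=float)
    q = np.asarray(released_means, dtype=float)
    p = np.asarray(reference_means, dtype=float)
    return float(np.dot(y - p, q - p))

# In practice the statistic is compared to its null distribution
# (target NOT in the dataset), e.g. estimated by computing the same
# statistic for fresh reference individuals and thresholding at an
# empirical quantile.
```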
Differential privacy is at a turning point. Implementations have been successfully leveraged in private industry, the public sector, and academia in a wide variety of applications, allowing scientists, engineers, and researchers to learn about populations of interest without learning about the specific individuals within them. Because differential privacy allows us to quantify cumulative privacy loss, these differentially private systems will, for the first time, allow us to measure and compare the total privacy loss due to these personal data-intensive activities. Appropriately leveraged, this could be a watershed moment for privacy.
Like other technologies and techniques that allow for a range of instantiations, implementation details matter. When meaningfully implemented, differential privacy supports deep data-driven insights with minimal worst-case privacy loss. When not meaningfully implemented, it delivers privacy mostly in name. Using differential privacy to maximize learning while providing a meaningful degree of privacy requires judicious choices with respect to the privacy parameter epsilon, among other factors. However, there is little understanding of what the optimal value of epsilon is for a given system, or for given classes of systems, purposes, or data, or of how to go about determining it.
To understand current differential privacy implementations and how organizations make these key choices in practice, we conducted interviews with practitioners to learn from their experiences of implementing differential privacy. We found no clear consensus on how to choose epsilon, nor is there agreement on how to approach this and other key implementation decisions. Given the importance of these implementation details, there is a need for shared learning amongst the differential privacy community. To serve this need, we propose the creation of the Epsilon Registry, a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.
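As background on the cumulative privacy loss this abstract refers to: under basic composition, the epsilons of successive differentially private releases simply add, which is what makes a total loss measurable at all. Below is a minimal sketch of a budget accountant built on that fact; the class and method names are hypothetical, and deployed systems typically use tighter accounting.

```python
class BasicCompositionAccountant:
    """Track cumulative privacy loss under basic composition, where the
    epsilons of successive releases simply add. (Hypothetical names;
    a sketch, not a production accountant.)"""

    def __init__(self, budget):
        self.budget = budget  # total epsilon the system may spend
        self.spent = 0.0

    def spend(self, epsilon):
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

acct = BasicCompositionAccountant(budget=1.0)
acct.spend(0.1)   # one release at epsilon = 0.1
acct.spend(0.25)  # a second release; cumulative loss is now 0.35
```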
We continue a line of research initiated in [10, 11] on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user.
Previous work focused on the case of noisy sums, in which f = Σ_i g(x_i), where x_i denotes the ith row of the database and g maps database rows to the interval [0, 1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount by which changing any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case.
The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive ones.
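The calibration described in this abstract is what is now called the Laplace mechanism: if Δf is the sensitivity of f (the most a single row change can move its output), then releasing f(D) plus Laplace noise of scale Δf/ε satisfies ε-differential privacy. A minimal sketch, with function names our own:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release true_answer perturbed by Laplace noise of scale
    sensitivity / epsilon, the calibration proved private in the paper."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

# Example: a noisy-sum query f = sum_i g(x_i) with g mapping rows into
# [0, 1] has sensitivity 1, since changing one row moves the sum by at
# most 1.
rng = np.random.default_rng(0)
data = rng.random(1000)  # stand-in for g applied to each database row
noisy_sum = laplace_mechanism(data.sum(), sensitivity=1.0, epsilon=0.5, rng=rng)
```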
The Journal of Privacy and Confidentiality (JPC) is the only journal to actively solicit contributions from the multi-faceted community of researchers and practitioners for whom privacy is a primary intellectual or operational concern, for dissemination across this broad community. This includes computer scientists, statisticians, lawyers, social scientists, policy-makers, health researchers, survey designers, and data-rich corporate players. While not every publication is aimed so broadly, the Journal aims to provide a common forum for all these constituent populations.
With the publication of the current issue we re-launch the Journal of Privacy and Confidentiality. We reaffirm our dedication to drawing from multiple disciplines in which privacy and confidentiality are of primary intellectual and operational concern, and to maintaining our status as an open access journal providing a forum for communication across and between these disciplines.
Differential Privacy
Dwork, Cynthia
Automata, Languages and Programming, 2006
Book Chapter, Conference Proceeding
Peer reviewed
In 1977 Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius’ goal along the lines of semantic security cannot be achieved. Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs suggests a new measure, differential privacy, which, intuitively, captures the increased risk to one’s privacy incurred by participating in a database. The techniques developed in a sequence of papers [8, 13, 3], culminating in those described in [12], can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.
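For reference, the measure introduced here is now standardly formalized as follows (this is the textbook definition from the differential privacy literature, not text from this chapter): a randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of databases $D$ and $D'$ differing in at most one row and every set $S$ of possible outputs,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].$$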
Over the past five years a new approach to privacy-preserving data analysis has borne fruit [13, 18, 7, 19, 5, 37, 35, 8, 32]. This approach differs from much (but not all!) of the related literature in the statistics, databases, theory, and cryptography communities, in that a formal and ad omnia privacy guarantee is defined, and the data analysis techniques presented are rigorously proved to satisfy the guarantee. The key privacy guarantee that has emerged is differential privacy. Roughly speaking, this ensures that (almost, and quantifiably) no risk is incurred by joining a statistical database.
In this survey, we recall the definition of differential privacy and two basic techniques for achieving it. We then show some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.
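The abstract does not name the two basic techniques; in the standard treatment they are the Laplace mechanism (sketched earlier in this listing) and the exponential mechanism of McSherry and Talwar. A minimal sketch of the latter over a finite candidate set, with all names our own:

```python
import numpy as np

def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng=None):
    """Exponential mechanism sketch over a finite candidate set.

    Samples a candidate with probability proportional to
    exp(epsilon * quality(candidate) / (2 * sensitivity)), where
    sensitivity bounds how much one row change can move any quality score.
    """
    rng = rng or np.random.default_rng()
    scores = np.array([quality(c) for c in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```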
Differential privacy provides a rigorous framework for privacy-preserving data analysis. This paper proposes the first differentially private procedure for controlling the false discovery rate (FDR) in multiple hypothesis testing. Inspired by the Benjamini-Hochberg procedure (BHq), our approach is to repeatedly add noise to the logarithms of the p-values to ensure differential privacy, and at each iteration to select an approximately smallest p-value as a promising candidate; the selected p-values are then supplied to BHq, and our private procedure releases only the rejected ones. Moreover, we develop a new technique, based on a backward submartingale, for proving FDR control of a broad class of multiple testing procedures, including our private procedure and both the BHq step-up and step-down procedures. As a novel aspect, the proof works for arbitrary dependence between the true null and false null test statistics, while FDR control is maintained up to a small multiplicative factor.
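A hedged sketch of the selection step described above: at each iteration, Laplace noise is added to the logarithms of the remaining p-values and the approximately smallest one is "peeled off". The noise scale below is illustrative only; the paper's calibration and its hand-off to BHq are more involved.

```python
import numpy as np

def peel_noisy_smallest(p_values, m, epsilon, rng=None):
    """Select m approximately smallest p-values by adding Laplace noise
    to their logarithms and repeatedly taking the noisy minimum
    (a report-noisy-min "peeling" sketch).

    Returns candidate indices; in the paper these candidates are then
    passed to the BHq procedure, and only rejections are released.
    """
    rng = rng or np.random.default_rng()
    logs = np.log(np.asarray(p_values, dtype=float))
    remaining = list(range(len(logs)))
    selected = []
    for _ in range(m):
        noise = rng.laplace(0.0, 1.0 / epsilon, size=len(remaining))
        noisy = logs[remaining] + noise
        winner = remaining[int(np.argmin(noisy))]
        selected.append(winner)
        remaining.remove(winner)
    return selected
```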