We formulate and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes, illustrated by outlier detection. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
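The abstract describes the paradigm only in general terms; the following is a minimal sketch of the idea under stated assumptions, with a toy quality metric, synthetic degradation by masking bases, and a z-score rule for outliers (the function names, rates, and threshold are illustrative, not taken from the paper).

```python
import random
import statistics

def quality_score(seq: str) -> float:
    """Toy quality metric: fraction of unambiguous bases (illustrative only)."""
    return sum(base in "ACGT" for base in seq) / len(seq)

def degrade(seq: str, rate: float) -> str:
    """Intentionally degrade a sequence by masking a fraction of bases as 'N'."""
    return "".join("N" if random.random() < rate else base for base in seq)

def fragility(seq: str, rates=(0.01, 0.05, 0.10)) -> float:
    """Mean drop in quality under increasing degradation; higher-quality
    genomes are expected to show larger drops (the paper's rationale)."""
    base_quality = quality_score(seq)
    return statistics.mean(base_quality - quality_score(degrade(seq, r)) for r in rates)

def flag_outliers(genomes: dict[str, str], z_cut: float = 3.0) -> list[str]:
    """Flag genomes whose fragility is far from the collection's typical value."""
    scores = {name: fragility(seq) for name, seq in genomes.items()}
    mu = statistics.mean(scores.values())
    sd = statistics.pstdev(scores.values()) or 1.0
    return [name for name, s in scores.items() if abs(s - mu) / sd > z_cut]
```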
Researchers apply sampling weights to account for unequal sample selection probabilities, frame coverage errors, and nonresponse. If researchers do not weight when appropriate, they risk having biased estimates. Alternatively, when they unnecessarily apply weights, they can create an inefficient estimator without reducing bias. Yet in practice researchers rarely test the necessity of weighting and are sometimes guided more by the current practice in their field than by scientific evidence. In addition, statistical tests for weighting are not widely known or available. This article reviews empirical tests to determine whether weighted analyses are justified. We focus on regression models, though the review's implications extend beyond regression. We find that nearly all weighting tests fall into two categories: difference-in-coefficients tests and weight-association tests. We describe the distinguishing features of each category, present their properties, and explain the close relationship between them. We review the simulation evidence on their sampling properties in finite samples. Finally, we highlight the unanswered theoretical and practical questions that surround these tests and that deserve further research.
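As a concrete illustration of the weight-association category, here is a minimal sketch in the spirit of a DuMouchel-Duncan style test: augment the unweighted regression with weight-by-covariate interactions and F-test their joint significance. The data frame layout and column names are assumptions for the example, not from the article.

```python
import pandas as pd
import statsmodels.api as sm

def weight_association_test(df: pd.DataFrame, y: str, xs: list[str], w: str):
    """Compare the unweighted regression with a model that adds the weight and
    weight-by-covariate interactions; a small p-value suggests weighting matters."""
    X0 = sm.add_constant(df[xs])
    base = sm.OLS(df[y], X0).fit()            # restricted (unweighted) model

    Xw = X0.copy()
    Xw[w] = df[w]
    for x in xs:                              # weight x covariate interactions
        Xw[f"{w}:{x}"] = df[w] * df[x]
    full = sm.OLS(df[y], Xw).fit()            # augmented model

    f_stat, p_value, df_diff = full.compare_f_test(base)
    return f_stat, p_value, df_diff
```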
This paper is an attempt to understand the processes by which software ages. We define code to be aged or decayed if its structure makes it unnecessarily difficult to understand or change, and we measure the extent of decay by counting the number of faults in code in a period of time. Using change management data from a very large, long-lived software system, we explore the extent to which measurements from the change history are successful in predicting the distribution over modules of these incidences of faults. In general, process measures based on the change history are more useful in predicting fault rates than product metrics of the code: for instance, the number of times code has been changed is a better indication of how many faults it will contain than is its length. We also compare the fault rates of code of various ages, finding that if a module is, on the average, a year older than an otherwise similar module, the older module will have roughly a third fewer faults. Our most successful model measures the fault potential of a module as the sum of contributions from all of the times the module has been changed, with large, recent changes receiving the most weight.
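The preferred model is described only verbally in the abstract; a minimal sketch of that kind of weighting scheme, assuming a change-history table with module, date, and lines_changed columns (the column names and the exponential half-life are assumptions, not the paper's fitted form), might look like this:

```python
import numpy as np
import pandas as pd

def fault_potential(changes: pd.DataFrame, now: pd.Timestamp,
                    half_life_days: float = 365.0) -> pd.Series:
    """Score each module as a sum over its past changes, with larger and more
    recent changes contributing more (exponential downweighting by age).
    Expects columns: module, date, lines_changed."""
    age_days = (now - changes["date"]).dt.days
    decay = np.exp(-np.log(2) * age_days / half_life_days)
    contribution = changes["lines_changed"] * decay
    return contribution.groupby(changes["module"]).sum().sort_values(ascending=False)
```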
Reminiscences of Steve Fienberg. Karr, Alan F.
The Journal of Privacy and Confidentiality, 11/2018, Volume 8, Issue 1
Journal Article, Peer reviewed, Open access
My professional relationship with Steve began in the early 1990s, when I came to NISS as Associate Director and he was a member of the Board of Trustees. We sometimes disagreed, or perhaps more accurately, I failed to grasp his wisdom. Something must have worked, though, because Steve also chaired the committee that selected me to be Director of NISS.
Our scientific collaboration arose in the late 1990s, when I was PI, and he co-PI, on two grants from NSF's Digital Government initiative. These grants, like the entire collaboration, stemmed from Steve's fervent belief that deep mathematics can be brought to bear on pressing personal and societal problems. The first had to do with web-based query systems now known as restricted data access systems (RDAS), and specifically with table servers. We were frontiersmen together in formulating and applying risk-utility frontiers, released table frontiers, and unreleasable table frontiers.
With his usual prescience, Steve knew before data breaches were daily news that privacy and confidentiality are major concerns. We wrote only a few papers together, but we exchanged sometimes wildly complementary ideas for more than twenty years. I still remember a meeting with a number of federal statistical agencies at which what I proposed as a risk measure was exactly what Steve construed as a utility measure.
From the science grew a multi-year, multi-continent friendship that drew in Joyce and Senora as well. It mattered not whether the last encounter was three weeks or three years ago. Sadly, only one of the four of us now remains, but in keeping with the advice of Dr. Seuss, instead of crying because it ended, I smile because it happened.
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
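The hierarchical model itself is beyond a short example, but the kind of edit rules it conditions on is easy to illustrate. A minimal sketch of constraint checking for a balance equation and an expert-specified ratio bound follows; the column names, bounds, and tolerance are hypothetical, and records flagged here would be the ones entering error localization and imputation.

```python
import pandas as pd

def violates_edits(df: pd.DataFrame, tol: float = 1e-6) -> pd.Series:
    """Flag records that break the linear edit rules: cost components must sum
    to the reported total, and wages per employee must stay within expert bounds."""
    balance_ok = (df["wages"] + df["materials"] + df["other"] - df["total_cost"]).abs() <= tol
    ratio_ok = (df["wages"] / df["employees"]).between(5_000, 500_000)  # illustrative bounds
    return ~(balance_ok & ratio_ok)   # True = record needs edit-imputation
```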
A central feature of the evolution of large software systems is that change, which is necessary to add new functionality, accommodate new hardware, and repair faults, becomes increasingly difficult over time. We approach this phenomenon, which we term code decay, scientifically and statistically. We define code decay and propose a number of measurements (code decay indices), on software and on the organizations that produce it, that serve as symptoms, risk factors, and predictors of decay. Using an unusually rich data set (the fifteen-plus year change history of the millions of lines of software for a telephone switching system), we find mixed, but on the whole persuasive, statistical evidence of code decay, which is corroborated by developers of the code. Suggestive indications that perfective maintenance can retard code decay are also discussed.
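The specific indices are not listed in the abstract; as one plausible example of the kind of symptom such an index could capture, the sketch below tracks whether changes must touch more and more files over time. The index choice and column names are assumptions for illustration, not the paper's definitions.

```python
import pandas as pd

def change_span_trend(changes: pd.DataFrame) -> pd.Series:
    """One candidate decay symptom: the mean number of files each change must
    touch, by year. Expects columns: change_id, date, file."""
    span = changes.groupby("change_id").agg(
        files=("file", "nunique"),
        year=("date", lambda d: d.iloc[0].year),
    )
    return span.groupby("year")["files"].mean()
```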
Data Sharing and Access. Karr, Alan F.
Annual Review of Statistics and Its Application, 06/2016, Volume 3, Issue 1
Journal Article, Peer reviewed, Open access
Data sharing and access are venerable problems embedded in a rapidly changing milieu. Pressure points include the increasingly data-driven nature of science; the volume, complexity, and distributed nature of data; new concerns regarding privacy and confidentiality; and rising attention to reproducibility of research. In the context of research data, this review surveys extant technologies, articulates a number of identified and emerging issues, and outlines one path for the future. Recognizing that data availability is a public good, research data archives can provide economic and scientific value to both data generators and data consumers in a way that engenders trust. The overall framework is statistical: the use of data for inference.
For researchers and public health agencies, the complexity of high-dimensional spatio-temporal data in surveillance for large reporting networks presents numerous challenges, which include low signal-to-noise ratios, spatial and temporal dependencies, and the need to characterize uncertainties. Central to the problem in the context of disease outbreaks is a decision structure that requires trading off false positives against delayed detections.
In this paper we apply a previously developed Bayesian hierarchical model to a data set from the Indiana Public Health Emergency Surveillance System (PHESS) containing three years of emergency department visits for influenza-like illness and respiratory illness. Among the issues requiring attention were selection of the underlying network (too few nodes attenuate important structure, while too many impose barriers to both modeling and computation); ensuring that confidentiality protections in the data do not impede modeling of important day-of-week effects; and evaluating the performance of the model.
Our results show that the model captures salient spatio-temporal dynamics that are present in public health surveillance data sets, and that it appears to detect both "annual" and "atypical" outbreaks in a timely, accurate manner. We present maps that help make model output accessible and comprehensible to public health authorities. We use an illustrative family of decision rules to show how output from the model can be used to inform false positive-delayed detection tradeoffs.
The advantages of our methodology for addressing the complicated issues of real-world surveillance data applications are threefold. First, we can easily incorporate additional covariate information and spatio-temporal dynamics in the data. Second, we furnish a unified framework to provide uncertainties associated with each parameter. Third, we are able to handle multiplicity issues by using a Bayesian approach. The urgent need to quickly and effectively monitor the health of the public makes our methodology a plausible and potentially useful surveillance approach for health professionals.
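The decision rules are only described in general terms; a minimal sketch of one illustrative family of rules follows, flagging a node when a posterior probability of an elevated rate stays above a threshold for a run of consecutive days. The probabilities would come from the fitted hierarchical model, and the threshold and run length are exactly the tuning knobs that trade false positives against delayed detection; both names and defaults here are assumptions.

```python
import numpy as np

def alarm_days(post_prob: np.ndarray, threshold: float = 0.9, run: int = 2) -> np.ndarray:
    """Given a time series of posterior outbreak probabilities for one node,
    return the days on which an alarm fires: probability above `threshold`
    for `run` consecutive days. Raising either parameter reduces false
    positives but delays detection."""
    above = post_prob > threshold
    run_len = np.zeros(len(above), dtype=int)
    for t, flag in enumerate(above):
        # Length of the current run of above-threshold days ending at day t.
        run_len[t] = run_len[t - 1] + 1 if flag and t > 0 else int(flag)
    return np.where(run_len >= run)[0]
```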
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably, some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.
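The specific intruder attacks evaluated are not reproduced here; as a generic example of that flavor of assessment, the sketch below has an intruder link each synthetic record to its nearest confidential record on a few quasi-identifiers, and reports how often the link is correct. The variable choices and the row-alignment assumption are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def matching_attack_risk(confidential: pd.DataFrame, synthetic: pd.DataFrame,
                         quasi_ids: list[str]) -> float:
    """Fraction of synthetic records whose nearest confidential neighbor on the
    quasi-identifiers is the record they were generated from (lower is safer).
    Assumes the two frames are row-aligned for evaluation purposes."""
    nn = NearestNeighbors(n_neighbors=1).fit(confidential[quasi_ids].to_numpy())
    _, idx = nn.kneighbors(synthetic[quasi_ids].to_numpy())
    return float(np.mean(idx.ravel() == np.arange(len(synthetic))))
```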
Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.
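In the paper the Hit-and-Run sampler is embedded in a full posterior; as a standalone illustration of why it keeps every draw inside the constraints, here is a minimal sketch that samples from a bounded polytope of the form A x <= b with a uniform target (the uniform target and a strictly feasible starting point x0 are assumptions made purely for the example).

```python
import numpy as np

def hit_and_run(A: np.ndarray, b: np.ndarray, x0: np.ndarray,
                n_samples: int, rng=None) -> np.ndarray:
    """Hit-and-Run over the bounded polytope {x : A x <= b}. Each step picks a
    random direction, finds the feasible segment along that direction, and
    samples a point on it, so every draw satisfies the constraints by construction."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)          # must be strictly feasible
    out = np.empty((n_samples, x.size))
    for i in range(n_samples):
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)
        # Feasible step sizes t must keep A(x + t d) <= b, i.e. t * (A d) <= b - A x.
        slack = b - A @ x
        rates = A @ d
        t_upper = np.min(slack[rates > 0] / rates[rates > 0])
        t_lower = np.max(slack[rates < 0] / rates[rates < 0])
        x = x + rng.uniform(t_lower, t_upper) * d
        out[i] = x
    return out
```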