Abstract
Background
Missing data are unavoidable in epidemiological research, potentially leading to bias and loss of precision. Multiple imputation (MI) is widely advocated as an improvement over complete case analysis (CCA). However, contrary to widespread belief, CCA is preferable to MI in some situations.
Methods
We provide guidance on choice of analysis when data are incomplete. Using causal diagrams to depict missingness mechanisms, we describe when CCA will not be biased by missing data and compare MI and CCA, with respect to bias and efficiency, in a range of missing data situations. We illustrate selection of an appropriate method in practice.
Results
For most regression models, CCA gives unbiased results when the chance of being a complete case does not depend on the outcome after taking the covariates into consideration, which includes situations where data are missing not at random. Consequently, there are situations in which CCA analyses are unbiased while MI analyses, assuming missing at random (MAR), are biased. By contrast, MI, unlike CCA, is valid for all MAR situations and has the potential to use information contained in the incomplete cases and auxiliary variables to reduce bias and/or improve precision. For this reason, MI was preferred over CCA in our real data example.
Conclusions
Choice of method for dealing with missing data is crucial for validity of conclusions, and should be based on careful consideration of the reasons for the missing data, missing data patterns and the availability of auxiliary information.
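The key result above, that CCA can remain unbiased when the chance of being a complete case depends on the covariates but not the outcome, can be illustrated with a small hypothetical simulation. Everything here (the sample size, coefficients, and missingness probabilities) is invented for illustration and is not from the paper.

```python
# Hypothetical simulation: complete case analysis (CCA) remains
# approximately unbiased for a regression slope when the probability of
# being a complete case depends only on the covariate, not the outcome.
import random

random.seed(42)


def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


true_slope = 2.0
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_slope * xi + random.gauss(0, 1) for xi in x]

# Missingness depends on x only: larger x -> more likely missing.
observed = [random.random() > 0.3 + 0.4 * (xi > 0) for xi in x]

cc_x = [xi for xi, o in zip(x, observed) if o]
cc_y = [yi for yi, o in zip(y, observed) if o]

slope_cc = ols_slope(cc_x, cc_y)  # close to the true slope of 2.0
```

Dropping incomplete cases here throws away roughly half the sample, so precision suffers, but the slope estimate is not systematically distorted because missingness carries no information about the outcome beyond the covariate.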
Semi‐continuous variables are characterized by a point mass at one value and a continuous range of values for the remaining observations. An example is alcohol consumption quantity, with a spike of zeros representing non‐drinkers and positive values for drinkers. If multiple imputation is used to handle missing values for semi‐continuous variables, it is unclear how this should be implemented within the standard approaches of fully conditional specification (FCS) and multivariate normal imputation (MVNI). This question is brought into focus by the use of categorized versions of semi‐continuous exposure variables in analyses (e.g., no drinking, drinking below binge level, binge drinking, heavy binge drinking), raising the question of how best to achieve congeniality between imputation and analysis models. We performed a simulation study comparing nine approaches for imputing semi‐continuous exposures requiring categorization for analysis. Three methods imputed the categories directly: ordinal logistic regression, and imputation of binary indicator variables representing the categories using MVNI (with two variants). Six methods (predictive mean matching, zero‐inflated binomial imputation, and two‐part imputation methods with variants in FCS and MVNI) imputed the semi‐continuous variable, with categories derived after imputation. The ordinal and zero‐inflated binomial methods had good performance across most scenarios, while MVNI methods requiring rounding after imputation did not perform well. There were mixed results for predictive mean matching and the two‐part methods, depending on whether the estimands were proportions or regression coefficients. The results highlight the need to consider the parameter of interest when selecting an imputation procedure.
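The two‐part idea, imputing the zero/positive indicator first and a positive value second, with categories derived only afterwards, can be sketched on toy data. This is a deliberately simplified hot‐deck style sketch with hypothetical cut‐points; the methods evaluated in the study condition on covariates within FCS or MVNI, which this sketch does not.

```python
# Simplified two-part imputation sketch for a semi-continuous variable
# (hypothetical data and category cut-points; a real two-part FCS model
# would condition on covariates). Part 1 imputes the zero/positive
# indicator; part 2 draws a positive donor value; analysis categories
# are derived only after imputation.
import random

random.seed(1)

# Observed weekly drinks: a spike at zero plus a continuous positive part.
observed = [0.0] * 40 + [random.uniform(0.5, 30.0) for _ in range(60)]
n_missing = 25

positives = [v for v in observed if v > 0.0]
p_zero = sum(1 for v in observed if v == 0.0) / len(observed)

imputed = []
for _ in range(n_missing):
    if random.random() < p_zero:           # part 1: zero vs positive
        imputed.append(0.0)
    else:                                  # part 2: draw a donor positive value
        imputed.append(random.choice(positives))


def categorize(v):
    """Derive analysis categories after imputation (hypothetical cut-points)."""
    if v == 0.0:
        return "none"
    elif v < 7:
        return "below binge"
    elif v < 14:
        return "binge"
    return "heavy binge"


cats = [categorize(v) for v in observed + imputed]
```

Imputing on the semi‐continuous scale and categorizing afterwards, as six of the nine compared methods do, keeps the imputation model faithful to the spike‐plus‐continuum shape of the data rather than forcing it through a normal model that needs rounding.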
In patient follow‐up studies, events of interest may take place between periodic clinical assessments and so the exact time of onset is not observed. Such events are known as “bounded” or “interval‐censored.” Methods for handling such events can be categorized as either (i) applying multiple imputation (MI) strategies or (ii) taking a full likelihood‐based (LB) approach. We focused on MI strategies, rather than LB methods, because of their flexibility. We evaluated MI strategies for bounded event times in a competing risks analysis, examining the extent to which interval boundaries, features of the data distribution and substantive analysis model are accounted for in the imputation model. Candidate imputation models were predictive mean matching (PMM); log‐normal regression with postimputation back‐transformation; normal regression with and without restrictions on the imputed values and Delord and Genin's method based on sampling from the cumulative incidence function. We used a simulation study to compare MI methods and one LB method when data were missing at random and missing not at random, also varying the proportion of missing data, and then applied the methods to a hematopoietic stem cell transplantation dataset. We found that cumulative incidence and median event time estimation were sensitive to model misspecification. In a competing risks analysis, we found that it is more important to account for features of the data distribution than to restrict imputed values based on interval boundaries or to ensure compatibility with the substantive analysis by sampling from the cumulative incidence function. We recommend MI by type 1 PMM.
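The recommended PMM approach can be sketched in a few lines: fit a regression on the complete cases, predict for the cases with a missing value, and impute each with the observed value of a randomly chosen donor among the k cases with the closest predicted means. The data, model, and k below are hypothetical, and this single‐imputation sketch omits the between‐imputation draws a full MI procedure would add.

```python
# Minimal predictive mean matching (PMM) sketch on hypothetical data.
import random

random.seed(7)


def ols_fit(xs, ys):
    """Return intercept and slope of a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Complete cases (x, y) and cases with y missing.
x_obs = [random.uniform(0, 10) for _ in range(200)]
y_obs = [2 * xi + random.gauss(0, 1) for xi in x_obs]
x_mis = [random.uniform(0, 10) for _ in range(20)]

a, b = ols_fit(x_obs, y_obs)
pred_obs = [a + b * xi for xi in x_obs]

k = 5
y_imp = []
for xm in x_mis:
    pm = a + b * xm
    # indices of the k donors whose predicted means are closest
    donors = sorted(range(len(x_obs)), key=lambda i: abs(pred_obs[i] - pm))[:k]
    y_imp.append(y_obs[random.choice(donors)])
```

Because every imputed value is an actually observed value, PMM automatically respects the support of the data, one reason it can cope with distributional features that a normal imputation model misses.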
Inferring causal effects of treatments is a central goal in many disciplines. The potential outcomes framework is a main statistical approach to causal inference, in which a causal effect is defined as a comparison of the potential outcomes of the same units under different treatment conditions. Because for each unit at most one of the potential outcomes is observed and the rest are missing, causal inference is inherently a missing data problem. Indeed, there is a close analogy in the terminology and the inferential framework between causal inference and missing data. Despite the intrinsic connection between the two subjects, statistical analyses of causal inference and missing data also have marked differences in aims, settings and methods. This article provides a systematic review of causal inference from the missing data perspective. Focusing on ignorable treatment assignment mechanisms, we discuss a wide range of causal inference methods that have analogues in missing data analysis, such as imputation, inverse probability weighting and doubly robust methods. Under each of the three modes of inference—Frequentist, Bayesian and Fisherian randomization—we present the general structure of inference for both finite-sample and super-population estimands, and illustrate via specific examples. We identify open questions to motivate more research to bridge the two fields.
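Of the methods named above, inverse probability weighting has the most compact form: under ignorable assignment, weighting each treated outcome by the inverse propensity score, and each control outcome by the inverse of one minus it, recovers the average treatment effect. The simulation below is a hypothetical illustration using a known propensity score; in practice the propensity would be estimated.

```python
# Minimal sketch of the inverse probability weighting (IPW) estimator of
# an average treatment effect under ignorable assignment, with a known
# propensity score (hypothetical simulated design).
import random

random.seed(3)

n = 50000
tau = 1.5  # true average treatment effect
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    e = 0.3 + 0.4 * (x > 0)            # propensity score e(x), known here
    t = 1 if random.random() < e else 0
    y = x + tau * t + random.gauss(0, 1)
    data.append((t, y, e))

# Horvitz-Thompson style IPW estimate: E[TY/e] - E[(1-T)Y/(1-e)]
ipw = sum(t * y / e for t, y, e in data) / n \
    - sum((1 - t) * y / (1 - e) for t, y, e in data) / n  # close to tau
```

The same reweighting logic, applied to the probability of being observed rather than the probability of being treated, is the missing data analogue the review draws out; doubly robust estimators then combine this weighting with an outcome (imputation) model so that either one being correct suffices.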
Living in socioeconomic disadvantage has been conceptualised as a chronic stressor, although this contradicts evidence from studies using hair cortisol and cortisone as measures of hypothalamus-pituitary-adrenal (HPA) axis activity. These studies used complete case analyses, ignoring the impact of missing data on inference, despite the high proportion of missing biomarker data. This study addresses the methodological limitations of studies investigating the association between socioeconomic position (SEP), defined as education, wealth, and social class, and hair cortisol and cortisone by comparing three common methods for dealing with missing data: (1) complete case analysis (CCA), (2) inverse probability weighting (IPW), and (3) weighted multiple imputation (MI). It examines whether socioeconomic disadvantage is associated with higher levels of HPA axis activity, as measured by hair cortisol and cortisone, among older adults, using these three approaches for compensating for missing data.
Cortisol and cortisone levels in hair samples from 4573 participants in the 6th wave (2012–2013) of the English Longitudinal Study of Ageing (ELSA) were examined in relation to education, wealth, and social class. We compared complete case linear regression models with weighted and multiply imputed weighted linear regression models.
Participants with certain characteristics (ethnic minority background, routine and manual occupations, physical inactivity, poorer health, and smoking) were less likely to have hair cortisol and hair cortisone data than the most advantaged groups. We found a consistent pattern of higher levels of hair cortisol and cortisone among the most socioeconomically disadvantaged groups compared with the most advantaged groups. Complete case approaches to missing data underestimated the levels of hair cortisol in education and social class, and the levels of hair cortisone in education, wealth, and social class, in the most disadvantaged groups.
This study demonstrates that social disadvantage, as measured by disadvantaged SEP, is associated with increased HPA axis activity. The conceptualisation of social disadvantage as a chronic stressor may be valid, and previous studies reporting no associations between SEP and hair cortisol may be biased by their lack of consideration of missing data, given the underrepresentation of disadvantaged social groups in those analyses. Future analyses using biosocial data may need to consider and adjust for missing data.
• Studies challenge the idea that social disadvantage is a chronic stressor.
• Studies relied on complete case analyses and overlooked the impact of missing data.
• We observed higher levels of cortisol and cortisone in socially disadvantaged groups.
• Complete case analyses underestimated the socioeconomic effects on chronic stress.
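The mechanism behind the study's headline finding, that CCA underestimated biomarker levels in disadvantaged groups while IPW corrected for it, can be reproduced in a toy simulation. All numbers below are hypothetical, and the response probabilities are taken as known, whereas the study estimated them from a response model.

```python
# Hypothetical sketch of inverse probability weighting for missing
# outcome data: complete cases are weighted by the inverse of their
# (here known) probability of being observed. Disadvantaged participants
# have higher outcome values but are less often observed, so unweighted
# complete case analysis underestimates the overall mean.
import random

random.seed(11)

n = 50000
rows = []
for _ in range(n):
    disadvantaged = random.random() < 0.5
    y = random.gauss(10.0 + 2.0 * disadvantaged, 2.0)  # higher mean if disadvantaged
    p_obs = 0.4 if disadvantaged else 0.8              # less often observed
    obs = random.random() < p_obs
    rows.append((y, p_obs, obs))

cc = [(y, p) for y, p, o in rows if o]
mean_cca = sum(y for y, _ in cc) / len(cc)                       # biased low
mean_ipw = sum(y / p for y, p in cc) / sum(1 / p for _, p in cc)  # ~ true mean 11.0
```

Upweighting the under‐represented complete cases restores the contribution of the disadvantaged group, which is exactly why the weighted analyses in the study revealed associations that CCA attenuated.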
Hidradenitis suppurativa (HS) is a complex inflammatory skin condition affecting 0.1–4% of the population that leads to permanent scarring in the axilla, inframammary region, groin, and buttocks. Its complex pathogenesis involves genetics, innate and adaptive immunity, microbiota, and environmental stimuli. Specific populations have a higher incidence of HS, including females and Black individuals and those with associated comorbidities. HS registries and biobanks have set standards for the documentation of clinical data in the context of clinical trials and outcomes research, but collection, documentation, and reporting of these important clinical and demographic variables are uncommon in HS laboratory research studies. Standardization in the laboratory setting is needed because it helps to elucidate the factors that contribute mechanistically to HS symptoms and pathophysiology. The purpose of this article is to begin to set the stage for standardized reporting in the laboratory setting. We discuss how clinical guidelines can inform laboratory research studies, and we highlight what additional information is necessary for the use of samples in the wet laboratory and interpretation of associated mechanistic data. Through standardized data collection and reporting, data harmonization between research studies will transform our understanding of HS and lead to novel discoveries that will positively impact patient care.
Driven by growing interest across the sciences, a large number of empirical studies have been conducted in recent years of the structure of networks ranging from the Internet and the World Wide Web to biological networks and social networks. The data produced by these experiments are often rich and multimodal, yet at the same time they may contain substantial measurement error [1–7]. Accurate analysis and understanding of networked systems requires a way of estimating the true structure of networks from such rich but noisy data [8–15]. Here we describe a technique that allows us to make optimal estimates of network structure from complex data in arbitrary formats, including cases where there may be measurements of many different types, repeated observations, contradictory observations, annotations or metadata, or missing data. We give example applications to two different social networks, one derived from face-to-face interactions and one from self-reported friendships.
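For the simplest case of repeated observations of each node pair, the spirit of this approach can be sketched with an EM algorithm: estimate, for each pair, the posterior probability that an edge truly exists, jointly with the network's edge density and the measurement's true‐ and false‐positive rates. The parameters and the restriction to a single measurement type are hypothetical simplifications; the paper's framework handles far more general data.

```python
# Simplified EM sketch (hypothetical parameters): infer network structure
# from N repeated noisy observations per node pair. Q[i] is the posterior
# probability pair i has a true edge; rho is edge density, alpha the
# true-positive rate, beta the false-positive rate.
import random

random.seed(5)

N = 5            # repeated observations per node pair
n_pairs = 2000
true_rho, true_alpha, true_beta = 0.2, 0.9, 0.1

# E[i] = number of times pair i was observed as connected
E = []
for _ in range(n_pairs):
    edge = random.random() < true_rho
    p = true_alpha if edge else true_beta
    E.append(sum(random.random() < p for _ in range(N)))

rho, alpha, beta = 0.3, 0.7, 0.3   # rough starting values
for _ in range(200):
    # E-step: posterior probability each pair is a true edge
    Q = []
    for e in E:
        pe = rho * alpha**e * (1 - alpha)**(N - e)
        pn = (1 - rho) * beta**e * (1 - beta)**(N - e)
        Q.append(pe / (pe + pn))
    # M-step: re-estimate the parameters from the posteriors
    rho = sum(Q) / n_pairs
    alpha = sum(q * e for q, e in zip(Q, E)) / (N * sum(Q))
    beta = sum((1 - q) * e for q, e in zip(Q, E)) / (N * (n_pairs - sum(Q)))
```

The output is not a single network but a posterior probability per edge, which downstream analyses can propagate instead of thresholding, one of the advantages of estimating structure rather than taking the noisy observations at face value.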
Abstract
Data are an essential element of machine learning analysis. They are usually measured from observations and are indispensable for training a model. Good data preparation enhances the performance of analysis and delivers more reliable final results. However, many factors influence a dataset, and some lead to the loss of data. When a portion of the data is missing, the final prediction outcomes can be biased. To minimize the consequences of missing data, several data imputation methods have been established. This paper first covers some basic concepts about missing data. The following sections then present several popular data imputation methods, including complete case analysis, single imputation, and multiple imputation. Applications of some methods are presented to show how they can be used in real analysis situations. Finally, the paper discusses the limits of current data imputation methods.
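A quick toy comparison of two of the methods surveyed above shows why single imputation alone is often inadequate: filling every gap with the observed mean leaves the mean intact but artificially shrinks the spread of the data. The dataset below is hypothetical.

```python
# Toy comparison on hypothetical data: complete case analysis drops
# missing entries, while single mean imputation fills them in but
# understates the variance (one motivation for multiple imputation).
import random
import statistics

random.seed(2)

full = [random.gauss(50, 10) for _ in range(2000)]
# Make ~40% of values missing completely at random (represented as None).
data = [v if random.random() < 0.6 else None for v in full]

complete = [v for v in data if v is not None]
mean_cc = statistics.mean(complete)

# Single mean imputation: replace every missing value with the observed mean.
imputed = [v if v is not None else mean_cc for v in data]

sd_cc = statistics.stdev(complete)   # near the true spread of 10
sd_mi = statistics.stdev(imputed)    # noticeably smaller: spread understated
```

Multiple imputation addresses exactly this: by drawing several plausible values per gap and pooling across the completed datasets, it restores both the spread and the extra uncertainty that a single filled-in value hides.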
Because of the internal malfunction of satellite sensors and poor atmospheric conditions such as thick cloud, the acquired remote sensing data often suffer from missing information, i.e., the data usability is greatly reduced. In this paper, a novel method of missing information reconstruction in remote sensing images is proposed. The proposed unified spatial-temporal-spectral framework employs a deep convolutional neural network (CNN) combined with spatial-temporal-spectral supplementary information. In addition, to address the fact that most methods can only deal with a single missing information reconstruction task, the proposed approach can solve three typical missing information reconstruction tasks: 1) dead lines in Aqua Moderate Resolution Imaging Spectroradiometer band 6; 2) the Landsat Enhanced Thematic Mapper Plus scan line corrector-off problem; and 3) thick cloud removal. It should be noted that the proposed model can use multisource data (spatial, spectral, and temporal) as the input of the unified framework. The results of both simulated and real-data experiments demonstrate that the proposed model exhibits high effectiveness in the three missing information reconstruction tasks listed above.
In this article, we discuss the posterior predictive p-value (PPP) method in the presence of missing data, the Bayesian adaptation of the approximate fit indices RMSEA, CFI and TLI, as well as the Bayesian adaptation of the Wald test for nested models. Simulation studies are presented. We also illustrate how these new methods can be used to build BSEM models.