Health research using electronic health records (EHR) has gained popularity, but misclassification of EHR‐derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error. In this paper, we develop new strategies for handling disease status misclassification and selection bias in EHR‐based association studies. We first focus on each type of bias separately. For misclassification, we propose three novel likelihood‐based bias correction strategies. A distinguishing feature of the EHR setting is that misclassification may be related to patient‐varying factors, and the proposed methods leverage data in the EHR to estimate misclassification rates without gold standard labels. For addressing selection bias, we describe how calibration and inverse probability weighting methods from the survey sampling literature can be extended and applied to the EHR setting.
Addressing misclassification and selection biases simultaneously is a more challenging problem than dealing with each on its own, and we propose several new strategies. For all methods proposed, we derive valid standard error estimators and provide software for implementation. We provide a new suite of statistical estimation and inference strategies for addressing misclassification and selection bias simultaneously that is tailored to problems arising in EHR data analysis. We apply these methods to data from The Michigan Genomics Initiative, a longitudinal EHR‐linked biorepository.
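The inverse probability weighting idea mentioned above can be illustrated with a minimal sketch. This is not the estimator from the paper or its software; the function name `ipw_mean`, the sample, and the selection probabilities are all hypothetical, and the selection probabilities are assumed to be known rather than estimated.

```python
def ipw_mean(outcomes, selection_probs):
    """Weighted mean with weights 1 / selection probability (hypothetical helper)."""
    weights = [1.0 / p for p in selection_probs]
    total_weight = sum(weights)
    return sum(w * y for w, y in zip(weights, outcomes)) / total_weight


# Hypothetical sample: four people from a group that is easy to recruit
# (selection probability 0.5) and two from a group that is rarely
# recruited (selection probability 0.1). Outcome 1 = disease present.
outcomes = [1, 1, 1, 0, 0, 0]
selection_probs = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]

naive = sum(outcomes) / len(outcomes)           # ignores how people were selected
weighted = ipw_mean(outcomes, selection_probs)  # up-weights the rarely selected group

print(f"naive prevalence:    {naive:.3f}")
print(f"weighted prevalence: {weighted:.3f}")
```

In this toy sample the over-represented group has more disease, so the unweighted prevalence is higher than the weighted one. In practice the selection probabilities would themselves have to be estimated, for example from external population data.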
Electronic health records (EHR) are not designed for population‐based research, but they provide easy and quick access to longitudinal health information for a large number of individuals. Many statistical methods have been proposed to account for selection bias, missing data, phenotyping errors, or other problems that arise in EHR data analysis. However, addressing multiple sources of bias simultaneously is challenging. We developed a methodological framework (R package, SAMBA) for jointly handling both selection bias and phenotype misclassification in the EHR setting that leverages external data sources. These methods assume factors related to selection and misclassification are fully observed, but these factors may be poorly understood and partially observed in practice. As a follow‐up to the methodological work, we demonstrate how to apply these methods for two real‐world case studies, and we evaluate their performance. In both examples, we use individual patient‐level data collected through the University of Michigan Health System and various external population‐based data sources. In case study (a), we explore the impact of these methods on estimated associations between gender and cancer diagnosis. In case study (b), we compare corrected associations between previously identified genetic loci and age‐related macular degeneration with gold standard external summary estimates. These case studies illustrate how to utilize diverse auxiliary information to achieve less biased inference in EHR‐based research.
Timely diagnostic testing for active SARS‐CoV‐2 viral infections is key to controlling the spread of the virus and preventing severe disease. A central public health challenge is defining test allocation strategies with limited resources. In this paper, we provide a mathematical framework for defining an optimal strategy for allocating viral diagnostic tests. The framework accounts for imperfect test results, selective testing in certain high‐risk patient populations, practical constraints in terms of budget and/or total number of available tests, and the purpose of testing. Our method is not only useful for detecting infections, but can also be used for long‐term surveillance to detect new outbreaks. In our proposed approach, tests can be allocated across population strata defined by symptom severity and other patient characteristics, allowing the test allocation plan to prioritize higher risk patient populations. We illustrate our framework using historical data from the initial wave of the COVID‐19 outbreak in New York City. We extend our proposed method to address the challenge of allocating two different types of diagnostic tests with different costs and accuracy, for example, the RT‐PCR and the rapid antigen test (RAT), under budget constraints. We show how this latter framework can be useful for reopening college campuses, where university administrators are challenged with finite resources for community surveillance. We provide an R Shiny web application allowing users to explore test allocation strategies across a variety of pandemic scenarios. This work can serve as a useful tool for guiding public health decision‐making at a community level and adapting testing plans to different stages of an epidemic. The conceptual framework has broader relevance beyond the current COVID‐19 pandemic.
Biobanks linked to electronic health records provide rich resources for health‐related research. With improvements in administrative and informatics infrastructure, the availability and utility of data from biobanks have dramatically increased. In this paper, we first aim to characterize the current landscape of available biobanks and to describe specific biobanks, including their place of origin, size, and data types. The development and accessibility of large‐scale biorepositories provide the opportunity to accelerate agnostic searches, expedite discoveries, and conduct hypothesis‐generating studies of disease‐treatment, disease‐exposure, and disease‐gene associations. Rather than designing and implementing a single study focused on a few targeted hypotheses, researchers can potentially use biobanks' existing resources to answer an expanded selection of exploratory questions as quickly as they can analyze them. However, there are many obvious and subtle challenges with the design and analysis of biobank‐based studies. Our second aim is to discuss statistical issues related to biobank research such as study design, sampling strategy, phenotype identification, and missing data. We focus our discussion on biobanks that are linked to electronic health records. Some of the analytic issues are illustrated using data from the Michigan Genomics Initiative and UK Biobank, two biobanks with two different recruitment mechanisms. We summarize the current body of literature for addressing these challenges and discuss some standing open problems. This work complements and extends recent reviews about biobank‐based research and serves as a resource catalog with analytical and practical guidance for statisticians, epidemiologists, and other medical researchers pursuing research using biobanks.
The COVID-19 pandemic has highlighted a need for better understanding of countries' vulnerability and resilience to not only pandemics but also disasters, climate change, and other systemic shocks. A comprehensive characterization of vulnerability can inform efforts to improve infrastructure and guide disaster response in the future. In this paper, we propose a data-driven framework for studying countries' vulnerability and resilience to incident disasters across multiple dimensions of society. To illustrate this methodology, we leverage the rich data landscape surrounding the COVID-19 pandemic to characterize observed resilience for several countries (USA, Brazil, India, Sweden, New Zealand, and Israel) as measured by pandemic impacts across a variety of social, economic, and political domains. We also assess how observed responses and outcomes (i.e., resilience) of the COVID-19 pandemic are associated with pre-pandemic characteristics or vulnerabilities, including (1) prior risk for adverse pandemic outcomes due to population density and age and (2) the systems in place prior to the pandemic that may impact the ability to respond to the crisis, including health infrastructure and economic capacity. Our work demonstrates the importance of viewing vulnerability and resilience in a multi-dimensional way, where a country's resources and outcomes related to vulnerability and resilience can differ dramatically across economic, political, and social domains. This work also highlights key gaps in our current understanding about vulnerability and resilience and a need for data-driven, context-specific assessments of disaster vulnerability in the future.
Infectious disease forecasting is of great interest to the public health community and policymakers, since forecasts can provide insight into disease dynamics in the near future and inform interventions. Due to delays in case reporting, however, forecasting models may often underestimate the current and future disease burden. In this paper, we propose a general framework for addressing reporting delay in disease forecasting efforts with the goal of improving forecasts. We propose strategies for leveraging either historical data on case reporting or external internet-based data to estimate the amount of reporting error. We then describe several approaches for adapting general forecasting pipelines to account for under- or over-reporting of cases. We apply these methods to address reporting delay in data on dengue fever cases in Puerto Rico from 1990 to 2009 and to reports of influenza-like illness (ILI) in the United States between 2010 and 2019. Through a simulation study, we compare method performance and evaluate robustness to assumption violations. Our results show that forecasting accuracy and prediction coverage almost always increase when correction methods are implemented to address reporting delay. Some of these methods require knowledge about the reporting error or high-quality external data, which may not always be available. Alternatives include excluding recently reported data and performing sensitivity analysis. This work provides intuition and guidance for handling delay in disease case reporting and may serve as a useful resource to inform practical infectious disease forecasting efforts.
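One simple way to use historical reporting data, in the spirit of the abstract above, is to estimate how complete initial reports tend to be and rescale recent counts accordingly. The sketch below is an illustration under strong assumptions (a single, stable completeness fraction), not one of the methods evaluated in the paper; the function names and all numbers are hypothetical.

```python
from statistics import mean


def estimate_completeness(initial_counts, final_counts):
    """Average share of the final count already present in the initial report."""
    ratios = [i / f for i, f in zip(initial_counts, final_counts) if f > 0]
    return mean(ratios)


def correct_count(reported, completeness):
    """Inflate a recent, incomplete count by the estimated completeness."""
    return reported / completeness


# Hypothetical history: counts as first reported vs. after all delayed
# reports had arrived, for three past weeks.
hist_initial = [80, 120, 160]
hist_final = [100, 150, 200]

completeness = estimate_completeness(hist_initial, hist_final)
corrected = correct_count(40, completeness)  # 40 = hypothetical count for the latest week

print(f"estimated completeness: {completeness:.2f}")
print(f"corrected latest count: {corrected:.1f}")
```

The corrected count, rather than the raw one, would then be passed to the forecasting model. If completeness varies over time or by reporting lag, a single ratio like this would be too crude.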
Space scientists often face the question of whether data collected by different instruments are measurements of the same source population. This paper proposes a statistical validation method for evaluating the agreement between such related data sets. It offers a detailed case study focused on validating a new data set from the Interstellar Boundary Explorer (IBEX) mission, which serves as a practical how-to guide for similar analyses. Since 2008, the IBEX satellite has been gathering data on heliospheric energetic neutral atoms (ENAs) while being exposed to various sources of background noise, such as cosmic rays and solar energetic particles. The IBEX mission initially released only a qualified triple-coincidence (qABC) data product, which was designed to provide observations of ENAs free of background contamination. Further measurements revealed that the qABC data were in fact susceptible to contamination, having relatively low ENA counts and high background rates. To mitigate this issue, the mission team recently considered releasing a certain qualified double-coincidence (qBC) data product, which has roughly twice the detection rate of the qABC data product. This paper presents a simulation-based validation of the new qBC data product against the already-released qABC data product. The results show that the qBCs can plausibly be said to be measuring the same source population as the qABCs up to an average absolute deviation of 3.6%. Visual diagnostics provide additional confirmation of source rate coherence across data products. The framework introduced here is general and can be applied to other validation problems both within and outside the field of space physics.
False negative rates of severe acute respiratory syndrome coronavirus 2 diagnostic tests, together with selection bias due to prioritized testing, can result in inaccurate modeling of COVID‐19 transmission dynamics based on reported “case” counts. We propose an extension of the widely used Susceptible‐Exposed‐Infected‐Removed (SEIR) model that accounts for misclassification error and selection bias, and derive an analytic expression for the basic reproduction number R0 as a function of false negative rates of the diagnostic tests and selection probabilities for getting tested. Analyzing data from the first two waves of the pandemic in India, we show that correcting for misclassification and selection leads to more accurate prediction in a test sample. We provide estimates of undetected infections and deaths between April 1, 2020 and August 31, 2021. As of February 1, 2021, at the end of the first wave in India, the estimated under‐reporting factor was 11.1 (95% CI: 10.7, 11.5) for cases and 3.58 (95% CI: 3.5, 3.66) for deaths; as of July 1, 2021, these factors were 19.2 (95% CI: 17.9, 19.9) and 4.55 (95% CI: 4.32, 4.68), respectively. Equivalently, 9.0% (95% CI: 8.7%, 9.3%) and 5.2% (95% CI: 5.0%, 5.6%) of total estimated infections were reported on these two dates, while 27.9% (95% CI: 27.3%, 28.6%) and 22% (95% CI: 21.4%, 23.1%) of estimated total deaths were reported. Extensive simulation studies demonstrate the effect of misclassification and selection on estimation of R0 and prediction of future infections. An R package, SEIRfansy, is developed for broader dissemination.
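The "equivalently" step in the abstract above follows from reading the under‐reporting factor as estimated total events divided by reported events, so the percentage reported is 100 divided by the factor. The short check below uses only the point estimates quoted in the abstract; the function name `percent_reported` is a hypothetical helper, not part of SEIRfansy.

```python
def percent_reported(underreporting_factor):
    """Percent of estimated total events that were reported.

    Assumes the factor is (estimated total) / (reported), so the
    reported share is its reciprocal.
    """
    return 100.0 / underreporting_factor


# Point estimates of the under-reporting factors quoted in the abstract.
factors = {
    ("cases", "February 1, 2021"): 11.1,
    ("cases", "July 1, 2021"): 19.2,
    ("deaths", "February 1, 2021"): 3.58,
    ("deaths", "July 1, 2021"): 4.55,
}

for (outcome, date), factor in factors.items():
    share = percent_reported(factor)
    print(f"{outcome:6s} as of {date}: factor {factor} -> {share:.1f}% reported")
```

The rounded results agree with the percentages stated in the abstract (9.0%, 5.2%, 27.9%, and 22%).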
Studies have shown an increased risk of severe SARS-CoV-2-related (COVID-19) disease outcome and mortality for patients with cancer, but it is not well understood whether associations vary by cancer site, cancer treatment, and vaccination status.
Using electronic health record data from an academic medical center, we identified a retrospective cohort of 260,757 individuals tested for or diagnosed with COVID-19 from March 10, 2020, to August 1, 2022. Of these, 52,019 tested positive for COVID-19 of whom 13,752 had a cancer diagnosis. We conducted Firth-corrected logistic regression to assess the association between cancer status, site, treatment, vaccination, and four COVID-19 outcomes: hospitalization, intensive care unit admission, mortality, and a composite "severe COVID" outcome.
Cancer diagnosis was significantly associated with higher rates of severe COVID, hospitalization, and mortality. These associations were driven by patients whose most recent initial cancer diagnosis was within the past 3 years. Chemotherapy receipt, colorectal cancer, hematologic malignancies, kidney cancer, and lung cancer were significantly associated with higher rates of worse COVID-19 outcomes. Vaccinations were significantly associated with lower rates of worse COVID-19 outcomes regardless of cancer status.
Patients with colorectal cancer, hematologic malignancies, kidney cancer, or lung cancer or who receive chemotherapy for treatment should be cautious because of their increased risk of worse COVID-19 outcomes, even after vaccination.
Additional COVID-19 precautions are warranted for people with certain cancer types and treatments. Significant benefit from vaccination is noted for both cancer and cancer-free patients.