Summary
The paper presents an incremental updating algorithm for analysing streaming data sets with generalized linear models. The proposed method is formulated within a new framework of renewable estimation and incremental inference, in which the maximum likelihood estimator is renewed with current data and summary statistics of historical data. Our framework can be implemented within a popular distributed computing environment, known as Apache Spark, to scale up computation. Consisting of two data-processing layers, the Rho architecture enables us to accommodate inference-related statistics and to facilitate sequential updating of the statistics used in both estimation and inference. We establish estimation consistency and asymptotic normality of the proposed renewable estimator, for which the Wald test is utilized for incremental inference. Our methods are examined and illustrated by various numerical examples from both simulation experiments and a real-world data analysis.
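The renewal step described above can be sketched for logistic regression. The following is a minimal, hypothetical illustration (not the paper's implementation): each incoming batch contributes its score and information matrix, while historical data enter only through the previous estimate and the accumulated information matrix.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def renew(beta_prev, J_prev, Xb, yb, n_iter=25):
    """One renewable update for logistic regression: the historical data
    enter only through the previous estimate and the accumulated
    information matrix J_prev, never through raw observations."""
    beta = beta_prev.copy()
    for _ in range(n_iter):
        mu = sigmoid(Xb @ beta)
        score = Xb.T @ (yb - mu)                         # current-batch score
        J_b = Xb.T @ (Xb * (mu * (1.0 - mu))[:, None])   # batch information
        # aggregated estimating equation: J_prev (beta_prev - beta) + score = 0
        g = J_prev @ (beta_prev - beta) + score
        beta = beta + np.linalg.solve(J_prev + J_b, g)   # Newton step
    mu = sigmoid(Xb @ beta)
    J_new = J_prev + Xb.T @ (Xb * (mu * (1.0 - mu))[:, None])
    return beta, J_new

rng = np.random.default_rng(0)
true_beta = np.array([0.5, -1.0])
beta, J = np.zeros(2), 1e-6 * np.eye(2)   # start with near-zero information
for _ in range(20):                       # 20 streaming batches of 200
    Xb = rng.normal(size=(200, 2))
    yb = rng.binomial(1, sigmoid(Xb @ true_beta))
    beta, J = renew(beta, J, Xb, yb)
```

After all batches are processed, `beta` approximates the full-data maximum likelihood estimate even though no batch's raw data were revisited.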
Summary
Multi-compartment models have played a central role in modelling infectious disease dynamics since the early 20th century. They are a class of mathematical models widely used for describing the mechanism of an evolving epidemic. Integrated with certain sampling schemes, such mechanistic models can be applied to analyse public health surveillance data, for example to assess the effectiveness of preventive measures (e.g. social distancing and quarantine) and to forecast disease spread patterns. This review begins with a nationwide macromechanistic model and related statistical analyses, including model specification, estimation, inference and prediction. It then presents a community-level micromodel that enables high-resolution analyses of regional surveillance data to provide current and future risk information useful for local governments and residents making decisions on the reopening of local businesses and personal travel. R software and scripts are provided whenever appropriate to illustrate the numerical details of algorithms and calculations. The coronavirus disease 2019 pandemic surveillance data from the state of Michigan are used for illustration throughout this paper.
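As a toy illustration of the multi-compartment idea, the sketch below simulates a discrete-time SIR model, the simplest compartment model. All parameter values are hypothetical and chosen only for demonstration; they are not estimates from the Michigan data.

```python
import numpy as np

def simulate_sir(beta, gamma, s0, i0, days):
    """Discrete-time (daily Euler) SIR model: susceptible (S), infected (I)
    and removed (R) population fractions; beta is the transmission rate and
    gamma the removal rate, so R0 = beta / gamma."""
    s, i, r = s0, i0, 1.0 - s0 - i0
    path = [(s, i, r)]
    for _ in range(days):
        new_inf = beta * s * i   # new infections this day
        new_rem = gamma * i      # new removals (recoveries/deaths) this day
        s, i, r = s - new_inf, i + new_inf - new_rem, r + new_rem
        path.append((s, i, r))
    return np.array(path)

# hypothetical parameters chosen only for illustration (R0 = 3.5)
path = simulate_sir(beta=0.35, gamma=0.10, s0=0.999, i0=0.001, days=120)
peak_day = int(path[:, 1].argmax())   # day of peak prevalence
```

Preventive measures such as social distancing act by lowering `beta`, which delays and flattens the peak of the infected curve.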
For high-dimensional data sets with complicated dependency structures, the full likelihood approach often leads to intractable computational complexity. This imposes difficulty on model selection, given that most traditionally used information criteria require evaluation of the full likelihood. We propose a composite likelihood version of the Bayes information criterion (BIC) and establish its consistency property for the selection of the true underlying marginal model. Our proposed BIC is shown to be selection-consistent under some mild regularity conditions, where the number of potential model parameters is allowed to increase to infinity at a certain rate with the sample size. Simulation studies demonstrate the empirical performance of this new BIC, especially for the scenario where the number of parameters increases with the sample size. Technical proofs of our theoretical results are provided in the online supplemental materials.
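To make the idea concrete, here is a minimal sketch of marginal model selection with a composite-likelihood BIC, using an independence (composite) Gaussian working likelihood and the raw parameter count as the penalty dimension. The paper's criterion replaces the raw count with an effective dimension that corrects for dependence, so this is a deliberate simplification.

```python
import numpy as np

def cl_bic(X, y, cols, n_units):
    """Composite-likelihood BIC with Gaussian margins: -2 * cl + d * log(n).
    Here cl is the independence (working) log-likelihood and d the raw
    parameter count -- a simplification of the paper's effective dimension."""
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    sigma2 = resid @ resid / len(y)
    cl = -0.5 * len(y) * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return -2.0 * cl + (len(cols) + 1) * np.log(n_units)

rng = np.random.default_rng(1)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)  # only 2 true signals
candidates = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]     # nested marginal models
best = min(candidates, key=lambda c: cl_bic(X, y, c, n_units=n))
```

The `log(n)` penalty is what drives selection consistency: extra noise covariates improve the composite likelihood by too little to pay the penalty.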
It is important to develop statistical techniques to analyze high-dimensional data in the presence of both complex dependence and possible heavy tails and outliers, as arise in real-world applications such as imaging data analyses. We propose a new robust high-dimensional regression method with coefficient thresholding, in which an efficient nonconvex estimation procedure combines a thresholding function with the robust Huber loss. The proposed regularization method accounts for complex dependence structures in predictors and is robust against heavy tails and outliers in outcomes. Theoretically, we rigorously analyze the landscape of the population and empirical risk functions for the proposed method. This benign landscape enables us to establish both statistical consistency and computational convergence under the high-dimensional setting. We also present an extension that incorporates spatial information into the proposed method. Finite-sample properties of the proposed methods are examined by extensive simulation studies. An application concerns a scalar-on-image regression analysis of the association between psychiatric disorder, measured by the general factor of psychopathology, and features extracted from task functional MRI data in the Adolescent Brain Cognitive Development (ABCD) study.
Supplementary materials for this article are available online.
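A deliberately simplified stand-in for the estimation idea above: gradient descent on the Huber loss, whose bounded influence guards against heavy-tailed outcomes, followed by a final hard-thresholding of small coefficients. The paper's procedure uses a smoothed thresholding function inside the optimization; this sketch only illustrates the two ingredients.

```python
import numpy as np

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss: linear tails bound outlier influence."""
    return np.clip(r, -delta, delta)

def robust_threshold_reg(X, y, tau=0.2, iters=500):
    """Gradient descent on the averaged Huber loss, then hard-thresholding
    of small coefficients -- a simplified stand-in for the paper's smoothed
    coefficient-thresholding procedure."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # safe step for the averaged loss
    beta = np.zeros(p)
    for _ in range(iters):
        r = y - X @ beta
        beta = beta + step * (X.T @ huber_grad(r)) / n
    beta[np.abs(beta) < tau] = 0.0         # final hard thresholding
    return beta

rng = np.random.default_rng(2)
n, p = 400, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_t(df=2, size=n)   # heavy-tailed errors
beta_hat = robust_threshold_reg(X, y)
```

The clipped gradient is what keeps single outliers from dominating the fit, while the thresholding step zeroes out negligible coefficients.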
Endocrine disrupting chemicals (EDCs) are ubiquitous, and pregnancy is a sensitive window for toxicant exposure. EDCs may disrupt the maternal immune system, which may lead to poor pregnancy outcomes. Most studies investigate single EDCs, even though "real life" exposures do not occur in isolation. We tested the hypothesis that uniquely weighted mixtures of early pregnancy exposures are associated with distinct changes in the maternal and neonatal inflammasome. First trimester urine samples from 56 women were tested for 12 phthalates, 12 phenols, and 17 metals. Twelve cytokines were measured in first trimester and term maternal plasma, and in cord blood after delivery. Spearman correlations and linear regression were used to relate individual exposures to inflammatory cytokines. Linear regression was used to relate cytokine levels to gestational age and birth weight. Principal component analysis was used to assess the effect of weighted EDC mixtures on maternal and neonatal inflammation. Our results demonstrated that maternal and cord blood cytokines were differentially associated with (1) individual EDCs and (2) EDC mixtures. Several individual cytokines were positively associated with gestational age and birth weight. These observed associations between EDC mixtures and the pregnancy inflammasome may have clinical and public health implications for women of childbearing age.
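A minimal sketch of the mixtures step: principal component analysis on standardized exposure data yields weighted mixture scores, which are then related to a cytokine level by linear regression. All data below are synthetic and the variable names are illustrative only; this is not the study's analysis code.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 56, 8   # 56 women, 8 hypothetical co-occurring exposures
# synthetic log-concentrations sharing a common "mixture" factor
E = rng.normal(size=(n, p)) + rng.normal(size=(n, 1))
Z = (E - E.mean(axis=0)) / E.std(axis=0)   # standardize each exposure

# PCA via SVD: rows of Vt weight the exposures into mixture components
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[0]                         # per-woman score on mixture 1
var_explained = s[0] ** 2 / (s ** 2).sum()

# relate the mixture score to a synthetic cytokine level
cytokine = 0.8 * scores + rng.normal(size=n)
slope = np.polyfit(scores, cytokine, 1)[0]
```

The loading vector `Vt[0]` is what makes the mixture "uniquely weighted": each exposure contributes in proportion to its loading rather than being analyzed in isolation.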
Data sharing barriers present paramount challenges in multicenter clinical studies where multiple data sources are stored and managed in a distributed fashion at different local study sites. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming. Data merging becomes more burdensome when propensity score modeling is involved, because the analysis must combine many confounding variables, and the systematic incorporation of this additional modeling into a meta-analysis has not been thoroughly investigated in the literature. Motivated by a multicenter clinical trial of basal insulin treatment for reducing the risk of post-transplantation diabetes mellitus, we propose a new inference framework that avoids merging subject-level raw data from multiple sites at a centralized facility and requires only the sharing of summary statistics. Unlike the architecture of federated learning, the proposed collaborative inference does not need a central site to combine local results and thus enjoys maximal protection of data privacy and minimal sensitivity to unbalanced data distributions across sources. We show theoretically and numerically that the new distributed inference approach incurs little loss of statistical power compared to the centralized method that requires merging the entire data set. We present large-sample properties and algorithms for the proposed method, and illustrate its performance by simulation experiments and by the motivating example on the differential average treatment effect of basal insulin in lowering diabetes risk among kidney-transplant patients relative to the standard of care.
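The summary-statistics idea can be illustrated with a deliberately simplified stand-in: a fixed-effect (inverse-variance) combination that every site can compute locally once each site's estimate and variance are broadcast, so no central server is needed. This is not the paper's collaborative estimator, only a sketch of the no-center principle.

```python
import numpy as np

def combine(estimates, variances):
    """Inverse-variance (fixed-effect) combination of site-level summary
    statistics.  Once every site broadcasts its (estimate, variance) pair,
    any site can run this locally -- no central server is required."""
    w = 1.0 / np.asarray(variances)
    est = np.sum(w * np.asarray(estimates)) / w.sum()
    se = np.sqrt(1.0 / w.sum())
    return est, se

rng = np.random.default_rng(4)
true_effect = 0.30
site_sizes = [120, 250, 80, 400]               # deliberately unbalanced sites
estimates, variances = [], []
for n in site_sizes:
    y = rng.normal(true_effect, 1.0, size=n)   # synthetic site-level outcomes
    estimates.append(y.mean())
    variances.append(y.var(ddof=1) / n)
est, se = combine(estimates, variances)
```

Because only two numbers leave each site, subject-level privacy is preserved, and the combined standard error approaches what pooling all 850 subjects would give.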
This article develops an incremental learning algorithm based on the quadratic inference function (QIF) to analyze streaming datasets with correlated outcomes, such as longitudinal and clustered data. We propose a renewable QIF (RenewQIF) method within a paradigm of renewable estimation and incremental inference, in which parameter estimates are recursively renewed with current data and summary statistics of historical data, but with no use of any historical subject-level raw data. We compare our renewable estimation method with both the offline QIF and the offline generalized estimating equations (GEE) approaches that process the entire cumulative subject-level data all together, and show theoretically and numerically that our renewable procedure enjoys both statistical and computational efficiency. We also propose an approach to diagnosing the homogeneity assumption on regression coefficients via a sequential goodness-of-fit test that serves as a screening procedure for abnormal data batches. We implement the proposed methodology by expanding Spark's existing Lambda architecture to support statistical inference and data quality diagnostics. We illustrate the proposed methodology by extensive simulation studies and an analysis of streaming car crash datasets from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS).
Supplementary materials for this article are available online.
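The screening idea behind a sequential goodness-of-fit check can be sketched with a Wald-type statistic: each incoming batch's estimate is compared with a running reference, and batches that deviate beyond a chi-square cutoff are flagged as potentially abnormal. This is a simplified illustration under assumed batch-level estimates and covariances, not the RenewQIF test itself.

```python
import numpy as np

CHI2_95_DF2 = 5.991   # 95th percentile of the chi-square with 2 df

def screen_batches(batch_betas, batch_covs, beta_ref):
    """Wald-type homogeneity screen: flag any batch whose estimate deviates
    from the running reference beyond the chi-square cutoff."""
    flags = []
    for b, S in zip(batch_betas, batch_covs):
        d = b - beta_ref
        wald = d @ np.linalg.solve(S, d)
        flags.append(bool(wald > CHI2_95_DF2))
    return flags

rng = np.random.default_rng(5)
beta_ref = np.array([0.5, -1.0])        # running (historical) estimate
S = 0.01 * np.eye(2)                    # batch-level sampling covariance
betas, covs = [], []
for k in range(10):
    shift = np.array([1.5, 1.5]) if k == 7 else np.zeros(2)  # batch 7 abnormal
    betas.append(beta_ref + shift + rng.multivariate_normal(np.zeros(2), S))
    covs.append(S)
flags = screen_batches(betas, covs, beta_ref)
```

Flagged batches can then be quarantined before they are allowed to renew the running estimate, which is the screening role the abstract describes.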
Varying Index Coefficient Models
Ma, Shujie; Song, Peter X.-K.
Journal of the American Statistical Association, 03/2015, Volume 110, Issue 509
Journal Article · Peer-reviewed · Open Access
There is a long history of using interactions in regression analysis to investigate alterations in covariate effects on response variables. In this article, we aim to address two kinds of new challenges arising from the inclusion of such high-order effects in the regression model for complex data. The first concerns situations where the interaction effects of individual covariates are weak but those of combined covariates are strong; the second pertains to the presence of nonlinear interactive effects directed by low-effect covariates. We propose a new class of semiparametric models with varying index coefficients, which enables us to model and assess nonlinear interaction effects of grouped covariates on the response variable. As a result, most existing semiparametric regression models are special cases of our proposed models. We develop a numerically stable and computationally fast estimation procedure using both the profile least squares method and local fitting. We establish both estimation consistency and asymptotic normality for the proposed estimators of the index coefficients, as well as the oracle property for the nonparametric function estimator. In addition, a generalized likelihood ratio test is provided to test for the existence of interaction effects or of nonlinear interaction effects. Our models and estimation methods are illustrated by simulation studies and by an analysis of child growth data that evaluates alterations in growth rates incurred by mothers' exposure to endocrine disrupting compounds during pregnancy. Supplementary materials for this article are available online.
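For concreteness, a varying index coefficient model of the assumed form y = Σ_l m_l(z'β_l) x_l + ε can be simulated as follows. The index coefficients and link functions below are hypothetical choices, picked only to exhibit the two features the abstract highlights: grouped-covariate indices and nonlinear interaction effects.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
Z = rng.normal(size=(n, 2))    # index covariates forming the grouped indices
X = rng.normal(size=(n, 2))    # covariates whose effects vary with the indices

# hypothetical index coefficients (unit norm for identifiability)
b1 = np.array([0.6, 0.8])
b2 = np.array([1.0, 0.0])
m1 = np.sin                    # nonlinear coefficient function for X[:, 0]
m2 = np.square                 # nonlinear coefficient function for X[:, 1]

# varying index coefficient model: y = m1(Z b1) x1 + m2(Z b2) x2 + noise
signal = m1(Z @ b1) * X[:, 0] + m2(Z @ b2) * X[:, 1]
y = signal + 0.2 * rng.normal(size=n)
```

Setting each `m_l` to a constant recovers ordinary linear regression, and a single additive term with `x = 1` recovers the single-index model, illustrating why so many semiparametric models are special cases.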
Spatial-clustered data refer to high-dimensional correlated measurements collected from units or subjects that are spatially clustered. Such data arise frequently in studies in the social and health sciences. We propose a unified modeling framework, termed GeoCopula, to characterize both large-scale and small-scale variation for various data types, including continuous, binary, and count data as special cases. To overcome challenges in estimation and inference for the model parameters, we propose an efficient composite likelihood approach in which estimation efficiency results from the construction of over-identified joint composite estimating equations. The statistical theory for the proposed estimation is then developed by extending the classical theory of the generalized method of moments. A clear advantage of the proposed estimation method is its computational feasibility. We conduct several simulation studies to assess the performance of the proposed models and estimation methods for both Gaussian and binary spatial-clustered data. Results show a clear improvement in estimation efficiency over the conventional composite likelihood method. An illustrative data example is included to motivate and demonstrate the proposed method.
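A minimal sketch of the conventional pairwise composite likelihood for Gaussian spatial data (the baseline that GeoCopula's over-identified estimating equations improve upon): only pairs within a cutoff distance contribute, which is what keeps the computation feasible. The exponential covariance and all parameter values are illustrative assumptions.

```python
import numpy as np

def pairwise_cl(theta, y, coords, max_dist=1.0):
    """Pairwise Gaussian composite log-likelihood under an exponential
    covariance sigma2 * exp(-d / phi).  Only pairs closer than max_dist
    contribute, which keeps the computation feasible."""
    sigma2, phi = theta
    cl = 0.0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            d = np.linalg.norm(coords[i] - coords[j])
            if d > max_dist:
                continue
            rho = np.exp(-d / phi)
            C = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
            yij = np.array([y[i], y[j]])
            cl -= 0.5 * (np.log(np.linalg.det(2.0 * np.pi * C))
                         + yij @ np.linalg.solve(C, yij))
    return cl

rng = np.random.default_rng(7)
coords = rng.uniform(0.0, 5.0, size=(60, 2))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = 1.0 * np.exp(-D / 0.8)                 # truth: sigma2 = 1, phi = 0.8
y = rng.multivariate_normal(np.zeros(60), Sigma)
cl_true = pairwise_cl([1.0, 0.8], y, coords)   # evaluated at the truth
cl_wrong = pairwise_cl([3.0, 0.1], y, coords)  # badly misspecified
```

Maximizing this surface over `(sigma2, phi)` gives the conventional composite likelihood estimator; replacing the bivariate Gaussian density with a copula-based pair density extends the same scheme to binary and count margins.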
The purpose of this study was to identify individual and residency program factors associated with increased suicide risk, as measured by suicidal ideation. We utilized a prospective, longitudinal cohort study design to assess the prevalence and predictors of suicidal ideation in 6,691 (2012-2014 cohorts, training data set) and 4,904 (2015 cohort, test data set) first-year training physicians (interns) at hospital systems across the United States. We assessed suicidal ideation two months before internship and then quarterly through the intern year. The prevalence of reported suicidal ideation in the study population increased from 3.0% at baseline to a mean of 6.9% during internship, and 16.4% of interns reported suicidal ideation at least once during their internship. In the training data set, a series of baseline demographic (male gender) and psychological factors (high neuroticism, depressive symptoms and suicidal ideation) were associated with increased risk of suicidal ideation during internship. Further, prior-quarter psychiatric symptoms (depressive symptoms and suicidal ideation) and concurrent work-related factors (increases in self-reported work hours and medical errors) were associated with increased risk of suicidal ideation. A model derived from the training data set had an area under the receiver operating characteristic curve (AUC) of 0.83 in the test data set. The suicidal ideation risk predictors analyzed in this study can help programs and interns identify those at risk for suicidal ideation before the onset of training. Further, increases in self-reported work hours and environments associated with increased medical errors are potentially modifiable factors for residency programs to target to reduce suicide risk.
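The AUC used to evaluate the prediction model can be computed from predicted risk scores with the rank-based (Mann-Whitney) formula. The sketch below applies it to synthetic scores whose separation is chosen only for illustration; it does not reproduce the study's model or data.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formula: the
    probability that a random positive case outranks a random negative."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(8)
labels = rng.binomial(1, 0.15, size=2000)             # ~15% with the outcome
scores = rng.normal(size=2000) + 1.35 * labels        # hypothetical risk scores
a = auc(scores, labels)
```

An AUC of 0.83 means a randomly chosen intern who reported ideation would receive a higher predicted risk than a randomly chosen intern who did not about 83% of the time.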