Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration to measurement and experimentation, and to make inferences about the social and political processes that drive discourse and content. In this article, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership model for analyzing the topical content of documents, in which the mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are specified as a simple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods quantify the effect of newswire source on both the frequency and nature of topic coverage. Supplementary materials for this article are available online.
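To make the construction concrete, below is a minimal generative sketch, assuming a logistic-normal link between document covariates and topic proportions; the variable names (X, gamma, theta) and the toy dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of covariate-dependent topic prevalence, assuming a
# logistic-normal link between document covariates and mixing weights.
# All names and dimensions are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_covariates = 5, 3, 2

X = rng.normal(size=(n_docs, n_covariates))            # document covariates, e.g. source, time
gamma = rng.normal(size=(n_covariates, n_topics - 1))  # prevalence coefficients

# Linear predictor for K-1 topics; the last topic is the reference category.
eta = X @ gamma + rng.normal(scale=0.5, size=(n_docs, n_topics - 1))
eta = np.hstack([eta, np.zeros((n_docs, 1))])

# Softmax maps the linear predictor to per-document topic proportions.
theta = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
print(theta.round(3))  # each row sums to 1: covariate-driven mixing weights
```

Under this parameterization, a shift in a covariate (say, newswire source) moves the expected topic proportions of a document through the coefficients gamma.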
Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood in theory, and in practice avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators in practice, for generalized linear models, Cox proportional hazards models, and M-estimation, and we perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.
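For intuition, the explicit update is theta_{t+1} = theta_t + rate_t * (y_t - x_t' theta_t) * x_t, whereas the implicit update evaluates the residual at theta_{t+1} itself. Below is a minimal sketch for least squares, where the implicit update has a closed form: the standard step is shrunk by 1 / (1 + rate_t * ||x_t||^2). The toy data and learning-rate schedule are assumptions for illustration.

```python
# Hedged sketch of implicit vs. explicit SGD for least squares; here the
# implicitly defined update has a closed form, shrinking the standard step.
# Toy data, seed, and learning-rate schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
theta_true = np.arange(1.0, p + 1)
y = X @ theta_true + rng.normal(size=n)

def sgd(implicit: bool, rate0: float = 1.0) -> np.ndarray:
    theta = np.zeros(p)
    for t in range(n):
        x = X[t]
        rate = rate0 / (1 + t)              # decaying learning rate
        resid = y[t] - x @ theta
        if implicit:
            resid /= 1.0 + rate * (x @ x)   # closed-form shrinkage of the step
        theta = theta + rate * resid * x
    return theta

print("explicit:", sgd(False).round(2))
print("implicit:", sgd(True).round(2))
```

The shrinkage factor involves rate_t * ||x_t||^2, a one-observation analogue of the Fisher information, which is why stability improves without extra matrix computations.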
Causal inference on a population of units connected through a network often presents technical challenges, including how to account for interference. In the presence of interference, for instance, the potential outcomes of a unit depend on its own treatment as well as on the treatments of other units, such as its neighbors in the network. In observational studies, a further complication is that the typical unconfoundedness assumption must be extended, say, to include the treatment of neighbors and individual and neighborhood covariates, to guarantee identification and valid inference. Here, we propose new estimands that define treatment and interference effects. We then derive analytical expressions for the bias of a naive estimator that wrongly assumes away interference. The bias depends on the level of interference, but also on the degree of association between individual and neighborhood treatments. We propose an extended unconfoundedness assumption that accounts for interference, and we develop new covariate-adjustment methods that lead to valid estimates of treatment and interference effects in observational studies on networks. Estimation is based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors' treatment. We carry out simulations, calibrated using friendship networks and covariates in a nationally representative longitudinal study of adolescents in grades 7-12 in the United States, to explore finite-sample performance in different realistic settings. Supplementary materials for this article are available online.
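As a rough illustration of the estimation strategy, the sketch below computes each unit's exposure to neighbors' treatments on a simulated graph and fits logistic models for the two components of a generalized propensity score; the random graph, covariates, and median split on exposure are placeholder assumptions, not the paper's design.

```python
# Illustrative sketch (not the paper's estimator): compute each unit's
# exposure to neighbors' treatments, then fit logistic models for own
# treatment and for high vs. low exposure given individual and
# neighborhood covariates. All names and the graph are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
A = (rng.random((n, n)) < 0.02).astype(int)        # random undirected graph
A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)                      # guard isolated nodes

X = rng.normal(size=(n, 2))                        # individual covariates
Xn = (A @ X) / deg[:, None]                        # neighborhood covariate means
Z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # confounded individual treatment
G = (A @ Z) / deg                                  # fraction of treated neighbors

# Balancing on both scores supports estimation of treatment and
# interference effects under the extended unconfoundedness assumption.
W = np.hstack([X, Xn])
ps_z = LogisticRegression().fit(W, Z).predict_proba(W)[:, 1]
high = (G > np.median(G)).astype(int)              # crude exposure dichotomy
ps_g = LogisticRegression().fit(W, high).predict_proba(W)[:, 1]
print(ps_z[:5].round(2), ps_g[:5].round(2))
```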
Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up network data collection and improve network model validation. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 550 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity using network-based metalearning to construct a series of “stacked” models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which can achieve nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvements.
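A minimal sketch of the stacking idea follows, assuming three classic topological heuristics (common neighbors, Jaccard coefficient, preferential attachment) as base predictors and a logistic-regression metalearner; the held-out-edge setup and the karate-club network are illustrative stand-ins for the paper's 203-predictor ensemble.

```python
# Hedged sketch of "stacked" link prediction: scores from several simple
# heuristics become features for a logistic-regression metalearner.
# The heuristics and tiny network are illustrative, not the paper's setup.
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()
edges = list(G.edges())
rng = np.random.default_rng(3)

# Hold out 20% of edges as positives; sample an equal number of non-edges.
held = [edges[i] for i in rng.choice(len(edges), size=len(edges) // 5, replace=False)]
G_obs = G.copy(); G_obs.remove_edges_from(held)
non_edges = list(nx.non_edges(G))
negs = [non_edges[i] for i in rng.choice(len(non_edges), size=len(held), replace=False)]

def features(H, pairs):
    """Stack scores from three classic heuristics for each node pair."""
    f = []
    for u, v in pairs:
        cn = len(list(nx.common_neighbors(H, u, v)))
        ja = next(nx.jaccard_coefficient(H, [(u, v)]))[2]
        pa = H.degree(u) * H.degree(v)
        f.append([cn, ja, pa])
    return np.array(f)

pairs = held + negs
y = np.array([1] * len(held) + [0] * len(negs))
meta = LogisticRegression().fit(features(G_obs, pairs), y)
print("stacked in-sample accuracy:", meta.score(features(G_obs, pairs), y))
```

In practice the metalearner would be evaluated on pairs it was not trained on; the in-sample score here only illustrates the mechanics of combining predictors.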
An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively in terms of topics. The current practice of parameterizing the themes in terms of most frequent words limits interpretability by ignoring the differential use of words across topics. Here, we show that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We consider a supervised setting where professional editors have annotated documents with topic categories, organized into a tree, in which leaf nodes correspond to more specific topics. Each document is annotated with multiple categories, at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents. A parallelized Hamiltonian Monte Carlo sampler allows the inference to scale to millions of documents. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. In an unsupervised setting, we then consider a simplified version of the model that shares the same regularization scheme with the previous model. We carry out a large randomized experiment on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than the currently established models.
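The sketch below illustrates one common formulation of a frequency-exclusivity ("FREX") style summary: rank words within a topic by the harmonic mean of their frequency rank and their exclusivity rank. The rate matrix, vocabulary, and weight are simulated placeholders; the paper's regularized estimates are more involved.

```python
# Sketch of a frequency-exclusivity word summary: score each word by the
# harmonic mean of its within-topic frequency rank and its exclusivity
# rank. Rates, vocabulary, and weight are toy placeholders.
import numpy as np

rng = np.random.default_rng(4)
rates = rng.gamma(0.3, size=(3, 20))              # topic-by-word usage rates
vocab = [f"w{i}" for i in range(rates.shape[1])]  # placeholder vocabulary

freq = rates / rates.sum(axis=1, keepdims=True)   # within-topic frequency
excl = rates / rates.sum(axis=0, keepdims=True)   # share of a word's use per topic

def ecdf(x):
    """Rank-based empirical CDF values in (0, 1]."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)

w = 0.5                                           # frequency/exclusivity weight
for k in range(rates.shape[0]):
    frex = 1.0 / (w / ecdf(freq[k]) + (1 - w) / ecdf(excl[k]))
    top = [vocab[i] for i in np.argsort(-frex)[:5]]
    print(f"topic {k}:", top)
```

A word ranks highly only if it is both common within the topic and rarely used by other topics, which is exactly the property the experiment above found to improve interpretability.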
Social networks affect many aspects of life, including the spread of diseases, the diffusion of information, workers' productivity, and consumers' behavior. Little is known, however, about how these networks form and change. Estimating causal effects and mechanisms that drive social network formation and dynamics is challenging because of the complexity of engineering social relations in a controlled environment, endogeneity between network structure and individual characteristics, and the lack of time-resolved data about individuals' behavior. We leverage data from a sample of 1.5 million college students on Facebook, who wrote more than 630 million messages and 590 million posts over 4 years, to design a long-term natural experiment of friendship formation and social dynamics in the aftermath of a natural disaster. The analysis shows that affected individuals are more likely to strengthen interactions, while maintaining the same number of friends as unaffected individuals. Our findings suggest that the formation of social relationships may serve as a coping mechanism to deal with high-stress situations and build resilience in communities.
Significance This paper presents an empirical analysis of the short- and long-term causal effects of a hurricane on social structure. Establishing causal relationships in social network formation and dynamics has historically been difficult because of the complexity of engineering social relations in a controlled environment, and the lack of time-resolved data about individuals' behavior. In addition, large-scale interventions of network structure are not feasible in practice. Here, we design an observational study that enables the estimation of causal effects by leveraging the locally well-defined impact of a hurricane. This aspect allows us to conceptualize the analysis of individuals’ behavior as a natural experiment, where the intervention is randomized by nature to locales, leaving only issues of balance to consider.
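As a schematic of the contrast underlying this natural experiment, the sketch below computes a difference-in-differences estimate of the change in interaction intensity between affected and unaffected individuals; all data here are simulated placeholders, not the Facebook sample, and the simple variance formula ignores clustering by locale.

```python
# Schematic of the natural-experiment contrast: compare pre/post changes
# in interaction intensity between affected and unaffected individuals.
# All data are simulated placeholders.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
affected = rng.binomial(1, 0.5, size=n)             # locale hit by the hurricane
pre = rng.poisson(10.0, size=n)                     # interactions per week, before
post = rng.poisson(10.0 + 2.0 * affected, size=n)   # strengthened ties if affected

delta = post - pre
effect = delta[affected == 1].mean() - delta[affected == 0].mean()
se = np.sqrt(delta[affected == 1].var(ddof=1) / (affected == 1).sum()
             + delta[affected == 0].var(ddof=1) / (affected == 0).sum())
print(f"difference-in-differences estimate: {effect:.2f} (se {se:.2f})")
```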
Public transportation systems are an essential component of major cities. The widespread use of smart cards for automated fare collection in these systems offers a unique opportunity to understand passenger behavior at a massive scale. In this study, we use network-wide data obtained from smart cards in the London transport system to predict future traffic volumes, and to estimate the effects of disruptions due to unplanned closures of stations or lines. Disruptions, or shocks, force passengers to make different decisions concerning which stations to enter or exit. We describe how these changes in passenger behavior lead to possible overcrowding and model how stations will be affected by given disruptions. This information can then be used to mitigate the effects of these shocks because transport authorities may prepare in advance alternative solutions such as additional buses near the most affected stations. We describe statistical methods that leverage the large amount of smart-card data collected under the natural state of the system, where no shocks take place, to construct variables that are indicative of behavior under disruptions. We find that features extracted from the natural-regime data can be successfully exploited to describe different disruption regimes, and that our framework can be used as a general tool for any similar complex transportation system.
Significance We propose a new approach to analyzing massive transportation systems that leverages traffic information about individual travelers. The goals of the analysis are to quantify the effects of shocks in the system, such as line and station closures, and to predict traffic volumes. We conduct an in-depth statistical analysis of the Transport for London railway traffic system. The proposed methodology is unique in the way that past disruptions are used to predict unseen scenarios, by relying on simple physical assumptions of passenger flow and a system-wide model for origin–destination movement. The method is scalable, more accurate than black-box approaches, and generalizable to other complex transportation systems. It therefore offers important insights to inform policies on urban transportation.
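To illustrate the flavor of this analysis, the sketch below redistributes the origin-destination flow of a closed station to the remaining stations in proportion to their natural-regime volumes; the station set and flow matrix are toy placeholders, not Transport for London data, and real rerouting would also respect network topology.

```python
# Toy sketch of disruption analysis: when a station closes, shift its
# origin-destination flow to open stations in proportion to their
# natural-regime (no-shock) volumes. Data are placeholders.
import numpy as np

rng = np.random.default_rng(6)
stations = ["A", "B", "C", "D", "E"]
od = rng.poisson(100, size=(5, 5)).astype(float)  # natural-regime O-D counts
np.fill_diagonal(od, 0)

closed = 2                                        # suppose station "C" closes
open_idx = [i for i in range(5) if i != closed]

# Demand from the closed station shifts to open stations, weighted by
# each open station's share of natural-regime entries.
entry_share = od[open_idx].sum(axis=1)
entry_share /= entry_share.sum()
shifted = od[closed].sum() * entry_share

predicted_entries = od[open_idx].sum(axis=1) + shifted
for i, s in zip(open_idx, predicted_entries):
    print(f"station {stations[i]}: predicted entries under closure = {s:.0f}")
```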
Heat causes protein misfolding and aggregation and, in eukaryotic cells, triggers aggregation of proteins and RNA into stress granules. We have carried out extensive proteomic studies to quantify heat-triggered aggregation and subsequent disaggregation in budding yeast, identifying >170 endogenous proteins aggregating within minutes of heat shock in multiple subcellular compartments. We demonstrate that these aggregated proteins are not misfolded and destined for degradation. Stable-isotope labeling reveals that even severely aggregated endogenous proteins are disaggregated without degradation during recovery from shock, contrasting with the rapid degradation observed for many exogenous thermolabile proteins. Although aggregation likely inactivates many cellular proteins, in the case of a heterotrimeric aminoacyl-tRNA synthetase complex, the aggregated proteins remain active with unaltered fidelity. We propose that most heat-induced aggregation of mature proteins reflects the operation of an adaptive, autoregulatory process of functionally significant aggregate assembly and disassembly that aids cellular adaptation to thermal stress.
•Mass spectrometry quantifies aggregation of endogenous proteins during heat stress
•Aggregates form rapidly in specific subcellular compartments
•Endogenous protein aggregates are disassembled without degradation during recovery
•In vitro, a heat-aggregated enzyme complex retains activity and fidelity
The aggregates of endogenous proteins triggered by heat stress in yeast are reversible. Rather than representing irreparably misfolded proteins destined for degradation, they can maintain activity and re-solubilize, suggesting an adaptive strategy underlying aggregation.
Despite its eponymous association with the heat shock response, yeast heat shock factor 1 (Hsf1) is essential even at low temperatures. Here we show that engineered nuclear export of Hsf1 results in cytotoxicity associated with massive protein aggregation. Genome-wide analysis revealed that Hsf1 nuclear export immediately decreased basal transcription and mRNA expression of 18 genes, which predominantly encode chaperones. Strikingly, rescuing basal expression of Hsp70 and Hsp90 chaperones enabled robust cell growth in the complete absence of Hsf1. With the exception of chaperone gene induction, the vast majority of the heat shock response was Hsf1 independent. By comparative analysis of mammalian cell lines, we found that only heat shock-induced, but not basal, expression of chaperones is dependent on the mammalian Hsf1 homolog (HSF1). Our work reveals that yeast chaperone gene expression is an essential housekeeping mechanism and provides a roadmap for defining the function of HSF1 as a driver of oncogenesis.
•Yeast Hsf1 prevents toxic protein aggregation even in the absence of heat stress
•Basal and heat-induced expression of a small chaperone network is Hsf1 dependent
•Mammalian Hsf1 drives chaperone gene expression during heat stress only
•Rescuing Hsp70 and Hsp90 expression prevents cell toxicity due to Hsf1 ablation
Using a chemical genetics approach for inactivating yeast Hsf1, Solís et al. find that gene expression of a compact chaperone network is minimally sufficient to stave off a lethal proteostasis collapse. Yeast Hsf1 controls both basal and heat stress-induced chaperone expression, whereas mammalian Hsf1 controls only the latter.