Statistical learning and selective inference
Taylor, Jonathan; Tibshirani, Robert J.
Proceedings of the National Academy of Sciences - PNAS,
06/2015, Volume: 112, Issue: 25
Journal Article
Peer reviewed
Open access
We describe the problem of “selective inference.” This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of ...these associations? The fact that we have “cherry-picked” (searched for the strongest associations) means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
Significance: Most statistical analyses involve some kind of “selection”: searching through the data for the strongest associations. Measuring the strength of the resulting associations is a challenging task, because one must account for the effects of the selection. There are some new tools in selective inference for this task, and we illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
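The need for a "higher bar" after cherry-picking can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not the selective inference methods of the paper: it draws `p` independent null test statistics, keeps only the largest in absolute value, and compares a naive two-sided 5% threshold with a Bonferroni-adjusted one. The function names, `p = 20`, and the use of Bonferroni as the correction are all choices made here for illustration.

```python
import random
from statistics import NormalDist


def max_abs_z(p, rng):
    """Largest |z| among p independent null (standard normal) statistics."""
    return max(abs(rng.gauss(0.0, 1.0)) for _ in range(p))


def false_positive_rate(p, threshold, n_sim=2000, seed=0):
    """Share of simulations where the cherry-picked statistic exceeds threshold."""
    rng = random.Random(seed)
    hits = sum(max_abs_z(p, rng) > threshold for _ in range(n_sim))
    return hits / n_sim


p = 20  # number of candidate associations searched (hypothetical)
naive = NormalDist().inv_cdf(1 - 0.05 / 2)            # usual two-sided 5% cutoff
adjusted = NormalDist().inv_cdf(1 - 0.05 / (2 * p))   # Bonferroni cutoff

naive_rate = false_positive_rate(p, naive)
adjusted_rate = false_positive_rate(p, adjusted)
print(f"naive: {naive_rate:.3f}, adjusted: {adjusted_rate:.3f}")
```

With no true associations at all, the naive cutoff should flag the selected statistic far more often than 5% of the time, while the adjusted cutoff should stay near the nominal level. Bonferroni is used only because it is simple; the abstract refers to more refined tools that account for the selection itself.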
Rheumatoid arthritis (RA) is a prototypical autoimmune arthritis affecting nearly 1% of the world population and is a significant cause of worldwide disability. Though prior studies have demonstrated ...the appearance of RA-related autoantibodies years before the onset of clinical RA, the pattern of immunologic events preceding the development of RA remains unclear. To characterize the evolution of the autoantibody response in the preclinical phase of RA, we used a novel multiplex autoantigen array to evaluate development of the anti-citrullinated protein antibodies (ACPA) and to determine if epitope spread correlates with rise in serum cytokines and imminent onset of clinical RA. To do so, we utilized a cohort of 81 patients with clinical RA for whom stored serum was available from 1-12 years prior to disease onset. We evaluated the accumulation of ACPA subtypes over time and correlated this accumulation with elevations in serum cytokines. We then used logistic regression to identify a profile of biomarkers which predicts the imminent onset of clinical RA (defined as within 2 years of testing). We observed a time-dependent expansion of ACPA specificity with the number of ACPA subtypes. At the earliest timepoints, we found autoantibodies targeting several innate immune ligands including citrullinated histones, fibrinogen, and biglycan, thus providing insights into the earliest autoantigen targets and potential mechanisms underlying the onset and development of autoimmunity in RA. Additionally, expansion of the ACPA response strongly predicted elevations in many inflammatory cytokines including TNF-α, IL-6, IL-12p70, and IFN-γ. Thus, we observe that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread. 
Epitope expansion is closely correlated with the appearance of preclinical inflammation, and we identify a biomarker profile including autoantibodies and cytokines which predicts the imminent onset of clinical arthritis.
Full text
Available at: DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Elucidation and examination of cellular subpopulations that display condition-specific behavior can play a critical contributory role in understanding disease mechanism, as well as provide a focal ...point for development of diagnostic criteria linking such a mechanism to clinical prognosis. Despite recent advancements in single-cell measurement technologies, the identification of relevant cell subsets through manual efforts remains standard practice. As new technologies such as mass cytometry increase the parameterization of single-cell measurements, the scalability and subjectivity inherent in manual analyses slow both analysis and progress. We therefore developed Citrus (cluster identification, characterization, and regression), a data-driven approach for the identification of stratifying subpopulations in multidimensional cytometry datasets. The methodology of Citrus is demonstrated through the identification of known and unexpected pathway responses in a dataset of stimulated peripheral blood mononuclear cells measured by mass cytometry. Additionally, the performance of Citrus is compared with that of existing methods through the analysis of several publicly available datasets. As the complexity of flow cytometry datasets continues to increase, methods such as Citrus will be needed to aid investigators in the performance of unbiased (and potentially more thorough) correlation-based mining and inspection of cell subsets nested within high-dimensional datasets.
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso ...penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method's close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
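The baseline mentioned in the abstract, simple elementwise thresholding of the empirical covariance matrix, can be sketched in a few lines. This is not the penalized-likelihood method the abstract proposes. The function name, the threshold `tau`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def threshold_covariance(X, tau):
    """Zero out off-diagonal sample covariances with absolute value <= tau."""
    S = np.cov(X, rowvar=False)      # rows of X are observations
    keep = np.abs(S) > tau
    np.fill_diagonal(keep, True)     # variances are never thresholded
    return np.where(keep, S, 0.0)


# Synthetic data: columns 0 and 1 are correlated, columns 2 and 3 are independent.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 4))
X = z.copy()
X[:, 1] = z[:, 0] + 0.5 * z[:, 1]

S_sparse = threshold_covariance(X, tau=0.3)
print(np.round(S_sparse, 2))
```

Zeros in the result correspond to estimated marginal independencies. Unlike the estimate described in the abstract, a thresholded matrix is not guaranteed to be positive definite.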
We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be ...used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample–unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.
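The idea of combining regression adjustment with sample splitting can be sketched as follows. This is a generic cross-fitted, regression-adjusted estimator written under stated assumptions; it is not claimed to be the exact cross-estimation formula of the paper. Ordinary least squares stands in for the lasso or another high-dimensional fit, and all function names and the simulated data are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def cross_fitted_ate(X, y, w, n_folds=5, seed=0):
    """Average treatment effect using out-of-fold regression adjustment."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds   # random fold labels
    mu0 = np.empty(len(y))
    mu1 = np.empty(len(y))
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Fit each arm's outcome model without using the held-out fold.
        c0 = fit_ols(X[train & (w == 0)], y[train & (w == 0)])
        c1 = fit_ols(X[train & (w == 1)], y[train & (w == 1)])
        mu0[test] = predict(c0, X[test])
        mu1[test] = predict(c1, X[test])
    # Model-based contrast plus the mean out-of-fold residual in each arm.
    resid1 = (y - mu1)[w == 1].mean()
    resid0 = (y - mu0)[w == 0].mean()
    return (mu1 - mu0).mean() + resid1 - resid0


# Simulated randomized experiment with a true effect of 2.0 (hypothetical).
rng = np.random.default_rng(1)
n, p = 2000, 5
X = rng.normal(size=(n, p))
w = rng.integers(0, 2, size=n)
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 2.0 * w + rng.normal(size=n)

ate_hat = cross_fitted_ate(X, y, w)
print(f"estimated ATE: {ate_hat:.3f}")
```

Because each unit's prediction comes from a model that never saw that unit, any regression method could replace `fit_ols` without changing the rest of the sketch.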
Females have generally more robust immune responses than males for reasons that are not well-understood. Here we used a systems analysis to investigate these differences by analyzing the neutralizing ...antibody response to a trivalent inactivated seasonal influenza vaccine (TIV) and a large number of immune system components, including serum cytokines and chemokines, blood cell subset frequencies, genome-wide gene expression, and cellular responses to diverse in vitro stimuli, in 53 females and 34 males of different ages. We found elevated antibody responses to TIV and expression of inflammatory cytokines in the serum of females compared with males regardless of age. This inflammatory profile correlated with the levels of phosphorylated STAT3 proteins in monocytes but not with the serological response to the vaccine. In contrast, using a machine learning approach, we identified a cluster of genes involved in lipid biosynthesis and previously shown to be up-regulated by testosterone that correlated with poor virus-neutralizing activity in men. Moreover, men with elevated serum testosterone levels and associated gene signatures exhibited the lowest antibody responses to TIV. These results demonstrate a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.
Accurate identification of prostate cancer in frozen sections at the time of surgery can be challenging, limiting the surgeon’s ability to best determine resection margins during prostatectomy. We ...performed desorption electrospray ionization mass spectrometry imaging (DESI-MSI) on 54 banked human cancerous and normal prostate tissue specimens to investigate the spatial distribution of a wide variety of small metabolites, carbohydrates, and lipids. In contrast to several previous studies, our method included Krebs cycle intermediates (m/z <200), which we found to be highly informative in distinguishing cancer from benign tissue. Malignant prostate cells showed marked metabolic derangements compared with their benign counterparts. Using the “Least absolute shrinkage and selection operator” (Lasso), we analyzed all metabolites from the DESI-MS data and identified parsimonious sets of metabolic profiles for distinguishing between cancer and normal tissue. In an independent set of samples, we could use these models to classify prostate cancer from benign specimens with nearly 90% accuracy per patient. Based on previous work in prostate cancer showing that glucose levels are high while citrate is low, we found that measurement of the glucose/citrate ion signal ratio accurately predicted cancer when this ratio exceeds 1.0 and normal prostate when the ratio is less than 0.5. After brief tissue preparation, the glucose/citrate ratio can be recorded on a tissue sample in 1 min or less, which is in sharp contrast to the 20 min or more required by histopathological examination of frozen tissue specimens.
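The glucose/citrate rule stated in the abstract can be written as a small decision function. The thresholds 1.0 and 0.5 come from the abstract; the function name and the "indeterminate" label for ratios between them are assumptions, since the abstract does not say how that range is handled.

```python
def classify_by_ratio(glucose_signal, citrate_signal):
    """Classify a tissue sample from its glucose/citrate ion signal ratio."""
    if citrate_signal <= 0:
        raise ValueError("citrate signal must be positive")
    ratio = glucose_signal / citrate_signal
    if ratio > 1.0:
        return "cancer"
    if ratio < 0.5:
        return "normal"
    # Assumption: the abstract gives no call for ratios from 0.5 to 1.0.
    return "indeterminate"


print(classify_by_ratio(3.0, 2.0))  # ratio 1.5
print(classify_by_ratio(1.0, 4.0))  # ratio 0.25
print(classify_by_ratio(3.0, 4.0))  # ratio 0.75
```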
In this study of patients in Ontario, Canada, who received a medical warning from a physician because they were judged to be potentially unfit to drive, warnings were associated with reductions in ...emergency department visits for road crashes.
Physicians' warnings to patients who are potentially unfit to drive are a medical intervention intended to prevent trauma from motor vehicle crashes. Advocates point out the similarity to physicians' warnings with regard to communicable infections, arguing that formal warnings are needed because dangerous driving imposes risks on others.[1] However, formal warnings may reduce the patient's quality of life, jeopardize doctor–patient relationships, burden family members, and generate bureaucratic hassles.[2] Many small studies offer conflicting conclusions on the effectiveness of physicians' warnings to patients who are potentially unfit to drive.[3–6] Different regions, therefore, have different policies for medical warnings to drivers. . . .
Surgical resection is the main curative option for gastrointestinal cancers. The extent of cancer resection is commonly assessed during surgery by pathologic evaluation of (frozen sections of) the ...tissue at the resected specimen margin(s) to verify whether cancer is present. We compare this method to an alternative procedure, desorption electrospray ionization mass spectrometric imaging (DESI-MSI), for 62 banked human cancerous and normal gastric-tissue samples. In DESI-MSI, microdroplets strike the tissue sample, the resulting splash enters a mass spectrometer, and a statistical analysis, here, the Lasso method (which stands for least absolute shrinkage and selection operator and which is a multiclass logistic regression with L1 penalty), is applied to classify tissues based on the molecular information obtained directly from DESI-MSI. The methodology developed with 28 frozen training samples of clear histopathologic diagnosis showed an overall accuracy value of 98% for the 12,480 pixels evaluated in cross-validation (CV), and 97% when a completely independent set of samples was tested. By applying an additional spatial smoothing technique, the accuracy for both CV and the independent set of samples was 99% compared with histological diagnoses. To test our method for clinical use, we applied it to a total of 21 tissue-margin samples prospectively obtained from nine gastric-cancer patients. The results obtained suggest that DESI-MSI/Lasso may be valuable for routine intraoperative assessment of the specimen margins during gastric-cancer surgery.
Detection of microscopic skin lesions presents a considerable challenge in diagnosing early-stage malignancies as well as in residual tumor interrogation after surgical intervention. In this study, ...we established the capability of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) to distinguish between micrometer-sized tumor aggregates of basal cell carcinoma (BCC), a common skin cancer, and normal human skin. We analyzed 86 human specimens collected during Mohs micrographic surgery for BCC to cross-examine spatial distributions of numerous lipids and metabolites in BCC aggregates versus adjacent skin. Statistical analysis using the least absolute shrinkage and selection operator (Lasso) was employed to categorize each 200-μm-diameter picture element (pixel) of investigated skin tissue map as BCC or normal. Lasso identified 24 molecular ion signals, which are significant for pixel classification. These ion signals included lipids observed at m/z 200–1,200 and Krebs cycle metabolites observed at m/z < 200. Based on these features, Lasso yielded an overall 94.1% diagnostic accuracy pixel by pixel of the skin map compared with histopathological evaluation. We suggest that DESI-MSI/Lasso analysis can be employed as a complementary technique for delineation of microscopic skin tumors.
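How a Lasso classifier ends up with a small set of ion signals can be illustrated with a minimal L1-penalized logistic regression fitted by proximal gradient descent. This is a sketch on synthetic data, not the authors' pipeline or DESI-MSI data. The function names, the penalty `lam`, and the simulated features standing in for ion signals are all assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_logistic(X, y, lam, n_iter=500):
    """Minimize mean logistic loss + lam * ||beta||_1 (intercept unpenalized)."""
    n, p = X.shape
    beta, b0 = np.zeros(p), 0.0
    # Step size from an upper bound on the gradient's Lipschitz constant.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(b0 + X @ beta)))
        resid = prob - y
        b0 -= step * resid.mean()
        beta = soft_threshold(beta - step * (X.T @ resid) / n, step * lam)
    return b0, beta


# Synthetic "pixels": 30 features, only the first 3 carry signal (hypothetical).
rng = np.random.default_rng(0)
n, p = 400, 30
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -3.0, 3.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

b0, beta = lasso_logistic(X, y, lam=0.1)
selected = np.flatnonzero(beta)
accuracy = np.mean(((b0 + X @ beta) > 0) == (y == 1))
print("selected features:", selected, "training accuracy:", round(accuracy, 3))
```

The soft-thresholding step sets weak coefficients exactly to zero, which is why the fitted model uses only a subset of the features. In practice the penalty would be chosen by cross-validation rather than fixed in advance.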