Liquid chromatography coupled with bottom-up mass spectrometry (LC-MS/MS)–based proteomics is increasingly used to detect changes in posttranslational modifications (PTMs) in samples from different ...conditions. Analysis of data from such experiments faces numerous statistical challenges. These include the low abundance of modified proteoforms, the small number of observed peptides that span modification sites, and confounding between changes in the abundance of PTM and the overall changes in the protein abundance. Therefore, statistical approaches for detecting differential PTM abundance must integrate all the available information pertaining to a PTM site and consider all the relevant sources of confounding and variation. In this manuscript, we propose such a statistical framework, which is versatile, accurate, and leads to reproducible results. The framework requires an experimental design, which quantifies, for each sample, both peptides with PTMs and peptides from the same proteins with no modification sites. The proposed framework supports both label-free and tandem mass tag-based LC-MS/MS acquisitions. The statistical methodology separately summarizes the abundances of peptides with and without the modification sites, by fitting separate linear mixed effects models appropriate for the experimental design. Next, model-based inferences regarding the PTM and the protein-level abundances are combined to account for the confounding between these two sources. Evaluations on computer simulations, a spike-in experiment with known ground truth, and three biological experiments with different organisms, modification types, and data acquisition types demonstrate the improved fold change estimation and detection of differential PTM abundance, as compared to currently used approaches. The proposed framework is implemented in the free and open-source R/Bioconductor package MSstatsPTM.
Display omitted
•MSstatsPTM differentiates statistical changes in PTMs and global protein abundances.•Novel methodology removes confounding between the modification and global protein.•MSstatsPTM is an open-source R package available on Bioconductor.•The package is applicable to a variety of experimental designs and labeling methods.•Benchmarked on simulated data, a spike-in mixture, and biological investigations.
MSstatsPTM detects statistical differences between posttranslational modifications and global protein abundances. It includes novel methodology to remove confounding between the effect of individual modification versus the global protein. MSstatsPTM is implemented as an open-source R package available on Bioconductor. The package can be applied to a variety of experimental designs, including those which are label-free and label-based. The package has been benchmarked on several datasets including simulated datasets, a spike-in controlled investigation, and multiple biological experiments.
Abstract
Motivation
Mass spectrometry imaging (MSI) characterizes the molecular composition of tissues at spatial resolution, and has a strong potential for distinguishing tissue types, or disease ...states. This can be achieved by supervised classification, which takes as input MSI spectra, and assigns class labels to subtissue locations. Unfortunately, developing such classifiers is hindered by the limited availability of training sets with subtissue labels as the ground truth. Subtissue labeling is prohibitively expensive, and only rough annotations of the entire tissues are typically available. Classifiers trained on data with approximate labels have sub-optimal performance.
Results
To alleviate this challenge, we contribute a semi-supervised approach mi-CNN. mi-CNN implements multiple instance learning with a convolutional neural network (CNN). The multiple instance aspect enables weak supervision from tissue-level annotations when classifying subtissue locations. The convolutional architecture of the CNN captures contextual dependencies between the spectral features. Evaluations on simulated and experimental datasets demonstrated that mi-CNN improved the subtissue classification as compared to traditional classifiers. We propose mi-CNN as an important step toward accurate subtissue classification in MSI, enabling rapid distinction between tissue types and disease states.
Availability and implementation
The data and code are available at https://github.com/Vitek-Lab/mi-CNN_MSI.
Liquid chromatography coupled with bottom-up mass spectrometry (LC-MS/MS)-based proteomics is a versatile technology for identifying and quantifying proteins in complex biological mixtures. ...Postidentification, analysis of changes in protein abundances between conditions requires increasingly complex and specialized statistical methods. Many of these methods, in particular the family of open-source Bioconductor packages MSstats, are implemented in a coding language such as R. To make the methods in MSstats accessible to users with limited programming and statistical background, we have created MSstatsShiny, an R-Shiny graphical user interface (GUI) integrated with MSstats, MSstatsTMT, and MSstatsPTM. The GUI provides a point and click analysis pipeline applicable to a wide variety of proteomics experimental types, including label-free data-dependent acquisitions (DDAs) or data-independent acquisitions (DIAs), or tandem mass tag (TMT)-based TMT-DDAs, answering questions such as relative changes in the abundance of peptides, proteins, or post-translational modifications (PTMs). To support reproducible research, the application saves user’s selections and builds an R script that programmatically recreates the analysis. MSstatsShiny can be installed locally via Github and Bioconductor, or utilized on the cloud at www.msstatsshiny.com. We illustrate the utility of the platform using two experimental data sets (MassIVE IDs MSV000086623 and MSV000085565).
Causal inference, the task of uncovering regulatory relationships between components of biomolecular pathways and networks, is a primary goal of many high-throughput investigations. Statistical ...associations between observed protein concentrations can suggest an enticing number of hypotheses regarding the underlying causal interactions, but when do such associations reflect the underlying causal biomolecular mechanisms? The goal of this perspective is to provide suggestions for causal inference in large-scale experiments, which utilize high-throughput technologies such as mass-spectrometry-based proteomics. We describe in nontechnical terms the pitfalls of inference in large data sets and suggest methods to overcome these pitfalls and reliably find regulatory associations.
Proteins regulate biological processes by changing their structure or abundance to accomplish a specific function. In response to a perturbation, protein structure may be altered by various molecular ...events, such as post-translational modifications, protein-protein interactions, aggregation, allostery or binding to other molecules. The ability to probe these structural changes in thousands of proteins simultaneously in cells or tissues can provide valuable information about the functional state of biological processes and pathways. Here, we present an updated protocol for LiP-MS, a proteomics technique combining limited proteolysis with mass spectrometry, to detect protein structural alterations in complex backgrounds and on a proteome-wide scale. In LiP-MS, proteins undergo a brief proteolysis in native conditions followed by complete digestion in denaturing conditions, to generate structurally informative proteolytic fragments that are analyzed by mass spectrometry. We describe advances in the throughput and robustness of the LiP-MS workflow and implementation of data-independent acquisition-based mass spectrometry, which together achieve high reproducibility and sensitivity, even on large sample sizes. We introduce MSstatsLiP, an R package dedicated to the analysis of LiP-MS data for the identification of structurally altered peptides and differentially abundant proteins. The experimental procedures take 3 d, mass spectrometric measurement time and data processing depend on sample number and statistical analysis typically requires ~1 d. These improvements expand the adaptability of LiP-MS and enable wide use in functional proteomics and translational applications.
We introduce matter , an R package for direct interactions with larger-than-memory datasets, stored in an arbitrary number of files of any size. matter is primarily designed for datasets in new and ...rapidly evolving file formats, which may lack extensive software support. matter enables a wide variety of data exploration and manipulation steps, and is extensible to many bioinformatics applications. It supports reproducible research by minimizing the need of converting and storing data in multiple formats. We illustrate the performance of matter in conjunction with the Bioconductor package Cardinal for analysis of high-resolution, high-throughput mass spectrometry imaging experiments.
The package, vignettes, and examples of applications in several areas of bioinformatics are available open-source at www.bioconductor.org under the Artistic-2.0 license.
o.vitek@neu.edu.
Abstract
Causal query estimation in biomolecular networks commonly selects a ‘valid adjustment set’, i.e. a subset of network variables that eliminates the bias of the estimator. A same query may ...have multiple valid adjustment sets, each with a different variance. When networks are partially observed, current methods use graph-based criteria to find an adjustment set that minimizes asymptotic variance. Unfortunately, many models that share the same graph topology, and therefore same functional dependencies, may differ in the processes that generate the observational data. In these cases, the topology-based criteria fail to distinguish the variances of the adjustment sets. This deficiency can lead to sub-optimal adjustment sets, and to miss-characterization of the effect of the intervention. We propose an approach for deriving ‘optimal adjustment sets’ that takes into account the nature of the data, bias and finite-sample variance of the estimator, and cost. It empirically learns the data generating processes from historical experimental data, and characterizes the properties of the estimators by simulation. We demonstrate the utility of the proposed approach in four biomolecular Case studies with different topologies and different data generation processes. The implementation and reproducible Case studies are at https://github.com/srtaheri/OptimalAdjustmentSet.
The MSstats R-Bioconductor family of packages is widely used for statistical analyses of quantitative bottom-up mass spectrometry-based proteomic experiments to detect differentially abundant ...proteins. It is applicable to a variety of experimental designs and data acquisition strategies and is compatible with many data processing tools used to identify and quantify spectral features. In the face of ever-increasing complexities of experiments and data processing strategies, the core package of the family, with the same name MSstats, has undergone a series of substantial updates. Its new version MSstats v4.0 improves the usability, versatility, and accuracy of statistical methodology, and the usage of computational resources. New converters integrate the output of upstream processing tools directly with MSstats, requiring less manual work by the user. The package’s statistical models have been updated to a more robust workflow. Finally, MSstats’ code has been substantially refactored to improve memory use and computation speed. Here we detail these updates, highlighting methodological differences between the new and old versions. An empirical comparison of MSstats v4.0 to its previous implementations, as well as to the packages MSqRob and DEqMS, on controlled mixtures and biological experiments demonstrated a stronger performance and better usability of MSstats v4.0 as compared to existing methods.
Summaries of protein abundance are undermined by mass spectrometric features inconsistent with the overall protein profile. We develop an automated statistical approach that detects such inconsistent ...features. These features can be separately investigated, curated, or removed. Accounting for such inconsistent features improves the estimation and the detection of differential protein abundance between conditions. The approach is implemented as an option in the open-source R-based software MSstats.
Display omitted
Highlights
•Automated statistical approach for detecting uninformative features and outliers.•Improved performance on relative protein quantification.•An option in the open-source R-based software MSstats.
In bottom-up mass spectrometry-based proteomics, relative protein quantification is often achieved with data-dependent acquisition (DDA), data-independent acquisition (DIA), or selected reaction monitoring (SRM). These workflows quantify proteins by summarizing the abundances of all the spectral features of the protein (e.g. precursor ions, transitions or fragments) in a single value per protein per run. When abundances of some features are inconsistent with the overall protein profile (for technological reasons such as interferences, or for biological reasons such as post-translational modifications), the protein-level summaries and the downstream conclusions are undermined. We propose a statistical approach that automatically detects spectral features with such inconsistent patterns. The detected features can be separately investigated, and if necessary, removed from the data set. We evaluated the proposed approach on a series of benchmark-controlled mixtures and biological investigations with DDA, DIA and SRM data acquisitions. The results demonstrated that it could facilitate and complement manual curation of the data. Moreover, it can improve the estimation accuracy, sensitivity and specificity of detecting differentially abundant proteins, and reproducibility of conclusions across different data processing tools. The approach is implemented as an option in the open-source R-based software MSstats.