Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry-based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.
Missing values weaken the power of label-free quantitative proteomic experiments to uncover true quantitative differences between biological samples or experimental conditions. Match-between-runs (MBR) has become a common approach to mitigate the missing value problem, where peptides identified by tandem mass spectra in one run are transferred to another by inference based on m/z, charge state, retention time, and ion mobility when applicable. Though tolerances are used to ensure such transferred identifications are reasonably located and meet certain quality thresholds, little work has been done to evaluate the statistical confidence of MBR. Here, we present a mixture model-based approach to estimate the false discovery rate (FDR) of peptide and protein identification transfer, which we implement in the label-free quantification tool IonQuant. Using several benchmarking datasets generated on both Orbitrap and timsTOF mass spectrometers, we demonstrate superior performance of IonQuant with FDR-controlled MBR compared with MaxQuant (19–38 times faster; 6–18% more proteins quantified and with comparable or better accuracy). We further illustrate the performance of IonQuant and highlight the need for FDR-controlled MBR in two single-cell proteomics experiments, including one acquired with the help of high-field asymmetric ion mobility spectrometry separation. Fully integrated in the FragPipe computational environment, IonQuant with FDR-controlled MBR enables fast and accurate peptide and protein quantification in label-free proteomics experiments.
• A mixture-model approach controls the false discovery rate of match-between-runs.
• The method is implemented in IonQuant.
• Experiments with various data types show high sensitivity and accuracy of IonQuant.
Match-between-runs is a powerful approach to mitigate the missing value problem in label-free quantification. It transfers features identified by MS/MS from one run to the other, but previously, there was no false discovery rate control over this process. We present a mixture model–based approach to estimate and control the false discovery rate, which we have implemented in IonQuant. We demonstrate the sensitivity, accuracy, and speed of IonQuant using proteomics data from timsTOF, Orbitrap, and Orbitrap coupled to FAIMS.
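The idea of estimating an FDR from a mixture model can be illustrated with a minimal sketch. It assumes, purely for illustration, that each transferred identification carries a single numeric score and that true and false transfers each follow a Gaussian score distribution. The function names, the two-Gaussian form, and the score itself are assumptions made here and are not taken from IonQuant.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_false(x, w, mu, sd):
    """Posterior probability that a transfer with score x is false (component 0)."""
    p_false = w[0] * normal_pdf(x, mu[0], sd[0])
    p_true = w[1] * normal_pdf(x, mu[1], sd[1])
    total = p_false + p_true
    return p_false / total if total > 0 else 0.5

def fit_two_gaussians(scores, n_iter=100):
    """EM fit of a two-component 1-D Gaussian mixture (hypothetical score model).

    Component 0 is initialised on the lower half of the scores ("false"),
    component 1 on the upper half ("true").
    """
    s = sorted(scores)
    half = len(s) // 2
    groups = [s[:half], s[half:]]
    mu = [sum(g) / len(g) for g in groups]
    sd = [max(math.sqrt(sum((x - m) ** 2 for x in g) / len(g)), 1e-3)
          for g, m in zip(groups, mu)]
    w = [0.5, 0.5]
    n = len(scores)
    for _ in range(n_iter):
        # E-step: responsibility of the "true" component for every score
        resp = [1.0 - posterior_false(x, w, mu, sd) for x in scores]
        n_true = sum(resp)
        n_false = n - n_true
        if n_true < 1e-9 or n_false < 1e-9:
            break  # one component vanished; keep the last valid parameters
        # M-step: weighted means, standard deviations and mixing weights
        mu = [sum((1 - r) * x for r, x in zip(resp, scores)) / n_false,
              sum(r * x for r, x in zip(resp, scores)) / n_true]
        sd = [max(math.sqrt(sum((1 - r) * (x - mu[0]) ** 2
                                for r, x in zip(resp, scores)) / n_false), 1e-3),
              max(math.sqrt(sum(r * (x - mu[1]) ** 2
                                for r, x in zip(resp, scores)) / n_true), 1e-3)]
        w = [n_false / n, n_true / n]
    return w, mu, sd

def estimated_fdr(scores, threshold, w, mu, sd):
    """Estimated FDR among scores >= threshold: mean posterior error probability."""
    accepted = [x for x in scores if x >= threshold]
    if not accepted:
        return 0.0
    return sum(posterior_false(x, w, mu, sd) for x in accepted) / len(accepted)
```

In such a scheme one would fit the mixture to all candidate transfers, then pick the lowest score threshold whose estimated FDR stays below the chosen level (for example 1%).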
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide-to-spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from peptide to protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
The dia-PASEF technology uses ion mobility separation to reduce signal interferences and increase sensitivity in proteomic experiments. Here we present a two-dimensional peak-picking algorithm and the generation of optimized spectral libraries, and we take advantage of neural network-based processing of dia-PASEF data. Our computational platform boosts proteomic depth by up to 83% compared to previous work, and is specifically beneficial for fast proteomic experiments and those with low sample amounts. It quantifies over 5300 proteins in single injections recorded at 200 samples per day throughput using the Evosep One chromatography system on a timsTOF Pro mass spectrometer and almost 9000 proteins in single injections recorded with a 93-min nanoflow gradient on a timsTOF Pro 2, from 200 ng of HeLa peptides. A user-friendly implementation is provided through the incorporation of the algorithms in the DIA-NN software and by the FragPipe workflow for spectral library generation.
Deisotoping, or the process of removing peaks in a mass spectrum resulting from the incorporation of naturally occurring heavy isotopes, has long been used to reduce complexity and improve the effectiveness of spectral annotation methods in proteomics. We have previously described MSFragger, an ultrafast search engine for proteomics, that did not utilize deisotoping in processing input spectra. Here, we present a new, high-speed parallelized deisotoping algorithm, based on elements of several existing methods, that we have incorporated into the MSFragger search engine. Applying deisotoping with MSFragger reveals substantial improvements to database search speed and performance, particularly for complex methods like open or nonspecific searches. Finally, we evaluate our deisotoping method on data from several instrument types and vendors, revealing a wide range in performance and offering an updated perspective on deisotoping in the modern proteomics environment.
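As a rough illustration of what deisotoping does, the sketch below collapses isotope clusters into their monoisotopic peak using peak spacing alone. It is a deliberately simple greedy procedure, not the algorithm incorporated into MSFragger: the function name, the tolerance, the maximum charge, and the absence of any intensity-pattern check are all assumptions made for this example.

```python
NEUTRON = 1.00335  # approximate 13C - 12C mass difference in Da

def deisotope(peaks, max_charge=3, tol=0.01):
    """Collapse isotope clusters by m/z spacing only (illustrative sketch).

    peaks: iterable of (mz, intensity) pairs.
    Returns a list of (monoisotopic_mz, summed_intensity, charge);
    charge is 0 when no isotope partner was found.
    """
    peaks = sorted(peaks)
    used = [False] * len(peaks)
    result = []
    for i, (mz, _) in enumerate(peaks):
        if used[i]:
            continue
        best_chain, best_z = [i], 0
        # Try higher charges first so that ties favour the tighter spacing
        for z in range(max_charge, 0, -1):
            chain = [i]
            expected = mz + NEUTRON / z
            for j in range(i + 1, len(peaks)):
                if used[j]:
                    continue
                mj = peaks[j][0]
                if mj > expected + tol:
                    break  # peaks are sorted, nothing further can match
                if abs(mj - expected) <= tol:
                    chain.append(j)
                    expected = mz + len(chain) * NEUTRON / z
            if len(chain) > len(best_chain):
                best_chain, best_z = chain, z
        for j in best_chain:
            used[j] = True
        total = sum(peaks[j][1] for j in best_chain)
        result.append((mz, total, best_z))
    return result
```

A production implementation would typically also compare the observed intensities against an expected isotope envelope and handle overlapping clusters, both of which this sketch ignores.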
There is a need to better understand and handle the 'dark matter' of proteomics: the vast diversity of post-translational and chemical modifications that are unaccounted for in a typical mass spectrometry-based analysis and thus remain unidentified. We present a fragment-ion indexing method, and its implementation in the peptide identification tool MSFragger, that enables a more than 100-fold improvement in speed over most existing proteome database search tools. Using several large proteomic data sets, we demonstrate how MSFragger empowers the open database search concept for comprehensive identification of peptides and all their modified forms, uncovering dramatic differences in modification rates across experimental samples and conditions. We further illustrate its utility using protein-RNA cross-linked peptide data and using affinity purification experiments where we observe, on average, a 300% increase in the number of identified spectra for enriched proteins. We also discuss the benefits of open searching for improved false discovery rate estimation in proteomics.
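The general idea behind a fragment-ion index can be sketched as an inverted index from binned theoretical fragment m/z values to peptide identifiers, so that each observed peak is looked up directly instead of scoring every candidate peptide in turn. The code below is a simplified hash-bin variant written for illustration. The function names, the bin width, and the restriction to singly charged b and y ions are assumptions, and the actual MSFragger data structure is more elaborate.

```python
from collections import defaultdict

# Monoisotopic residue masses in Da
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565

def fragment_mzs(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    total = sum(masses)
    out, prefix = [], 0.0
    for m in masses[:-1]:
        prefix += m
        out.append(prefix + PROTON)                  # b ion
        out.append(total - prefix + WATER + PROTON)  # complementary y ion
    return out

def build_index(peptides, bin_width=0.02):
    """Map each m/z bin to the set of peptide ids with a fragment in that bin."""
    index = defaultdict(set)
    for pid, pep in enumerate(peptides):
        for mz in fragment_mzs(pep):
            index[int(mz / bin_width)].add(pid)
    return index

def count_matches(index, spectrum_mzs, bin_width=0.02):
    """Count, per peptide id, how many observed peaks hit an indexed fragment.

    Neighbouring bins are also checked, which gives a crude mass tolerance
    of roughly one to two bin widths.
    """
    counts = defaultdict(int)
    for mz in spectrum_mzs:
        b = int(mz / bin_width)
        hits = set()
        for nb in (b - 1, b, b + 1):
            hits |= index.get(nb, set())
        for pid in hits:
            counts[pid] += 1
    return dict(counts)
```

Because the lookup does not depend on the precursor mass, the same index can serve an open search in which the precursor tolerance spans hundreds of daltons. Peptides with the most matched fragments would then be passed on to full scoring.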
Glycoproteomics, or characterizing glycosylation events at a proteome scale, has seen rapid advances in methods for analyzing glycopeptides by tandem mass spectrometry in recent years. These advances have enabled acquisition of far more comprehensive and large-scale datasets, precipitating an urgent need for improved informatics methods to analyze the resulting data. A new generation of glycoproteomics search methods has recently emerged, using glycan fragmentation to split the identification of a glycopeptide into peptide and glycan components and solve each component separately. In this review, we discuss these new methods and their implications for large-scale glycoproteomics, as well as several outstanding challenges in glycoproteomics data analysis, including validation of glycan assignments and quantitation. Finally, we provide an outlook on the future of glycoproteomics from an informatics perspective, noting the key challenges to achieving widespread and reproducible glycopeptide annotation and quantitation.
As a result of recent improvements in mass spectrometry (MS), there is increased interest in data-independent acquisition (DIA) strategies in which all peptides are systematically fragmented using wide mass-isolation windows ('multiplex fragmentation'). DIA-Umpire (http://diaumpire.sourceforge.net/), a comprehensive computational workflow and open-source software for DIA data, detects precursor and fragment chromatographic features and assembles them into pseudo-tandem MS spectra. These spectra can be identified with conventional database-searching and protein-inference tools, allowing sensitive, untargeted analysis of DIA data without the need for a spectral library. Quantification is done with both precursor- and fragment-ion intensities. Furthermore, DIA-Umpire enables targeted extraction of quantitative information based on peptides initially identified in only a subset of the samples, resulting in more consistent quantification across multiple samples. We demonstrated the performance of the method with control samples of varying complexity and publicly available glycoproteomics and affinity purification-MS data.
Identification of post-translationally or chemically modified peptides in mass spectrometry-based proteomics experiments is a crucial yet challenging task. We have recently introduced a fragment ion indexing method and the MSFragger search engine to empower an open search strategy for comprehensive analysis of modified peptides. However, this strategy does not consider fragment ions shifted by unknown modifications, preventing modification localization and limiting the sensitivity of the search. Here we present a localization-aware open search method, in which both modification-containing (shifted) and regular fragment ions are indexed and used in scoring. We also implement a fast mass calibration and optimization method, allowing optimization of the mass tolerances and other key search parameters. We demonstrate that MSFragger with mass calibration and localization-aware open search identifies modified peptides with significantly higher sensitivity and accuracy. Comparing MSFragger to other modification-focused tools (pFind3, MetaMorpheus, and TagGraph) shows that MSFragger remains an excellent option for fast, comprehensive, and sensitive searches for modified peptides in shotgun proteomics data.
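To make the notion of shifted fragment ions concrete, the following sketch generates b ions with a mass shift applied from a candidate site onward and picks the site whose shifted and regular ions together match the most observed peaks. It is a toy illustration under stated assumptions (b ions only, singly charged, a brute-force comparison, hypothetical function names) and does not reproduce MSFragger's indexing or scoring.

```python
# Monoisotopic residue masses in Da
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276

def b_ions(peptide, shift=0.0, site=None):
    """Singly charged b-ion m/z values.

    If `site` (0-based residue index) is given, every b ion that contains
    that residue is shifted by `shift`; the others stay regular.
    """
    out, prefix = [], 0.0
    for i, aa in enumerate(peptide[:-1]):
        prefix += RESIDUE[aa]
        extra = shift if site is not None and i >= site else 0.0
        out.append(prefix + PROTON + extra)
    return out

def localize(peptide, observed, shift, tol=0.01):
    """Return (best_site, per_site_match_counts) for a given mass shift."""
    def n_matched(theoretical):
        return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)
    scores = [n_matched(b_ions(peptide, shift, s)) for s in range(len(peptide))]
    best = max(range(len(peptide)), key=scores.__getitem__)
    return best, scores
```

With b ions alone, a modification on the C-terminal residue cannot be distinguished from an unlocalized shift, so a fuller sketch would include y ions as well.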