The rapid evolution of HIV is constrained by interactions between mutations which affect viral fitness. In this work, we explore the role of epistasis in determining the mutational fitness landscape of HIV for multiple drug target proteins, including Protease, Reverse Transcriptase, and Integrase. Epistatic interactions between residues modulate the mutation patterns involved in drug resistance, with unambiguous signatures of epistasis best seen in the comparison of the Potts model predicted and experimental HIV sequence "prevalences" expressed as higher-order marginals (beyond triplets) of the sequence probability distribution. In contrast, experimental measures of fitness such as viral replicative capacities generally probe fitness effects of point mutations in a single background, providing weak evidence for epistasis in viral systems. The detectable effects of epistasis are obscured by higher evolutionary conservation at sites. While double mutant cycles, in principle, provide one of the best ways to probe epistatic interactions experimentally without reference to a particular background, we show that the analysis is complicated by the small dynamic range of measurements. Overall, we show that global pairwise interaction Potts models are necessary for predicting the mutational landscape of viral proteins.
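The double mutant cycle mentioned above can be illustrated with a minimal sketch. It assumes a pairwise Potts energy of the form E(S) = sum_i h[i, s_i] + sum_{i<j} J[i, j, s_i, s_j]; the function names, array layout, and random toy parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def potts_energy(seq, h, J):
    """E(S) = sum_i h[i, s_i] + sum_{i<j} J[i, j, s_i, s_j] (assumed convention)."""
    L = len(seq)
    e = sum(h[i, seq[i]] for i in range(L))
    e += sum(J[i, j, seq[i], seq[j]] for i in range(L) for j in range(i + 1, L))
    return float(e)

def double_mutant_cycle(seq, i, a, j, b, h, J):
    """Return (dE_i, dE_j, dE_ij, epistasis) for the mutations i->a and j->b."""
    wt = np.array(seq)
    mut_i, mut_j, mut_ij = wt.copy(), wt.copy(), wt.copy()
    mut_i[i] = a
    mut_j[j] = b
    mut_ij[i], mut_ij[j] = a, b
    e_wt = potts_energy(wt, h, J)
    dE_i = potts_energy(mut_i, h, J) - e_wt
    dE_j = potts_energy(mut_j, h, J) - e_wt
    dE_ij = potts_energy(mut_ij, h, J) - e_wt
    # Epistasis: deviation of the double mutant from additivity.
    return dE_i, dE_j, dE_ij, dE_ij - dE_i - dE_j

# Toy model with random parameters (illustrative only, not fitted to any MSA).
rng = np.random.default_rng(0)
L, q = 6, 4
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q))   # only entries with i < j are used
wild_type = rng.integers(0, q, size=L)
a_new = (wild_type[1] + 1) % q
b_new = (wild_type[4] + 1) % q
dE_i, dE_j, dE_ij, epistasis = double_mutant_cycle(wild_type, 1, a_new, 4, b_new, h, J)
print(f"dE_i={dE_i:.3f} dE_j={dE_j:.3f} dE_ij={dE_ij:.3f} epistasis={epistasis:.3f}")
```

In a purely pairwise model, the epistatic term reduces to a combination of four entries of the coupling block J[i, j], so it does not depend on the rest of the sequence, whereas each single-mutant dE does.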
Single-point mutations in kinase proteins can affect their stability and fitness, and computational analysis of these effects can provide insights into the relationships among protein sequence, structure, and function for this enzyme family. To assess the impact of mutations on protein stability, we used a sequence-based Potts Hamiltonian model trained on a kinase family multiple-sequence alignment (MSA) to calculate the statistical energy (fitness) effects of mutations and compared these against relative folding free energies (ΔΔGs) calculated from all-atom molecular dynamics free energy perturbation (FEP) simulations in explicit solvent. The fitness effects of mutations in the Potts model (ΔEs) showed good agreement with experimental thermostability data (Pearson r = 0.68), similar to the correlation we observed with ΔΔGs predicted from structure-based relative FEP simulations. Recognizing the possible advantages of using Potts models to rapidly estimate protein stability effects of kinase mutations seen in cancer genomics data, we used the Potts statistical energy model to estimate the stability effects of 65 conservative and nonconservative mutations across three distinct kinases (Wee1, Abl1, and Cdc7) with somatic mutations reported in the Genomic Data Commons (GDC) database. The ΔEs of these mutations calculated from the Potts model are consistent with the corresponding ΔΔGs from FEP simulations (Pearson r = 0.72). The agreement between these methods suggests that the Potts model may be used as a sequence-based tool for high-throughput screening of mutational effects as part of a computational pipeline for predicting the stability effects of mutations. We also demonstrate how the scalability of the fitness-based Potts model calculations permits analyses that are not easily accessed using FEP simulations. To this end, we employed site-saturation mutagenesis in the Potts model in order to investigate the relative stability effects of mutations seen in different cancer evolutionary scenarios. We used this approach to analyze the effects of drug pressure in Abl kinase by contrasting the relative fitness penalties of somatic mutations seen in miscellaneous cancer types with those calculated for mutations associated with cancer drug resistance. We observed that, in contrast to somatic mutations of Abl seen in various tumors that appear to have evolved neutrally, cancer mutations that evolved under drug pressure in Abl-targeted therapies tend to preserve enzyme stability.
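A site-saturation scan of the kind described above is inexpensive in a Potts model, because each point mutation changes only one field term and L - 1 coupling terms. The following is a minimal sketch under stated assumptions: the energy convention E(S) = sum_i h[i, s_i] + sum_{i<j} J[i, j, s_i, s_j], a coupling array stored symmetrically so that J[i, j, a, b] == J[j, i, b, a], and random toy parameters. The function names are hypothetical and this is not the authors' pipeline.

```python
import numpy as np

def energy(seq, h, J):
    """Full Potts statistical energy of one sequence (assumed convention)."""
    L = len(seq)
    e = sum(h[i, seq[i]] for i in range(L))
    e += sum(J[i, j, seq[i], seq[j]] for i in range(L) for j in range(i + 1, L))
    return float(e)

def saturation_scan(seq, h, J):
    """dE[i, a] = E(seq with residue a at site i) - E(seq), for every site and residue.

    Assumes J is stored symmetrically: J[i, j, a, b] == J[j, i, b, a].
    """
    seq = np.asarray(seq)
    L, q = h.shape
    dE = np.zeros((L, q))
    for i in range(L):
        others = np.array([j for j in range(L) if j != i])
        for a in range(q):
            d_field = h[i, a] - h[i, seq[i]]
            d_coupling = (J[i, others, a, seq[others]]
                          - J[i, others, seq[i], seq[others]]).sum()
            dE[i, a] = d_field + d_coupling
    return dE

# Toy parameters (illustrative only, not a fitted kinase model).
rng = np.random.default_rng(1)
L, q = 8, 5
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q))
J = 0.5 * (J + J.transpose(1, 0, 3, 2))   # enforce J[i,j,a,b] == J[j,i,b,a]
wild_type = rng.integers(0, q, size=L)

dE = saturation_scan(wild_type, h, J)
site, residue = np.unravel_index(np.argmax(dE), dE.shape)
print(f"largest dE: site {site}, residue {residue}, dE = {dE[site, residue]:.3f}")
```

Under the assumed sign convention with P(S) proportional to exp(-E(S)), a positive dE marks a mutation that the model considers less probable than the reference residue.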
Inverse Ising inference is a method for inferring the coupling parameters of a Potts/Ising model based on observed site-covariation, which has found important applications in protein physics for detecting interactions between residues in protein families. We introduce Mi3-GPU (“mee-three”, for MCMC Inverse Ising Inference) software for solving the inverse Ising problem for protein-sequence datasets with few analytic approximations, by parallel Markov-Chain Monte Carlo sampling on GPUs. We also provide tools for analysis and preparation of protein-family Multiple Sequence Alignments (MSAs) to account for finite-sampling issues, which are a major source of error or bias in inverse Ising inference. Our method is “generative” in the sense that the inferred model can be used to generate synthetic MSAs whose mutational statistics (marginals) can be verified to match the dataset MSA statistics up to the limits imposed by the effects of finite sampling. Our GPU implementation enables the construction of models which reproduce the covariation patterns of the observed MSA with a precision that is not possible with more approximate methods. The main components of our method are a GPU-optimized algorithm to greatly accelerate MCMC sampling, combined with a multi-step Quasi-Newton parameter-update scheme using a “Zwanzig reweighting” technique. We demonstrate the ability of this software to produce generative models on typical protein family datasets for sequence lengths L∼300 with 21 residue types with tens of millions of inferred parameters in short running times.
Program Title: Mi3-GPU
Program Files doi:http://dx.doi.org/10.17632/ftbcfy2p35.1
Licensing provisions: GPLv3
Programming languages: Python3, OpenCL, C
Nature of problem: Mi3-GPU solves the inverse Ising problem for application in protein covariation analysis. The goal is to infer “coupling” parameters between positions in a Multiple Sequence Alignment of a protein family, with many applications including protein-contact prediction and fitness prediction.
Solution method: Mi3-GPU solves the inverse Ising problem with few approximations using Markov-Chain Monte Carlo methods with Quasi-Newton optimization on GPUs. This problem has previously been approached by more approximate methods using analytic approximations including “message passing”, “susceptibility propagation”, “mean-field” methods, pseudolikelihood approximations, and cluster expansion. The software leverages GPUs to accelerate MCMC sampling and a histogram reweighting technique to accelerate parameter optimization.
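The reweighting idea in the solution method can be sketched in a few lines: sequences sampled under the current parameters are reused to estimate marginals under trial parameters by weighting each sample with exp(-(E_new - E_old)). This is a CPU-only NumPy illustration under an assumed energy convention, with hypothetical function names and toy parameters; it is not the Mi3-GPU implementation.

```python
import numpy as np

def energies(seqs, h, J):
    """Potts energies of N sequences: sum_i h[i, s_i] + sum_{i<j} J[i, j, s_i, s_j]."""
    N, L = seqs.shape
    e = h[np.arange(L), seqs].sum(axis=1)
    for i in range(L - 1):
        for j in range(i + 1, L):
            e = e + J[i, j, seqs[:, i], seqs[:, j]]
    return e

def reweighted_site_marginals(seqs, h_old, J_old, h_new, J_new):
    """Estimate one-site marginals under new parameters from samples of the old model."""
    q = h_old.shape[1]
    log_w = -(energies(seqs, h_new, J_new) - energies(seqs, h_old, J_old))
    log_w -= log_w.max()                    # stabilize the exponentials
    w = np.exp(log_w)
    w /= w.sum()
    one_hot = np.eye(q)[seqs]               # shape (N, L, q)
    marginals = np.einsum("n,nlq->lq", w, one_hot)
    n_eff = 1.0 / np.sum(w ** 2)            # effective sample size diagnostic
    return marginals, n_eff

# Toy setup: the old model is uniform (all parameters zero), so uniformly random
# sequences are exact samples from it. The trial parameters are a small random step.
rng = np.random.default_rng(2)
N, L, q = 20000, 4, 3
h_old, J_old = np.zeros((L, q)), np.zeros((L, L, q, q))
h_new = 0.3 * rng.normal(size=(L, q))
J_new = 0.1 * rng.normal(size=(L, L, q, q))
seqs = rng.integers(0, q, size=(N, L))

marginals, n_eff = reweighted_site_marginals(seqs, h_old, J_old, h_new, J_new)
print(np.round(marginals, 3))
print(f"effective sample size: {n_eff:.0f} of {N}")
```

The effective sample size drops as the trial parameters move away from the sampled ones, which is why reweighting is only useful for a limited number of update steps before fresh sampling is needed.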
• A Potts Hamiltonian can be inferred from a protein multiple sequence alignment.
• Potts energies predict sequence-dependent structure and free energy landscapes.
• Potts energies serve as a proxy for protein fitness, incorporating epistasis.
• We review inference techniques and their use in particular Potts model applications.
Potts Hamiltonian models of protein sequence co-variation are statistical models constructed from the pair correlations observed in a multiple sequence alignment (MSA) of a protein family. These models are powerful because they capture higher order correlations induced by mutations evolving under constraints and help quantify the connections between protein sequence, structure, and function maintained through evolution. We review recent work with Potts models to predict protein structure and sequence-dependent conformational free energy landscapes, to survey protein fitness landscapes and to explore the effects of epistasis on fitness. We also comment on the numerical methods used to infer these models for each application.
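For reference, a pairwise Potts model over sequences S = (s_1, ..., s_L) with q residue types is commonly written as below. The sign convention (lower statistical energy means higher probability) and the index notation are assumptions chosen for illustration; conventions differ between papers.

```latex
P(S) = \frac{1}{Z}\, e^{-E(S)}, \qquad
E(S) = \sum_{i=1}^{L} h^{i}_{s_i} + \sum_{1 \le i < j \le L} J^{ij}_{s_i s_j}, \qquad
Z = \sum_{S'} e^{-E(S')}

% Statistical energy change of a mutation relative to a reference sequence
\Delta E = E(S_{\mathrm{mut}}) - E(S_{\mathrm{ref}})
```

Before any gauge fixing, such a model has qL field parameters and q^2 L(L-1)/2 coupling parameters.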
Immune cell infiltration in the tumor microenvironment is of prognostic and therapeutic import. These immune cell subsets can be heterogeneous and are composed of mature antigen-presenting cells, helper and effector cytotoxic T cells, tolerogenic dendritic cells, tumor-associated macrophages, and regulatory T cells, among other cell types. With the development of novel drugs that target the immune system rather than the cancer cells, the tumor immune microenvironment is not only prognostic for overall patient outcome, but also predictive for likelihood of response to these immune-targeted therapies. Such therapies aim to reverse the cancer immunotolerance and trigger an effective antitumor immune response. Two major families of immunostimulatory drugs are currently in clinical development: pattern recognition receptor agonists (PRRago) and immunostimulatory monoclonal antibodies (ISmAb). Despite their immune-targeted design, these agents have so far been developed clinically as if they were typical anticancer drugs. Here, we review the limitations of this conventional approach, specifically addressing the shortcomings of the usual schedules of intravenous infusions every 2 or 3 weeks. If the new modalities of immunotherapy target specific immune cells within the tumor microenvironment, it might be preferable to deliver them locally into the tumor rather than systemically. There is preclinical and clinical evidence that a therapeutic systemic antitumor immune response can be generated upon intratumoral immunomodulation. Moreover, preclinical results have shown that therapeutic synergy can be obtained by combining PRRagos and ISmAbs at the local tumor site.
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
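One way to make a higher-order statistic of this kind concrete is to compare the frequencies of the most common subsequences ("words") at a chosen set of positions between a reference MSA and a model-generated MSA. The sketch below is an illustration of that idea only: the function names, the toy data, and the use of a Pearson correlation over the top words are assumptions, not the metric definitions from the paper.

```python
import numpy as np
from collections import Counter

def subsequence_freqs(msa, positions):
    """Frequency of each observed residue combination at the given columns."""
    sub = msa[:, positions]
    counts = Counter(tuple(row.tolist()) for row in sub)
    return {word: c / len(sub) for word, c in counts.items()}

def marginal_agreement(msa_ref, msa_model, positions, top=20):
    """Pearson r between reference and model frequencies of the top reference words."""
    f_ref = subsequence_freqs(msa_ref, positions)
    f_mod = subsequence_freqs(msa_model, positions)
    words = sorted(f_ref, key=f_ref.get, reverse=True)[:top]
    x = np.array([f_ref[w] for w in words])
    y = np.array([f_mod.get(w, 0.0) for w in words])
    return float(np.corrcoef(x, y)[0, 1])

# Toy "natural" MSA: columns 1 and 2 mostly copy column 0, column 3 is independent.
rng = np.random.default_rng(3)
N, q = 5000, 4
base = rng.choice(q, size=N, p=[0.4, 0.3, 0.2, 0.1])

def noisy_copy(col):
    out = col.copy()
    flip = rng.random(N) < 0.1
    out[flip] = rng.integers(0, q, size=flip.sum())
    return out

msa_ref = np.stack([base, noisy_copy(base), noisy_copy(base),
                    rng.integers(0, q, size=N)], axis=1)

# Stand-in for a site-independent model: shuffle each column separately, which
# preserves single-site frequencies but destroys covariation.
msa_indep = np.stack([rng.permutation(msa_ref[:, k]) for k in range(msa_ref.shape[1])],
                     axis=1)

r_self = marginal_agreement(msa_ref, msa_ref, positions=[0, 1, 2])
r_indep = marginal_agreement(msa_ref, msa_indep, positions=[0, 1, 2])
print(f"reference vs itself: r = {r_self:.3f}")
print(f"reference vs site-independent shuffle: r = {r_indep:.3f}")
```

A model that captures covariation should score close to the self-comparison, while the column-shuffled MSA shows how much agreement is lost when only single-site frequencies are preserved.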
Protein kinases are molecular machines with rich sequence variation that distinguishes the two main evolutionary branches – tyrosine kinases (TKs) from serine/threonine kinases (STKs). Using a sequence co-variation Potts statistical energy model we previously concluded that TK catalytic domains are more likely than STKs to adopt an inactive conformation with the activation loop in an autoinhibitory folded conformation, due to intrinsic sequence effects. Here we investigate the structural basis for this phenomenon by integrating the sequence-based model with structure-based molecular dynamics (MD) to determine the effects of mutations on the free energy difference between active and inactive conformations, using a thermodynamic cycle involving many (n = 108) protein-mutation free energy perturbation (FEP) simulations in the active and inactive conformations. The sequence and structure-based results are consistent and support the hypothesis that the inactive conformation, DFG-out Activation Loop Folded, is a functional regulatory state that has been stabilized in TKs relative to STKs over the course of their evolution via the accumulation of residue substitutions in the activation loop and catalytic loop that facilitate distinct substrate binding modes in trans and additional modes of regulation in cis for TKs. In this study, the authors identify a mechanism for the distinct conformational preferences of tyrosine kinases vs serine/threonine kinases and suggest that the evolution of tyrosine kinase function can explain these conformational differences.
The weighted histogram analysis method (WHAM) is routinely used for computing free energies and expectations from multiple ensembles. Existing derivations of WHAM require observations to be discretized into a finite number of bins. Yet, WHAM formulas seem to hold even if the bin sizes are made arbitrarily small. The purpose of this article is to demonstrate both the validity and value of the multi-state Bennett acceptance ratio (MBAR) method seen as a binless extension of WHAM. We discuss two statistical arguments to derive the MBAR equations, in parallel to the self-consistency and maximum likelihood derivations already known for WHAM. We show that the binless method, like WHAM, can be used not only to estimate free energies and equilibrium expectations, but also to estimate equilibrium distributions. We also provide a number of useful results from the statistical literature, including the determination of MBAR estimators by minimization of a convex function. This leads to an approach to the computation of MBAR free energies by optimization algorithms, which can be more effective than existing algorithms. The advantages of MBAR are illustrated numerically for the calculation of absolute protein-ligand binding free energies by alchemical transformations with and without soft-core potentials. We show that binless statistical analysis can accurately treat sparsely distributed interaction energy samples as obtained from unmodified interaction potentials that cannot be properly analyzed using standard binning methods. This suggests that binless multi-state analysis of binding free energy simulations with unmodified potentials offers a straightforward alternative to the use of soft-core potentials for these alchemical transformations.
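The self-consistency route to the MBAR free energies can be sketched as a fixed-point iteration on the equations f_i = -ln sum_n [ exp(-u_i(x_n)) / sum_k N_k exp(f_k - u_k(x_n)) ], where u_k is the reduced energy of state k and the sum over n runs over the pooled samples. The code below is a minimal NumPy sketch on a toy problem with two harmonic states; the function names and the toy system are assumptions, and it uses the plain iteration rather than the convex-minimization approach the article discusses.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)),
                      axis=axis)

def mbar_free_energies(u_kn, N_k, tol=1e-10, max_iter=10000):
    """Solve the MBAR equations by self-consistent iteration.

    u_kn[k, n] is the reduced energy of pooled sample n evaluated in state k;
    N_k[k] is the number of samples drawn from state k. Returns f with f[0] = 0.
    """
    K, N = u_kn.shape
    f = np.zeros(K)
    log_N = np.log(N_k)
    for _ in range(max_iter):
        # log of the mixture denominator sum_k N_k exp(f_k - u_k(x_n)) per sample
        log_den = logsumexp(log_N[:, None] + f[:, None] - u_kn, axis=0)
        f_new = -logsumexp(-u_kn - log_den[None, :], axis=1)
        f_new -= f_new[0]
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f

# Toy system: two 1D harmonic wells with reduced energies u_0 = x^2/2 and u_1 = 2 x^2.
# Their partition functions differ by a factor of 2, so f_1 - f_0 = ln 2 analytically.
rng = np.random.default_rng(4)
N_k = np.array([5000, 5000])
x = np.concatenate([rng.normal(0.0, 1.0, N_k[0]),    # samples from state 0
                    rng.normal(0.0, 0.5, N_k[1])])   # samples from state 1
u_kn = np.stack([0.5 * x ** 2, 2.0 * x ** 2])

f = mbar_free_energies(u_kn, N_k)
print(f"estimated f_1 - f_0 = {f[1]:.4f}, exact = {np.log(2.0):.4f}")
```

No binning of x is involved at any point: every pooled sample contributes through its reduced energies in all states, which is the sense in which the estimator is binless.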