The electronic Schrödinger equation can only be solved analytically for the hydrogen atom, and the numerically exact full configuration-interaction method is exponentially expensive in the number of electrons. Quantum Monte Carlo methods are a possible way out: they scale well for large molecules, they can be parallelized, and their accuracy has so far been limited only by the flexibility of the wavefunction ansatz used. Here we propose PauliNet, a deep-learning wavefunction ansatz that achieves nearly exact solutions of the electronic Schrödinger equation for molecules with up to 30 electrons. PauliNet has a multireference Hartree-Fock solution built in as a baseline, incorporates the physics of valid wavefunctions and is trained using variational quantum Monte Carlo. PauliNet outperforms previous state-of-the-art variational ansatzes for atoms, diatomic molecules and a strongly correlated linear H10, and matches the accuracy of highly specialized quantum chemistry methods on the transition-state energy of cyclobutadiene, while being computationally efficient.
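To make the variational quantum Monte Carlo idea concrete, the following minimal sketch treats the one case the abstract calls analytically solvable, the hydrogen atom. It is not PauliNet: the trial wavefunction psi(r) = exp(-alpha * r), the function names, the step size and the sample count are all illustrative assumptions. The energy is estimated by averaging the local energy over Metropolis samples of |psi|^2.

```python
import numpy as np


def local_energy(r, alpha):
    # For the assumed trial function psi(r) = exp(-alpha * r), in atomic units:
    # E_L = (H psi) / psi = -alpha^2 / 2 + (alpha - 1) / r
    return -0.5 * alpha**2 + (alpha - 1.0) / r


def vmc_energy(alpha, n_steps=20000, step=0.5, seed=0):
    """Estimate <E> for psi = exp(-alpha * r) by Metropolis sampling of |psi|^2."""
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.0, 0.0])  # electron position, arbitrary start
    energies = np.empty(n_steps)
    for i in range(n_steps):
        proposal = x + step * rng.normal(size=3)
        r_old, r_new = np.linalg.norm(x), np.linalg.norm(proposal)
        # Acceptance ratio |psi(new)|^2 / |psi(old)|^2 for a symmetric proposal
        if rng.random() < np.exp(-2.0 * alpha * (r_new - r_old)):
            x = proposal
        energies[i] = local_energy(np.linalg.norm(x), alpha)
    # Discard the first 10% of samples as burn-in (an arbitrary choice)
    return float(energies[n_steps // 10:].mean())


if __name__ == "__main__":
    for alpha in (0.8, 1.0, 1.2):
        print(alpha, vmc_energy(alpha))
```

At alpha = 1 the trial function is the exact ground state, so the local energy is constant and the estimate has zero variance. For other values of alpha the estimate lies above the exact energy, up to sampling noise, which is the variational principle that training exploits.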
Inference, prediction, and control of complex dynamical systems from time series are important in many areas, including financial markets, power grid management, climate and weather modeling, and molecular dynamics. The analysis of such highly nonlinear dynamical systems is facilitated by the fact that we can often find a (generally nonlinear) transformation of the system coordinates to features in which the dynamics can be excellently approximated by a linear Markovian model. Moreover, the large number of system variables often change collectively on large time- and length-scales, facilitating a low-dimensional analysis in feature space. In this paper, we introduce a variational approach for Markov processes (VAMP) that allows us to find optimal feature mappings and optimal Markovian models of the dynamics from given time series data. The key insight is that the best linear model can be obtained from the top singular components of the Koopman operator. This leads to the definition of a family of score functions called VAMP-r, which can be calculated from data and can be employed to optimize a Markovian model. In addition, based on the relationship between the variational scores and approximation errors of Koopman operators, we propose a new VAMP-E score, which can be applied to cross-validation for hyper-parameter optimization and model selection in VAMP. VAMP is valid for both reversible and nonreversible processes and for stationary and nonstationary processes or realizations.
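As a rough illustration of how such a score can be computed from data, the sketch below estimates a VAMP-r score as the sum of the r-th powers of the singular values of the whitened time-lagged covariance matrix. The function names, the mean removal (which drops the trivial constant singular function) and the eigenvalue cutoff `eps` are assumptions of this example and are not taken from the paper.

```python
import numpy as np


def inv_sqrt(c, eps=1e-10):
    """Inverse square root of a symmetric PSD matrix, truncating tiny eigenvalues."""
    w, v = np.linalg.eigh(c)
    keep = w > eps
    return (v[:, keep] * w[keep] ** -0.5) @ v[:, keep].T


def vamp_score(x, lag, r=2):
    """VAMP-r score of the features x (shape: n_frames x n_features) at a given lag."""
    x0, xt = x[:-lag], x[lag:]
    x0 = x0 - x0.mean(axis=0)  # mean-free instantaneous data
    xt = xt - xt.mean(axis=0)  # mean-free time-lagged data
    n = len(x0)
    c00 = x0.T @ x0 / n  # instantaneous covariance
    ctt = xt.T @ xt / n  # covariance of the time-lagged data
    c0t = x0.T @ xt / n  # time-lagged cross-covariance
    # Singular values of the whitened Koopman matrix approximation
    k = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)
    s = np.linalg.svd(k, compute_uv=False)
    return float(np.sum(s ** r))
```

Calling `vamp_score(features, lag=10)` on two candidate feature sets and comparing the values gives a simple ranking. Scores should be compared at the same lag time and, for model selection, evaluated on held-out data.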
Characterizing macromolecular kinetics from molecular dynamics (MD) simulations requires a distance metric that can distinguish slowly interconverting states. Here, we build upon diffusion map theory and define a kinetic distance metric for irreducible Markov processes that quantifies how slowly molecular conformations interconvert. The kinetic distance can be computed given a model that approximates the eigenvalues and eigenvectors (reaction coordinates) of the MD Markov operator. Here, we employ the time-lagged independent component analysis (TICA). The TICA components can be scaled to provide a kinetic map in which the Euclidean distance corresponds to the kinetic distance. As a result, the question of how many TICA dimensions should be kept in a dimensionality reduction approach becomes obsolete, and one fewer parameter needs to be specified in the kinetic model construction. We demonstrate the approach using TICA and Markov state model (MSM) analyses for illustrative models, protein conformation dynamics in bovine pancreatic trypsin inhibitor and protein-inhibitor association in trypsin and benzamidine. We find that the total kinetic variance (TKV) is an excellent indicator of model quality and can be used to rank different input feature sets.
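The scaling step can be sketched in a few lines of NumPy. The example below is an assumption-laden illustration rather than the authors' implementation: it computes TICA with a symmetrized covariance estimator, normalizes each component to unit variance, and multiplies component i by its eigenvalue, so that squared Euclidean distances in the output are eigenvalue-weighted sums of squared coordinate differences. The function name and the cutoff `eps` are hypothetical.

```python
import numpy as np


def kinetic_map(x, lag, eps=1e-10):
    """Return eigenvalue-scaled TICA coordinates and the TICA eigenvalues."""
    x0, xt = x[:-lag], x[lag:]
    mean = 0.5 * (x0.mean(axis=0) + xt.mean(axis=0))
    x0, xt = x0 - mean, xt - mean
    n = len(x0)
    # Symmetrized estimators (this assumes reversible, stationary dynamics)
    c0 = (x0.T @ x0 + xt.T @ xt) / (2 * n)
    ct = (x0.T @ xt + xt.T @ x0) / (2 * n)
    # Whitening transform built from the instantaneous covariance
    w, v = np.linalg.eigh(c0)
    keep = w > eps
    whiten = v[:, keep] / np.sqrt(w[keep])
    # Eigen-decomposition of the whitened time-lagged covariance
    lam, u = np.linalg.eigh(whiten.T @ ct @ whiten)
    order = np.argsort(lam)[::-1]  # slowest process first
    lam, u = lam[order], u[:, order]
    tica = (x - mean) @ whiten @ u  # unit-variance TICA components
    return tica * lam, lam  # scale component i by eigenvalue i
```

Because fast components are shrunk toward zero automatically, all of them can be kept. Under this scaling, `np.sum(lam**2)` is the sum of the coordinate variances, which is how this sketch would estimate the total kinetic variance.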
Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations in materials science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms for dimensionality reduction, density estimation, clustering, and kinetic modeling. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.
Atomistic or ab initio molecular dynamics simulations are widely used to predict thermodynamics and kinetics and relate them to molecular structure. A common approach to go beyond the time- and length-scales accessible with such computationally expensive simulations is the definition of coarse-grained molecular models. Existing coarse-graining approaches define an effective interaction potential to match defined properties of high-resolution models or experimental data. In this paper, we reformulate coarse-graining as a supervised machine learning problem. We use statistical learning theory to decompose the coarse-graining error and cross-validation to select and compare the performance of different models. We introduce CGnets, a deep learning approach that learns coarse-grained free energy functions and can be trained by a force-matching scheme. CGnets maintain all physically relevant invariances and allow one to incorporate prior physics knowledge to avoid sampling of unphysical structures. We show that CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface. Thus, CGnets are able to capture multibody terms that emerge from the dimensionality reduction.
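Force matching can be illustrated without a neural network. The sketch below is a deliberately simplified stand-in for CGnets: it assumes a one-dimensional coarse-grained coordinate and a hypothetical potential U(x) = t1 * x^2 + t2 * x^4 that is linear in its parameters, so minimizing the mean squared difference between predicted and observed forces reduces to ordinary least squares.

```python
import numpy as np


def force_matching_fit(x, forces):
    """Fit U(x) = t1 * x**2 + t2 * x**4 by matching -dU/dx to the observed forces."""
    # The predicted force is -dU/dx = -(2 * t1 * x + 4 * t2 * x**3), linear in (t1, t2)
    design = -np.stack([2.0 * x, 4.0 * x**3], axis=1)
    theta, *_ = np.linalg.lstsq(design, forces, rcond=None)
    # Force-matching loss: mean squared deviation of predicted from observed forces
    loss = float(np.mean((design @ theta - forces) ** 2))
    return theta, loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-2.0, 2.0, size=5000)
    # Synthetic "instantaneous" forces: mean force of a double well plus noise
    f = -(2.0 * -2.0 * x + 4.0 * 1.0 * x**3) + rng.normal(size=x.size)
    print(force_matching_fit(x, f))
```

In this synthetic setting the loss does not go to zero even for the correct parameters. The remainder is the variance of the force fluctuations, which no coarse-grained model can explain. A deep model would replace the polynomial with a network and obtain the forces by automatic differentiation.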
Biomembranes are two-dimensional assemblies of phospholipids that are only a few nanometres thick, but form micrometre-sized structures vital to cellular function. Explicit molecular modelling of biologically relevant membrane systems is computationally expensive due to the large number of solvent particles and slow membrane kinetics. Coarse-grained solvent-free membrane models offer efficient sampling but sacrifice realistic kinetics, thereby limiting the ability to predict pathways and mechanisms of membrane processes. Here, we present a framework for integrating coarse-grained membrane models with continuum-based hydrodynamics. This framework facilitates efficient simulation of large biomembrane systems with large timesteps, while achieving realistic equilibrium and non-equilibrium kinetics. It helps to bridge between the nanometre/nanosecond spatiotemporal resolutions of coarse-grained models and biologically relevant time- and length-scales. As a demonstration, we investigate fluctuations of red blood cells, with varying cytoplasmic viscosities, in 150-millisecond-long trajectories, and compare kinetic properties against single-cell experimental observations.
Recent advances in molecular simulations have allowed scientists to investigate slower biological processes than ever before. Together with these advances came an explosion of data that has transformed a traditionally computing-bound problem into a data-bound one. Here, we present HTMD, a programmable, extensible platform written in Python that aims to solve the data generation and analysis problem as well as increase reproducibility by providing a complete workspace for simulation-based discovery. So far, HTMD includes system building for CHARMM and AMBER force fields, projection methods, clustering, molecular simulation production, adaptive sampling, an Amazon cloud interface, Markov state models, and visualization. As a result, a single, short HTMD script can lead from a PDB structure to useful quantities such as relaxation time scales, equilibrium populations, metastable conformations, and kinetic rates. In this paper, we focus on the adaptive sampling and Markov state modeling features.
Markov (state) models (MSMs) and related models of molecular kinetics have recently received a surge of interest as they can systematically reconcile simulation data from either a few long or many short simulations and allow us to analyze the essential metastable structures, thermodynamics, and kinetics of the molecular system under investigation. However, the estimation, validation, and analysis of such models are far from trivial and involve sophisticated and often numerically sensitive methods. In this work we present the open-source Python package PyEMMA (http://pyemma.org) that provides accurate and efficient algorithms for kinetic model construction. PyEMMA can read all common molecular dynamics data formats, helps in the selection of input features, provides easy access to dimension reduction algorithms such as principal component analysis (PCA) and time-lagged independent component analysis (TICA) and clustering algorithms such as k-means, and contains estimators for MSMs, hidden Markov models, and several other models. Systematic model validation and error calculation methods are provided. PyEMMA offers a wealth of analysis functions such that the user can conveniently compute molecular observables of interest. We have derived a systematic and accurate way to coarse-grain MSMs to few states and to illustrate the structures of the metastable states of the system. Plotting functions to produce a manuscript-ready presentation of the results are available. In this work, we demonstrate the features of the software and show new methodological concepts and results produced by PyEMMA.
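For orientation, the core estimation step behind an MSM can be written in plain NumPy. This sketch does not use or reproduce the PyEMMA interface. It assumes a single discrete trajectory in which every state is visited and left at least once, uses the simple row-normalized (nonreversible) estimator, and has hypothetical function names.

```python
import numpy as np


def estimate_msm(dtraj, n_states, lag):
    """Row-normalized transition matrix from a discrete trajectory at a given lag."""
    dtraj = np.asarray(dtraj, dtype=int)
    counts = np.zeros((n_states, n_states))
    # Count transitions i -> j between frames t and t + lag (sliding window)
    np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1.0)
    # Assumes every row has at least one count, otherwise this divides by zero
    return counts / counts.sum(axis=1, keepdims=True)


def implied_timescales(transition_matrix, lag):
    """Relaxation timescales t_i = -lag / ln|lambda_i| for the nonstationary eigenvalues."""
    ev = np.sort(np.abs(np.linalg.eigvals(transition_matrix)))[::-1]
    ev = ev[1:]  # drop the stationary eigenvalue (equal to 1)
    ev = ev[(ev > 0) & (ev < 1)]  # keep eigenvalues with a finite timescale
    return -lag / np.log(ev)
```

A typical validation step is to call `implied_timescales(estimate_msm(dtraj, n, lag), lag)` for several lag times and check that the timescales level off. The timescales are in units of trajectory frames.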
Identification of the main reaction coordinates and building of kinetic models of macromolecular systems require a way to measure distances between molecular configurations that can distinguish slowly interconverting states. Here we define the commute distance that can be shown to be closely related to the expected commute time needed to go from one configuration to the other, and back. A practical merit of this quantity is that it can be easily approximated from molecular dynamics data sets when an approximation of the Markov operator eigenfunctions is available, which can be achieved by the variational approach to approximate eigenfunctions of Markov operators, also called variational approach of conformation dynamics (VAC) or the time-lagged independent component analysis (TICA). The VAC or TICA components can be scaled such that a so-called commute map is obtained in which Euclidean distance corresponds to the commute distance, and thus kinetic models such as Markov state models can be computed based on Euclidean operations, such as standard clustering. In addition, the distance metric gives rise to a quantity we call total kinetic content, which is an excellent score to rank input feature sets and kinetic model quality.
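A hedged sketch of the scaling step follows. It assumes that unit-variance VAC/TICA coordinates and their eigenvalues at lag time `lag` are already available, converts each eigenvalue to a relaxation timescale t_i = -lag / ln(lambda_i), and scales component i by sqrt(t_i / 2). The function name is hypothetical, and components whose eigenvalue lies outside (0, 1) are dropped because they have no finite positive timescale.

```python
import numpy as np


def commute_map(tica_coords, eigenvalues, lag):
    """Scale unit-variance TICA coordinates by sqrt(t_i / 2), t_i = -lag / ln(lambda_i)."""
    lam = np.asarray(eigenvalues, dtype=float)
    valid = (lam > 0.0) & (lam < 1.0)  # eigenvalues with a finite positive timescale
    timescales = -lag / np.log(lam[valid])
    scaled = np.asarray(tica_coords)[:, valid] * np.sqrt(timescales / 2.0)
    return scaled, timescales
```

The scaled coordinates can be passed directly to a standard Euclidean clustering method such as k-means. Because the timescales enter explicitly, the result depends less on the chosen lag time than a scaling by the raw eigenvalues would.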