Predicting enzyme activity is one of the main challenges in catalysis. Computer-aided methods make it possible to simulate the reaction mechanism at the atomic level. However, these methods are usually too expensive to be used on a large scale, as required in protein engineering campaigns. To alleviate this situation, machine learning methods can help in the generation of predictive-decision models. Herein, we test different regression algorithms for the prediction of the reaction energy barrier of the rate-limiting step of the hydrolysis of mono-(2-hydroxyethyl)terephthalic acid by the MHETase of Ideonella sakaiensis. As a training data set, we use steered quantum mechanics/molecular mechanics (QM/MM) molecular dynamics (MD) simulation snapshots and their corresponding pulling work values. We explored three algorithms together with three chemical representations. As an outcome, our trained models are able to predict pulling works along the steered QM/MM MD simulations with a mean absolute error below 3 kcal mol⁻¹ and a score value above 0.90. Predicting the energy maximum from a single geometry is more challenging. Whereas using the initial snapshot of the QM/MM MD trajectory as the input geometry yields a very poor prediction of the reaction energy barrier, using an intermediate snapshot of the same trajectory brings the score value above 0.40 with a low mean absolute error (ca. 3 kcal mol⁻¹). Altogether, this work addresses some initial challenges toward the final goal of an efficient workflow for the semiautomatic prediction of enzyme-catalyzed energy barriers and catalytic efficiencies.
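The two figures of merit quoted above can be made concrete with a minimal sketch. It assumes that the "score value" is the coefficient of determination (R²), which the text does not state explicitly. The function names and all numbers are hypothetical and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical pulling works in kcal/mol (illustrative values only).
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

mae = mean_absolute_error(reference, predicted)
r2 = r2_score(reference, predicted)
print(f"MAE = {mae:.2f} kcal/mol, R2 = {r2:.3f}")
```

A low mean absolute error together with a low score, as reported for the single-geometry prediction, is possible when the reference values span a narrow range, because R² compares the residuals with the spread of the reference data.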
Abstract
Motivation
Density Peaks is a widely used clustering algorithm that has previously been applied to Molecular Dynamics (MD) simulations. Its conception of cluster centers as elements displaying both a high density of neighbors and a large distance to other elements of high density particularly fits the nature of a geometrically converged MD simulation. Despite its theoretical convenience, implementations of Density Peaks carry a quadratic memory complexity that only permits the analysis of relatively short trajectories.
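The two quantities behind this definition of a cluster center can be sketched as follows. This is a naive illustration that stores the full distance matrix, which is exactly the quadratic memory cost mentioned above; it is not the DP+ or RCDPeaks implementation. Euclidean distance between 2D points stands in for the RMSD between frames, and the hard-cutoff neighbor count, the function name, the cutoff and the data are all assumptions made for the example.

```python
import math


def density_peaks_scores(points, cutoff):
    """Return (rho, delta): local density and distance to the nearest denser point."""
    n = len(points)
    # Full pairwise matrix: O(n^2) memory, the bottleneck for long trajectories.
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # rho: number of neighbors closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff) for i in range(n)]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest point has no denser neighbor: use its largest distance instead.
        delta.append(min(denser) if denser else max(dist[i]))
    return rho, delta


# Two hypothetical groups of "frames" in a 2D feature space.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0),
          (10.0, 10.0), (10.0, 10.5), (10.5, 10.0)]
rho, delta = density_peaks_scores(points, cutoff=0.6)

# Rank candidates by rho * delta and keep the two best as centers (illustrative choice).
gamma = [r * d for r, d in zip(rho, delta)]
centers = sorted(sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2])
print(rho, centers)
```

In this toy data set, the points at the middle of each group combine a high density with a large distance to any denser point, so they stand out from the rest when ranked by the product of both values.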
Results
Here, we describe DP+, a novel exact implementation of Density Peaks that drastically reduces RAM consumption compared with the few available alternatives designed for MD. Based on DP+, we developed RCDPeaks, a refined variant of the original Density Peaks algorithm. Through the use of DP+, RCDPeaks was able to cluster a one-million-frame trajectory using less than 4.5 GB of RAM, a task that would have taken more than 2 TB and about 3× more time with the fastest and least memory-hungry alternative currently available. Other key features of RCDPeaks include the automatic selection of parameters, the screening of center candidates and the geometrical refining of returned clusters.
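To give a sense of scale for quadratic memory growth, the following back-of-the-envelope sketch estimates the size of a pairwise distance matrix. The storage layouts (a condensed upper triangle of 4-byte floats, or a full square matrix of 8-byte floats) are assumptions chosen only for illustration; they are not the accounting behind the figures reported above, and the function name is hypothetical.

```python
def pairwise_matrix_bytes(n_frames, bytes_per_value=4, condensed=True):
    """Bytes needed to store all pairwise distances between n_frames frames."""
    if condensed:
        # Upper triangle only, without the diagonal.
        n_values = n_frames * (n_frames - 1) // 2
    else:
        # Full square matrix.
        n_values = n_frames * n_frames
    return n_values * bytes_per_value


one_million = 1_000_000
condensed_float32 = pairwise_matrix_bytes(one_million)
full_float64 = pairwise_matrix_bytes(one_million, bytes_per_value=8, condensed=False)

# Report in decimal terabytes (1 TB = 1e12 bytes).
print(f"condensed float32: {condensed_float32 / 1e12:.2f} TB")
print(f"full float64:      {full_float64 / 1e12:.2f} TB")
```

Under these assumptions, one million frames already require terabytes of storage, which is why avoiding an explicit pairwise matrix matters for long trajectories.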
Availability and implementation
The source code and documentation of RCDPeaks are free and publicly available on GitHub (https://github.com/LQCT/RCDPeaks.git).
Supplementary information
Supplementary data are available at Bioinformatics online.
The term clustering designates a comprehensive family of unsupervised learning methods that group similar elements into sets called clusters. Geometrical clustering of molecular dynamics (MD) trajectories is a well-established analysis to gain insights into the conformational behavior of simulated systems. However, popular variants collapse when processing relatively long trajectories because of their quadratic memory or time complexity. From the arsenal of clustering algorithms, HDBSCAN stands out as a hierarchical density-based alternative that robustly differentiates closely related elements from noise. Although a very efficient implementation of this algorithm is available for users with programming skills (HDBSCAN*), it cannot treat long trajectories under the de facto molecular similarity metric, RMSD.
Here, we propose MDSCAN, HDBSCAN-inspired software specifically conceived for non-programmer users to perform memory-efficient RMSD-based clustering of long MD trajectories. Methodological improvements over the original version include the encoding of trajectories as a particular class of vantage-point tree (decreasing time complexity), and a dual-heap approach to construct a quasi-minimum spanning tree (reducing memory complexity). MDSCAN was able to process a trajectory of 1 million frames using the RMSD metric in about 21 h with <8 GB of RAM, a task that would have taken a similar time but more than 32 TB of RAM with the generally used accelerated HDBSCAN* implementation.
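The idea behind a vantage-point tree can be illustrated with a generic textbook version; the particular class of tree used by MDSCAN is not described here, so this sketch should not be read as its implementation. It assumes a metric that obeys the triangle inequality, uses Euclidean distance between 2D points as a stand-in for the RMSD between frames, and all names and data are hypothetical.

```python
import math
import random


class VPNode:
    """A vantage point plus the median radius splitting the remaining points."""

    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside    # subtree with points at distance <= radius
        self.outside = outside  # subtree with points at distance > radius


def build_vp_tree(points, metric):
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    dists = [metric(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, metric), build_vp_tree(outside, metric))


def nearest(node, query, metric, best=(math.inf, None)):
    """Return (distance, point) of the nearest neighbor, pruning with the triangle inequality."""
    if node is None:
        return best
    d = metric(query, node.point)
    if d < best[0]:
        best = (d, node.point)
    if d <= node.radius:
        best = nearest(node.inside, query, metric, best)
        # Visit the outer shell only if the current search ball crosses the boundary.
        if d + best[0] > node.radius:
            best = nearest(node.outside, query, metric, best)
    else:
        best = nearest(node.outside, query, metric, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, metric, best)
    return best


random.seed(0)
frames = [(random.random(), random.random()) for _ in range(200)]
tree = build_vp_tree(frames, math.dist)
best_dist, best_point = nearest(tree, (0.5, 0.5), math.dist)
brute_dist = min(math.dist((0.5, 0.5), p) for p in frames)
print(abs(best_dist - brute_dist) < 1e-12)
```

The tree stores only the points and one radius per node, so neighbor queries can skip whole subtrees without ever materializing the pairwise distance matrix.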
The source code and documentation of MDSCAN are free and publicly available on GitHub (https://github.com/LQCT/MDScan.git) and as a PyPI package (https://pypi.org/project/mdscan/).
Supplementary data are available at Bioinformatics online.
Abstract
Motivation
Classical Molecular Dynamics (MD) is a standard computational approach to model time-dependent processes at the atomic level. The inherent sparsity of increasingly large trajectories demands clustering algorithms to reduce the complexity of other post-simulation analyses. Among the vast number of available clustering methods, the Quality Threshold (QT) variant is an appealing one. It guarantees that all members of a particular cluster will maintain a collective similarity established by a user-defined threshold. Unfortunately, its high computational cost for processing big data limits its application in the molecular simulation field.
Results
In this work, we propose a methodological parallel between QT clustering and a well-known problem in the field of Graph Theory, the Maximum Clique Problem. Molecular trajectories are represented as graphs whose nodes designate conformations, while unweighted edges indicate mutual similarity between nodes. The use of a binary-encoded RMSD matrix, coupled with bitwise operations to extract clusters, yields a very affordable algorithm compared to the few implementations of QT for MD available in the literature. Our alternative provides results in good agreement with the exact one while strictly preserving the collective similarity of clusters.
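The graph encoding described above can be sketched with Python integers used as bit vectors. This is a simple greedy clique heuristic written for illustration; it is not the BitQT algorithm, whose details are not given here. Absolute differences between 1D values stand in for pairwise RMSD, and the function names, threshold and data are hypothetical.

```python
def build_bit_adjacency(dist, threshold):
    """Row i is an integer whose bit j is set when frames i and j are similar."""
    n = len(dist)
    adjacency = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and dist[i][j] <= threshold:
                adjacency[i] |= 1 << j
    return adjacency


def popcount(bits):
    return bin(bits).count("1")


def greedy_clique(adjacency, available):
    """Grow one clique among the nodes whose bits are set in `available`."""
    nodes = [i for i in range(len(adjacency)) if (available >> i) & 1]
    seed = max(nodes, key=lambda i: popcount(adjacency[i] & available))
    clique = [seed]
    candidates = adjacency[seed] & available
    while candidates:
        members = [i for i in range(len(adjacency)) if (candidates >> i) & 1]
        pick = max(members, key=lambda i: popcount(adjacency[i] & candidates))
        clique.append(pick)
        # Bitwise AND keeps only nodes similar to every member chosen so far.
        candidates &= adjacency[pick]
    return clique


def cluster(dist, threshold):
    adjacency = build_bit_adjacency(dist, threshold)
    available = (1 << len(dist)) - 1
    clusters = []
    while available:
        clique = greedy_clique(adjacency, available)
        clusters.append(sorted(clique))
        for node in clique:
            available &= ~(1 << node)  # remove clustered frames
    return clusters


# Hypothetical 1D "conformations"; |a - b| plays the role of the RMSD.
values = [0.0, 0.1, 0.2, 1.0, 1.1, 5.0]
dist = [[abs(a - b) for b in values] for a in values]
clusters = cluster(dist, threshold=0.25)
print(clusters)
```

Because a node joins a clique only if its bit survives the AND with every previous member's row, each pair inside a cluster respects the threshold, which is the collective-similarity guarantee that QT clustering requires.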
Availability and implementation
The source code and documentation of BitQT are free and publicly available on GitHub (https://github.com/LQCT/BitQT.git) and ReadTheDocs (https://bitqt.readthedocs.io/en/latest/), respectively.
Supplementary information
Supplementary data are available at Bioinformatics online.