Machine learning (ML) models, such as artificial neural networks, have emerged as a complement to high-throughput screening, enabling characterization of new compounds in seconds instead of hours. The promise of ML models to enable large-scale chemical space exploration can only be realized if it is straightforward to identify when molecules and materials are outside the model's domain of applicability. Established uncertainty metrics for neural network models are either costly to obtain (e.g., ensemble models) or rely on feature engineering (e.g., feature space distances), and each has limitations in estimating prediction errors for chemical space exploration. We introduce the distance to available data in the latent space of a neural network ML model as a low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. The calibrated performance of this approach exceeds widely used uncertainty metrics and is readily applied to models of increasing complexity at no additional cost. Tightening latent distance cutoffs systematically drives down predicted model errors below training errors, thus enabling predictive error control in chemical discovery or identification of useful data points for active learning.
A predictive approach for driving down machine learning model errors is introduced and demonstrated across discovery for inorganic and organic chemistry.
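The latent-distance idea described in this abstract can be illustrated with a minimal sketch. It assumes the latent vectors (for example, the activations of a network's last hidden layer) have already been extracted as arrays. The function names, the choice of Euclidean distance, and the averaging over `k` nearest training points are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def latent_distance(train_latent, query_latent, k=5):
    """Mean Euclidean distance from each query point to its k nearest
    training points in latent space (hypothetical helper)."""
    train = np.asarray(train_latent, dtype=float)
    query = np.asarray(query_latent, dtype=float)
    # Pairwise distances with shape (n_query, n_train).
    diffs = query[:, None, :] - train[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Never ask for more neighbors than there are training points.
    k = min(k, train.shape[0])
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

def within_cutoff(distances, cutoff):
    """Boolean mask of predictions whose latent distance is at or below the cutoff."""
    return np.asarray(distances) <= cutoff

# Toy data: three training points and two queries, one far from all training data.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [[0.0, 0.0], [10.0, 10.0]]
d = latent_distance(train, query, k=1)
mask = within_cutoff(d, cutoff=1.0)
```

Under these assumptions, lowering `cutoff` keeps only predictions close to the training data in latent space, which is the mechanism the abstract credits with driving down model errors.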
Transition-metal complexes are attractive targets for the design of catalysts and functional materials. The behavior of the metal–organic bond, while very tunable for achieving target properties, is challenging to predict and necessitates searching a wide and complex space to identify needles in haystacks for target applications. This review will focus on the techniques that make high-throughput search of transition-metal chemical space feasible for the discovery of complexes with desirable properties. The review will cover the development, promise, and limitations of “traditional” computational chemistry (i.e., force field, semiempirical, and density functional theory methods) as it pertains to data generation for inorganic molecular discovery. The review will also discuss the opportunities and limitations in leveraging experimental data sources. We will focus on how advances in statistical modeling, artificial intelligence, multiobjective optimization, and automation accelerate discovery of lead compounds and design rules. The overall objective of this review is to showcase how bringing together advances from diverse areas of computational chemistry and computer science has enabled the rapid uncovering of structure–property relationships in transition-metal chemistry. We aim to highlight how unique considerations in motifs of metal–organic bonding (e.g., variable spin and oxidation state, and bonding strength/nature) set them and their discovery apart from more commonly considered organic molecules. We will also highlight how uncertainty and relative data scarcity in transition-metal chemistry motivate specific developments in machine learning representations, model training, and in computational chemistry. Finally, we will conclude with an outlook of areas of opportunity for the accelerated discovery of transition-metal complexes.
Machine learning the electronic structure of open shell transition metal complexes presents unique challenges, including robust and automated data set generation. Here, we introduce tools that simplify data acquisition from density functional theory (DFT) and validation of trained machine learning models using the molSimplify automatic design (mAD) workflow. We demonstrate this workflow by training and comparing the performance of LASSO, kernel ridge regression (KRR), and artificial neural network (ANN) models using heuristic, topological revised autocorrelation (RAC) descriptors we have recently introduced for machine learning inorganic chemistry. On a series of open shell transition metal complexes, we evaluate set aside test errors of these models for predicting the HOMO level and HOMO–LUMO gap. The best performing models are ANNs, which show 0.15 and 0.25 eV test set mean absolute errors on the HOMO level and HOMO–LUMO gap, respectively. Poor performing KRR models using the full 153-feature RAC set are improved to nearly the same performance as the ANNs when trained on down-selected subsets of 20–30 features. Analysis of the essential descriptors for HOMO level and HOMO–LUMO gap prediction as well as comparison to subsets previously obtained for other properties reveal the paramount importance of nonlocal, steric properties in determining frontier molecular orbital energetics. We demonstrate our model performance on diverse complexes and in the discovery of molecules with target HOMO–LUMO gaps from a large 15,000 molecule design space in minutes rather than days that full DFT evaluation would require.
The accelerated discovery of materials for real world applications requires the achievement of multiple design objectives. The multidimensional nature of the search necessitates exploration of multimillion compound libraries over which even density functional theory (DFT) screening is intractable. Machine learning (e.g., artificial neural network, ANN, or Gaussian process, GP) models for this task are limited by training data availability and predictive uncertainty quantification (UQ). We overcome such limitations by using efficient global optimization (EGO) with the multidimensional expected improvement (EI) criterion. EGO balances exploitation of a trained model with acquisition of new DFT data at the Pareto front, the region of chemical space that contains the optimal trade-off between multiple design criteria. We demonstrate this approach for the simultaneous optimization of redox potential and solubility in candidate M(II)/M(III) redox couples for redox flow batteries from a space of 2.8 M transition metal complexes designed for stability in practical redox flow battery (RFB) applications. We show that a multitask ANN with latent-distance-based UQ surpasses the generalization performance of a GP in this space. With this approach, ANN prediction and EI scoring of the full space are achieved in minutes. Starting from ca. 100 representative points, EGO improves both properties by over 3 standard deviations in only five generations. Analysis of lookahead errors confirms rapid ANN model improvement during the EGO process, achieving suitable accuracy for predictive design in the space of transition metal complexes. The ANN-driven EI approach achieves at least 500-fold acceleration over random search, identifying a Pareto-optimal design in around 5 weeks instead of 50 years.
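The expected improvement criterion mentioned in this abstract can be sketched in its simplest form. The example below is the textbook single-objective, maximization variant under a Gaussian predictive distribution, not the multidimensional EI used in the work. The function name and arguments (`mu`, `sigma`, `best`) are illustrative assumptions.

```python
import math

def expected_improvement(mu, sigma, best):
    """Single-objective expected improvement for maximization.

    mu, sigma: predicted mean and standard deviation for a candidate.
    best: best property value observed so far.
    Simplified sketch; the abstract refers to a multidimensional criterion.
    """
    if sigma <= 0.0:
        # With no predictive uncertainty, EI reduces to the plain improvement.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best) * cdf + sigma * pdf

# Two candidates with the same predicted mean but different uncertainty.
ei_certain = expected_improvement(mu=0.0, sigma=0.1, best=0.0)
ei_uncertain = expected_improvement(mu=0.0, sigma=1.0, best=0.0)
```

In this form, a candidate with larger predictive uncertainty scores higher at equal predicted mean, which is how EI trades off exploitation of the model against acquisition of new data.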
We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal-organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparing to a hand-labeled subset. Machine learning (ML, i.e. artificial neural network) models trained on this data using graph- and pore-geometry-based representations enable prediction of stability on new MOFs with quantified uncertainty. Our web interface, MOFSimplify, provides users access to our curated data and enables them to harness that data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions for community-based active learning for improved MOF stability models.
With a decomposition scheme for the bath correlation function, the hierarchy equation of motion (HEOM) is extended to the zero-temperature sub-Ohmic spin-boson model, providing a numerically accurate prediction of quantum dynamics. As a dynamic approach, the extended HEOM determines the delocalized-localized (DL) phase transition from the extracted rate kernel and the coherent-incoherent dynamic transition from the short-time oscillation. As the bosonic bath passes from the strong to the weak sub-Ohmic regime, a crossover behavior is identified for the critical Kondo parameter of the DL transition, accompanied by the transition from coherent to incoherent dynamics in the localization.
With the increasingly important role of machine learning (ML) models in chemical research, the need to attach a level of confidence to the model predictions naturally arises. Several methods for obtaining uncertainty estimates have been proposed in recent years, but consensus on how to evaluate them has yet to be established, and different studies on uncertainties generally use different metrics to evaluate them. We compare three of the most popular validation metrics (Spearman’s rank correlation coefficient, the negative log likelihood (NLL), and the miscalibration area) to the error-based calibration introduced by Levi et al. (Sensors 2022, 22, 5540). Importantly, metrics such as the NLL and Spearman’s rank correlation coefficient bear little information in themselves. We therefore introduce reference values obtained through errors simulated directly from the uncertainty distribution. The different metrics target different properties and we show how to interpret them, but we generally find the best overall validation to be done based on the error-based calibration plot introduced by Levi et al. Finally, we illustrate the sensitivity of ranking-based methods (e.g. Spearman’s rank correlation coefficient) towards test set design by using the same toy model on different test sets and obtaining vastly different metrics (0.05 vs. 0.65).
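The idea of a simulated reference value for the NLL can be sketched as follows. The sketch assumes Gaussian uncertainties with zero-mean errors. The function names, the number of draws, and the fixed seed are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def gaussian_nll(errors, sigmas):
    """Average Gaussian negative log likelihood of errors given predicted sigmas."""
    errors = np.asarray(errors, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    per_point = 0.5 * np.log(2.0 * np.pi * sigmas**2) + errors**2 / (2.0 * sigmas**2)
    return float(np.mean(per_point))

def simulated_reference_nll(sigmas, n_draws=1000, seed=0):
    """Mean and spread of the NLL when errors are drawn from the predicted
    uncertainties themselves, i.e. the value a perfectly calibrated model
    would be expected to give (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    # Each row is one simulated set of errors, one column per test point.
    sims = rng.normal(0.0, sigmas, size=(n_draws, sigmas.size))
    nlls = np.array([gaussian_nll(row, sigmas) for row in sims])
    return float(nlls.mean()), float(nlls.std())

# Toy case: 200 test points that all have unit predicted uncertainty.
sigmas = np.ones(200)
ref_mean, ref_std = simulated_reference_nll(sigmas)
```

An observed NLL can then be compared against `ref_mean` and `ref_std` instead of being read in isolation, which is the motivation the abstract gives for introducing reference values.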
Although the tailored metal active sites and porous architectures of MOFs hold great promise for engineering challenges ranging from gas separations to catalysis, a lack of understanding of how to improve their stability limits their use in practice. To overcome this limitation, we extract thousands of published reports of the key aspects of MOF stability necessary for their practical application: the ability to withstand high temperatures without degrading and the capacity to be activated by removal of solvent molecules. From nearly 4000 manuscripts, we use natural language processing and image analysis to obtain over 2000 solvent-removal stability measures and 3000 thermal degradation temperatures. We analyze the relationships between stability properties and the chemical and geometric structures in this set to identify limits of prior heuristics derived from smaller sets of MOFs. By training predictive machine learning (ML, i.e., Gaussian process and artificial neural network) models to encode the structure–property relationships with graph- and pore-structure-based representations, we are able to make predictions of stability orders of magnitude faster than conventional physics-based modeling or experiment. Interpretation of important features in ML models provides insights that we use to identify strategies to engineer increased stability into typically unstable 3d-transition-metal-containing MOFs that are frequently targeted for catalytic applications. We expect our approach to accelerate the time to discovery of stable, practical MOF materials for a wide range of applications.
Owing to increasing global demand for carbon neutral and fossil‐free energy systems, extensive research is being conducted on efficient and inexpensive electrocatalysts for catalyzing the kinetically sluggish oxygen reduction reaction (ORR) at the cathode of fuel cells. Platinum (Pt)‐based alloys are considered promising candidates for replacing expensive Pt catalysts. However, the current screening process of Pt‐based alloys is time‐consuming and labor‐intensive, and the descriptor for predicting the activity of Pt‐based catalysts is generally inaccurate. This study proposed a strategy by combining high‐throughput first‐principles calculations and machine learning to explore the descriptor used for screening Pt‐based alloy catalysts with high Pt utilization and low Pt consumption. Among the 77 prescreened candidates, we identified 5 potential candidates for catalyzing ORR with low overpotential. Furthermore, during the second and third rounds of active learning, more Pt‐based alloy ORR candidates are identified based on the relationship between structural features of Pt‐based alloys and their activity. In addition, we highlighted the role of structural features in Pt‐based alloys and found that the difference between the electronegativity of Pt and the heteroatom, the number of valence electrons of the heteroatom, and the ratio of heteroatoms around Pt are the main factors that affect the activity of ORR. More importantly, the combination of those structural features can be used as a structural descriptor for predicting the activity of Pt‐based alloys. We believe the findings of this study will provide new insight for predicting ORR activity and contribute to exploring Pt‐based electrocatalysts with high Pt utilization and low Pt consumption experimentally.
The current screening process of platinum (Pt)‐based alloys is time‐consuming and labor‐intensive, and the descriptor for predicting the activity is generally inaccurate. This study proposed a strategy by combining high‐throughput first‐principles calculations and machine learning (ML) to explore the structural descriptor used for predicting the activity of Pt‐based alloys and screening Pt‐based alloy catalysts with high Pt utilization and low Pt consumption. We believe the results will provide a useful dataset for experimentalists to further scrutinize the predicted oxygen reduction reaction (ORR) activity as well as for data scientists to construct ML models for ORR performance predictions.