Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) modeling and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that models are ranked differently depending on the performance metric(s) used. Here, 25 performance parameters were calculated for each model, and factorial ANOVA was applied to compare the results. The results clearly show differences not only between the applied machine learning algorithms but also between the dataset sizes and, to a lesser extent, the train/test split ratios. The XGBoost algorithm outperformed the others, even in multiclass modeling. The performance parameters reacted differently to changes in sample set size: some were much more sensitive to this factor than others. Moreover, significant differences were detected between train/test split ratios as well, exerting a great effect on the test validation of our models.
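A comparison of this kind can be sketched as follows. This is an illustrative toy setup (not the authors' exact protocol or datasets): a synthetic multiclass problem, one classifier, three candidate split ratios, and a few of the performance metrics such a study would tabulate before an ANOVA step.

```python
# Illustrative sketch: comparing train/test split ratios for a
# multiclass classifier and recording several performance metrics
# per setting. Dataset, model and ratios are placeholder choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Synthetic 3-class dataset standing in for a QSAR classification table
X, y = make_classification(n_samples=500, n_classes=3,
                           n_informative=6, random_state=0)

results = {}
for test_frac in (0.2, 0.3, 0.4):  # split ratios under comparison
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[test_frac] = {
        "accuracy": accuracy_score(y_te, pred),
        "macro_F1": f1_score(y_te, pred, average="macro"),
        "MCC": matthews_corrcoef(y_te, pred),
    }

for frac, metrics in results.items():
    print(frac, metrics)
```

In a full study, each (algorithm, dataset size, split ratio) combination would contribute one such row of metrics, and the resulting table is what a factorial ANOVA can then decompose.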
Background
Cheminformaticians are equipped with a very rich toolbox for molecular similarity calculations. A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of molecular representations. In this work, eight well-known similarity/distance metrics are compared on a large dataset of molecular fingerprints with sum of ranking differences (SRD) and ANOVA analysis. The effects of molecular size, selection methods and data pretreatment methods on the outcome of the comparison are also assessed.
Results
A supplier database (https://mcule.com/) was used as the source of compounds for the similarity calculations in this study. A large number of datasets, each consisting of one hundred compounds, were compiled, molecular fingerprints were generated and similarity values between a randomly chosen reference compound and the rest were calculated for each dataset. Similarity metrics were compared based on their ranking of the compounds within one experiment (one dataset) using sum of ranking differences (SRD), while the results of the entire set of experiments were summarized on box and whisker plots. Finally, the effects of various factors (data pretreatment, molecule size, selection method) were evaluated with analysis of variance (ANOVA).
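The SRD step described above can be condensed into a few lines. The sketch below is a simplified version (variable names are illustrative, and ties are handled naively): each metric's ranking of the compounds is compared to a consensus ranking, and the sum of absolute rank differences is the metric's SRD score.

```python
# Minimal sketch of sum of ranking differences (SRD).
# Lower SRD = ranking closer to the consensus (reference) ranking.
import numpy as np

def srd(data, reference=None):
    """data: (n_objects, n_methods) score matrix; returns one SRD
    value per method (column). The reference ranking defaults to the
    row-wise average, i.e. a data-fusion consensus of all methods."""
    data = np.asarray(data, dtype=float)
    if reference is None:
        reference = data.mean(axis=1)
    # Convert scores to ranks (0 = lowest); ties broken by order
    ref_rank = np.argsort(np.argsort(reference))
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)
    return np.abs(ranks - ref_rank[:, None]).sum(axis=0)

# Toy example: 5 compounds scored by 3 hypothetical similarity metrics
scores = np.array([[0.9, 0.8, 0.7],
                   [0.5, 0.6, 0.9],
                   [0.7, 0.7, 0.2],
                   [0.1, 0.2, 0.4],
                   [0.3, 0.4, 0.1]])
print(srd(scores))
```

The published SRD methodology additionally validates the scores (e.g. by comparison of ranks with random numbers and cross-validation), which this sketch omits.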
Conclusions
This study complements previous efforts to examine and rank various metrics for molecular similarity calculations. Here, however, an entirely general approach was taken to neglect any a priori knowledge on the compounds involved, as well as any bias introduced by examining only one or a few specific scenarios. The Tanimoto index, Dice index, Cosine coefficient and Soergel distance were identified as the best (and in some sense equivalent) metrics for similarity calculations, i.e. these metrics could produce the rankings closest to the composite (average) ranking of the eight metrics. The similarity metrics derived from Euclidean and Manhattan distances are not recommended on their own, although their variability and diversity from the other similarity metrics might be advantageous in certain cases (e.g. for data fusion). Conclusions are also drawn regarding the effects of molecule size, selection method and data pretreatment on the ranking behavior of the studied metrics.
Graphical Abstract
A visual summary of the comparison of similarity metrics with sum of ranking differences (SRD).
Machine learning classification algorithms are widely used for the prediction and classification of molecular properties such as toxicity or biological activity. The prediction of toxic vs. non-toxic molecules is important as an alternative to testing on living animals, which has ethical as well as cost drawbacks. The quality of classification models can be determined with several performance parameters, which often give conflicting results. In this study, we performed a multi-level comparison using different performance metrics and machine learning classification methods. Well-established and standardized protocols for the machine learning tasks were used in each case. The comparison was applied to three datasets (acute and aquatic toxicities), and the robust, yet sensitive, sum of ranking differences (SRD) and analysis of variance (ANOVA) were applied for evaluation. The effect of dataset composition (balanced vs. imbalanced) and 2-class vs. multiclass classification scenarios was also studied. Most of the performance metrics are sensitive to dataset composition, especially in 2-class classification problems. The optimal machine learning algorithm also depends significantly on the composition of the dataset.
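The sensitivity of performance metrics to dataset composition is easy to demonstrate with a toy example (not taken from the paper): on a 90:10 imbalanced 2-class problem, a degenerate classifier that always predicts the majority class scores well on plain accuracy but is exposed by imbalance-aware metrics.

```python
# Toy illustration of metric sensitivity to class imbalance:
# an "always predict the majority class" model on 90:10 data.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             matthews_corrcoef)

y_true = [0] * 90 + [1] * 10   # imbalanced 2-class labels
y_pred = [0] * 100             # constant majority-class predictions

print(accuracy_score(y_true, y_pred))           # high despite no skill
print(balanced_accuracy_score(y_true, y_pred))  # chance level
print(matthews_corrcoef(y_true, y_pred))        # no correlation at all
```

Accuracy comes out at 0.9 here, while balanced accuracy drops to 0.5 and MCC to 0.0, which is the kind of conflict between performance parameters that the SRD/ANOVA comparison above is designed to resolve systematically.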
Background
Interaction fingerprints (IFP) have been repeatedly shown to be valuable tools in virtual screening to identify novel hit compounds that can subsequently be optimized into drug candidates. As a complementary method to ligand docking, IFPs can be applied to quantify the similarity of predicted binding poses to a reference binding pose. For this purpose, a large number of similarity metrics can be applied, and various parameters of the IFPs themselves can be customized. In a large-scale comparison, we have assessed the effect of similarity metrics and IFP configurations on a number of virtual screening scenarios with ten different protein targets and thousands of molecules. In particular, the effects of general interaction definitions (such as Any Contact, Backbone Interaction and Sidechain Interaction), of filtering methods and of the different groups of similarity metrics were studied.
Results
The performances were primarily compared based on AUC values, but we have also used the original similarity data for the comparison of similarity metrics with several statistical tests and the novel, robust sum of ranking differences (SRD) algorithm. With SRD, we can evaluate the consistency (or concordance) of the various similarity metrics to an ideal reference metric, which is provided by data fusion from the existing metrics. Different aspects of IFP configurations and similarity metrics were examined based on SRD values with analysis of variance (ANOVA) tests.
Conclusion
A general approach is provided that can be applied for the reliable interpretation and usage of similarity measures with interaction fingerprints. Metrics that are viable alternatives to the commonly used Tanimoto coefficient were identified based on a comparison with an ideal reference metric (consensus). A careful selection of the applied bits (interaction definitions) and IFP filtering rules can improve the results of virtual screening (in terms of their agreement with the consensus metric). The open-source Python package FPKit was introduced for the similarity calculations and IFP filtering; it is available at https://github.com/davidbajusz/fpkit.
QSAR/QSPR (quantitative structure‐activity/property relationship) modeling has been a prevalent approach in various, overlapping sub‐fields of computational, medicinal and environmental chemistry for decades. The generation and selection of molecular descriptors is an essential part of this process. In typical QSAR workflows, the starting pool of molecular descriptors is rationalized based on filtering out descriptors which are (i) constant throughout the whole dataset, or (ii) very strongly correlated to another descriptor. While the former is fairly straightforward, the latter involves a level of subjectivity when deciding what exactly is considered to be a strong correlation. Despite that, most QSAR modeling studies do not report on this step. In this study, we examine in detail the effect of various possible descriptor intercorrelation limits on the resulting QSAR models. Statistical comparisons are carried out based on four case studies from contemporary QSAR literature, using a combined methodology based on sum of ranking differences (SRD) and analysis of variance (ANOVA).
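The two-step pre-filter described above can be sketched in a few lines of pandas. This is a generic sketch, not the authors' code; the correlation limit (0.95 here) is exactly the subjective choice whose effect the study examines, and which-member-of-a-pair-to-drop is another common implementation detail that varies between workflows.

```python
# Hedged sketch of a descriptor pre-filter: (i) drop constant columns,
# then (ii) drop one member of each descriptor pair whose absolute
# Pearson correlation exceeds a chosen limit.
import numpy as np
import pandas as pd

def filter_descriptors(df, corr_limit=0.95):
    df = df.loc[:, df.nunique() > 1]   # (i) remove constant descriptors
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns
               if (upper[col] > corr_limit).any()]
    return df.drop(columns=to_drop)    # (ii) remove correlated ones

# Toy descriptor table: d2 = 2 * d1 (perfectly correlated), d3 constant
descriptors = pd.DataFrame({"d1": [1, 2, 3, 4],
                            "d2": [2, 4, 6, 8],
                            "d3": [5, 5, 5, 5],
                            "d4": [1, 0, 2, 1]})
print(filter_descriptors(descriptors).columns.tolist())
```

Note that changing `corr_limit` from, say, 0.99 to 0.80 can remove very different numbers of descriptors, which is why reporting this value matters.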
Hungarians arrived at the Carpathian Basin at around 895–900 and after a long journey from the east they occupied the interior plains, mostly the river valleys (in Hungarian history, this event is referred to as the Conquest). The previous tribal alliance had slowly disintegrated by the time of King Stephen I (1001–1038), when pagan beliefs were replaced by Christianity. The peripheral areas of the Kingdom of Hungary, however, were typically uninhabited until the 12th century, when the ethnic landscape started changing with the arrival of Saxon settlers, Slavs, Romanians, and Pechenegs. We have no Hungarian written sources from the time preceding the Conquest. The early Latin (less frequently Greek) written sources contain Hungarian words and expressions only sporadically, and they are mostly proper names designating places. However, due to their early appearance and low number, these have proved to be truly valuable for linguistics and historical studies exploring the early history of Hungarians and the ethnic and population history of the contemporary Carpathian Basin. In this respect, the settlement names rooted in ethnonyms have a key role, as they also shed light on relations between Hungarians and other peoples. This paper studies settlement names that may refer to Eastern Slavic settlers designated by the ethnonym orosz in the medieval Hungarian language. The ethnic groups designated by this name were first registered in the 11th–12th century; however, groups of Slavs could have joined the Hungarian populace before the Conquest. The study shows that the highest proportion of settlement names derived from this ethnonym are found in the northeastern, northern, as well as eastern regions of early medieval Hungary, mostly along the border of the country.
The author describes the most frequent name formation patterns that can also be used for relative dating of oikonyms, and discusses the extent to which these data may be useful for the reconstruction of the ethnic landscape of medieval Hungary.
Rapid population growth necessitates a continuous increase in industrial productivity, with a concomitant environmental burden. During the past few years, nanostructured carbon materials have proved their effectiveness in reducing the usage of several hazardous substances, owing to their distinct characteristics. These properties depend on the type of feedstock and the parameters of pyrolysis. We have developed multivariate prediction models to determine several physical properties of nanostructured carbon samples, which are usually calculated from the adsorption isotherms. Adsorption measurement is a time-consuming and exhaustive process. Therefore, our goal was to provide a fast and environmentally friendly alternative to current methods to determine micropore volume, specific surface area and total pore volume, based on FTIR spectroscopy coupled with chemometric methods. Moreover, we have created classification models, which are capable of predicting the used feedstock materials from IR spectra. Our support vector machine–based classification model had the best accuracy values, above 0.86. Our classification and regression models have excellent performance and were properly validated, thus they are good alternatives for the robust and fast determination of the important qualitative and quantitative features of carbon samples.
• OPLS-based models were developed for the determination of SBET, V0.98 and Vmicro.
• Feedstock materials were accurately classified by machine learning (ML) models.
• IR-based ML models offer an alternative to adsorption isotherms for nanostructured carbons.
Molecular dynamics (MD) is a core methodology of molecular modeling and computational design for the study of the dynamics and temporal evolution of molecular systems. MD simulations have particularly benefited from the rapid increase of computational power that has characterized the past decades of computational chemical research, being the first method to be successfully migrated to the GPU infrastructure. While new-generation MD software is capable of delivering simulations on an ever-increasing scale, relatively less effort is invested in developing postprocessing methods that can keep up with the quickly expanding volumes of data that are being generated. Here, we introduce a new idea for sampling frames from large MD trajectories, based on the recently introduced framework of extended similarity indices. Our approach presents a new, linearly scaling alternative to the traditional approach of applying a clustering algorithm that usually scales as a quadratic function of the number of frames. When showcasing its usage on case studies with different system sizes and simulation lengths, we have registered speedups of up to 2 orders of magnitude, as compared to traditional clustering algorithms. The conformational diversity of the selected frames is also noticeably higher, which is a further advantage for certain applications, such as the selection of structural ensembles for ligand docking. The method is available open-source at https://github.com/ramirandaq/MultipleComparisons.
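The key scaling argument can be illustrated with a heavily simplified sketch. This is not the published extended-similarity algorithm; it only mirrors its central idea, that a whole set of frames can be summarized in one O(n) pass (here, via per-bit counts and a majority-vote consensus), after which each frame is scored against that summary, instead of computing an O(n²) pairwise matrix for clustering.

```python
# Linearly scaling sketch of diversity-biased frame picking, loosely
# inspired by extended (n-ary) similarity indices. Feature matrix,
# consensus rule and scoring are simplified placeholder choices.
import numpy as np

def pick_diverse_frames(features, n_pick):
    """features: (n_frames, n_bits) binary matrix (e.g. contacts per
    frame); returns indices of the n_pick frames least similar to the
    trajectory-wide consensus."""
    features = np.asarray(features)
    counts = features.sum(axis=0)                   # single O(n) pass
    consensus = counts >= features.shape[0] / 2     # majority-vote bits
    # Tanimoto of every frame to the consensus, fully vectorized
    inter = (features & consensus).sum(axis=1)
    union = (features | consensus).sum(axis=1)
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return np.argsort(sim)[:n_pick]                 # most dissimilar first

# Toy trajectory: 4 near-identical frames plus one outlier conformation
frames = np.array([[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]])
print(pick_diverse_frames(frames, 1))
```

Because every step is a columnwise or rowwise reduction, the cost grows linearly with the number of frames, which is the property that enables the reported speedups over quadratic clustering.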
• Sum of ranking differences as multicriteria decision making.
• Pareto optimal solutions coupled with ANOVA.
• Selected case studies in many subfields of food chemistry: chemical analysis, food engineering, technology, food microbiology, quality control, food sensory analysis.
Finding optimal solutions usually requires multicriteria optimization. The sum of ranking differences (SRD) algorithm can efficiently solve such problems. Its principles and earlier applications will be discussed here, along with meta-analyses of papers published in various subfields of food science, such as analytics in food chemistry, food engineering, food technology, food microbiology, quality control, and sensory analysis. Carefully selected real case studies give an overview of the wide range of applications for multicriteria optimizations, using a free, easy-to-use and validated method. Results are presented and discussed in a way that helps scientists and practitioners, who are less familiar with multicriteria optimization, to integrate the method into their research projects. The utility of SRD, optionally coupled with other statistical methods such as ANOVA, is demonstrated on altogether twelve case studies, covering diverse method comparison and data evaluation scenarios from various subfields of food science.