The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time, REACH encourages the use of alternatives to animal testing, which include predictions from quantitative structure–activity relationship (QSAR) models. The aim of this study was to build QSAR models to predict ready biodegradation of chemicals by using different modeling methods and types of molecular descriptors. Particular attention was given to data screening and validation procedures in order to build predictive models. Experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE): 837 and 218 molecules were used for calibration and testing purposes, respectively. In addition, models were further evaluated using an external validation set consisting of 670 molecules. Classification models were produced in order to discriminate biodegradable and nonbiodegradable chemicals by means of different mathematical methods: k nearest neighbors, partial least squares discriminant analysis, and support vector machines, as well as their consensus models. The proposed models and the derived consensus analysis demonstrated good classification performances with respect to already published QSAR models on biodegradation. Relationships between the molecular descriptors selected in each QSAR model and biodegradability were evaluated.
Kohonen maps and Counterpropagation Neural Networks are two of the most popular learning strategies based on Artificial Neural Networks. Kohonen Maps (or Self Organizing Maps) are basically self-organizing systems which are capable of solving unsupervised rather than supervised problems, while Counterpropagation Artificial Neural Networks are very similar to Kohonen maps, but an output layer is added to the Kohonen layer in order to handle supervised modelling. Recently, modifications of Counterpropagation Artificial Neural Networks have led to new supervised neural network strategies, such as Supervised Kohonen Networks and XY-fused Networks.
In this paper, the Kohonen and CP-ANN toolbox for MATLAB is described. This is a collection of modules for calculating Kohonen maps and derived methods for supervised classification, such as Counterpropagation Artificial Neural Networks, Supervised Kohonen Networks and XY-fused Networks. The toolbox comprises a graphical user interface (GUI), which allows the calculation in an easy-to-use graphical environment. It aims to be useful for both beginners and advanced users of MATLAB. The use of the toolbox is discussed here with an appropriate practical example.
► This toolbox allows the calculation of Kohonen maps and derived supervised methods.
► CPANN, SKN, and XYF Artificial Neural Networks can be calculated.
► The toolbox comprises a graphical user interface (GUI).
► The GUI allows the calculation in an easy-to-use graphical environment.
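As an illustration of the unsupervised Kohonen map described above, the following is a minimal Python sketch, not the toolbox's MATLAB code. The function names, the square grid, the linearly decaying learning rate and the Gaussian neighbourhood are all assumptions chosen for brevity.

```python
import numpy as np


def train_kohonen_map(X, grid=(4, 4), epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small Kohonen map (hypothetical helper, illustrative only).

    X is an (n_samples, n_variables) array, assumed to be scaled to [0, 1].
    Returns the neuron weights with shape (rows, cols, n_variables).
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, X.shape[1]))
    # Grid coordinates of every neuron, used to measure map distances.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(X)
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # assumed shrinking radius
            x = X[i]
            # Winning neuron: the one whose weights are closest to the sample.
            dist = np.linalg.norm(weights - x, axis=2)
            win = np.unravel_index(np.argmin(dist), dist.shape)
            # Gaussian neighbourhood around the winner on the map grid.
            grid_d2 = np.sum((coords - np.array(win)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights


def winning_neuron(weights, x):
    """Return the (row, col) position of the neuron closest to sample x."""
    dist = np.linalg.norm(weights - np.asarray(x, dtype=float), axis=2)
    return tuple(int(v) for v in np.unravel_index(np.argmin(dist), dist.shape))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    low = 0.1 + 0.03 * rng.standard_normal((20, 3))
    high = 0.9 + 0.03 * rng.standard_normal((20, 3))
    som = train_kohonen_map(np.vstack([low, high]))
    print(winning_neuron(som, [0.1, 0.1, 0.1]), winning_neuron(som, [0.9, 0.9, 0.9]))
```

A supervised variant in the spirit of counterpropagation would add a second weight layer holding class information for each neuron; that extension is left out of this sketch.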
Consensus strategies have been widely applied in many different scientific fields, based on the assumption that the fusion of several sources of information increases the outcome reliability. Despite the widespread application of consensus approaches, their advantages in quantitative structure–activity relationship (QSAR) modeling have not been thoroughly evaluated, mainly due to the lack of appropriate large-scale data sets. In this study, we evaluated the advantages and drawbacks of consensus approaches compared to single classification QSAR models. To this end, we used a data set of three properties (androgen receptor binding, agonism, and antagonism) for approximately 4000 molecules with predictions performed by more than 20 QSAR models, made available in a large-scale collaborative project. The individual QSAR models were compared with two consensus approaches, majority voting and the Bayes consensus with discrete probability distributions, in both protective and nonprotective forms. Consensus strategies proved to be more accurate and to better cover the analyzed chemical space than individual QSARs on average, thus motivating their widespread application for property prediction. Scripts and data to reproduce the results of this study are available for download.
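The majority-voting idea mentioned above can be sketched in a few lines of Python. This is a plain majority vote under stated assumptions, not the scripts released with the study: a model that gives no prediction is encoded as `None`, ties are left unclassified, and the protective and nonprotective variants are not reproduced. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(predictions: Sequence[Optional[int]]) -> Optional[int]:
    """Return the most voted class, or None if no model predicts or votes tie."""
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None  # no model covers this molecule
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most voted classes
    return ranked[0][0]


def consensus_predictions(table: Sequence[Sequence[Optional[int]]]) -> List[Optional[int]]:
    """Apply majority voting row by row (one row per molecule, one column per model)."""
    return [majority_vote(row) for row in table]


if __name__ == "__main__":
    model_outputs = [
        [1, 1, 0],           # two of three models vote for class 1
        [0, None, 0],        # one model gives no prediction
        [1, 0, None],        # tie, left unclassified
        [None, None, None],  # outside the coverage of every model
    ]
    print(consensus_predictions(model_outputs))  # prints [1, 0, None, None]
```

Because a molecule is classified whenever at least one model covers it and the votes do not tie, a consensus of this kind can cover more of the chemical space than any single model.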
Multivariate regression is a fundamental supervised chemometric approach that defines the relationship between a set of independent variables and a quantitative response. It enables the subsequent prediction of the response for future samples, thus avoiding its experimental measurement. Regression approaches have been widely applied for data analysis in different scientific fields.
In this paper, we describe the regression toolbox for MATLAB, which is a collection of modules for calculating some well-known regression methods: Ordinary Least Squares (OLS), Partial Least Squares (PLS), Principal Component Regression (PCR), Ridge regression, and local regression methods based on sample similarities, such as Binned Nearest Neighbours (BNN) and k-Nearest Neighbours (kNN). Moreover, the toolbox includes modules to couple regression approaches with supervised variable selection based on All Subset models, Forward Selection, Genetic Algorithms and Reshaped Sequential Replacement. The toolbox is freely available at the Milano Chemometrics and QSAR Research Group website and provides a graphical user interface (GUI), which allows the calculation in a user-friendly graphical environment.
• The regression toolbox for MATLAB is a collection of modules freely available via internet.
• The toolbox calculates major regression approaches (OLS, PCR, PLS, ridge and similarity based).
• Regression can be coupled with supervised variable selection.
• An easy-to-use graphical user interface (GUI) environment is available.
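To make two of the listed methods concrete, the sketch below fits OLS and ridge regression with NumPy. It is written in Python for illustration and does not reproduce the toolbox's MATLAB functions; the function names, the mean-centring step and the symbol `lam` for the ridge penalty are assumptions.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary Least Squares; returns [intercept, b1, ..., bp]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def fit_ridge(X, y, lam):
    """Ridge regression on mean-centred data; the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Closed-form solution of (Xc'Xc + lam * I) b = Xc'yc.
    slopes = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ slopes
    return np.concatenate([[intercept], slopes])


def predict(coef, X):
    """Predict the response from coefficients returned by fit_ols or fit_ridge."""
    return coef[0] + X @ coef[1:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = 2.0 + 3.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
    print("OLS:  ", fit_ols(X, y))
    print("Ridge:", fit_ridge(X, y, lam=10.0))
```

With `lam = 0` the ridge solution coincides with OLS; increasing `lam` shrinks the slopes towards zero, which is the usual motivation for ridge regression when variables are correlated.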
Approaches of high-level data fusion, also known as consensus, combine predictions of individual models to increase reliability and overcome limitations of single models. Consensus strategies are frequently applied in the framework of Quantitative Structure–Activity Relationships (QSARs) to reduce the uncertainties in the prediction of molecular activities and provide better accuracy of the model outcomes. However, specific regions of the chemical space may systematically be associated with low accuracy and even consensus modelling cannot improve prediction reliability through the multiple outcomes of individual models.
In this study, a new heuristic metric to assess the degree of accuracy of consensus predictions in the chemical space is proposed. This metric can assist the mapping of reliability in prediction and enhance the delineation of a safe zone, where consensus predictions are expected to have better accuracy. The new metric is calculated by kernel-based potential functions and it can be used in the framework of both classification and regression consensus modelling. Four case studies, including extensive datasets for consensus modelling, were used to test the proposed approach.
Results demonstrated that a potential can be associated with regions of the chemical space as a function of accuracy of consensus modelling and it can be used to enable the mapping of reliability in prediction and the definition of specific regions where predictions are expected to be more reliable.
• Consensus (high level data fusion) can reduce uncertainties in predictions.
• Areas of the chemical space may systematically be related with low accuracy.
• We propose a new metric to assess the degree of accuracy of consensus predictions.
• Four extensive datasets for consensus modelling were used to test it.
• Results demonstrated that kernel-based potential can map prediction reliability.
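One way to picture a kernel-based potential over the chemical space is a kernel-weighted local accuracy, sketched below in Python. This is an illustrative stand-in and not the metric proposed in the study: the Gaussian kernel, the bandwidth `h`, the 0/1 encoding of correct consensus predictions and the function names are all assumptions.

```python
import numpy as np


def gaussian_kernel(query, X, h):
    """Gaussian kernel values between one query point and every row of X."""
    d2 = np.sum((X - query) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * h ** 2))


def accuracy_potential(query, X, correct, h=1.0):
    """Kernel-weighted fraction of correct consensus predictions near the query.

    X holds the descriptor vectors of molecules with known outcomes and
    `correct` is 1 where the consensus prediction was right, 0 otherwise.
    Returns NaN when the query is too far from every molecule.
    """
    X = np.asarray(X, dtype=float)
    correct = np.asarray(correct, dtype=float)
    weights = gaussian_kernel(np.asarray(query, dtype=float), X, h)
    total = weights.sum()
    if total == 0.0:
        return float("nan")
    return float(weights @ correct / total)


if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
    correct = np.array([1, 1, 1, 0, 0, 0])
    print(accuracy_potential([0.0, 0.0], X, correct))  # region of reliable predictions
    print(accuracy_potential([5.0, 5.0], X, correct))  # region of unreliable predictions
```

Thresholding such a potential would give one possible delineation of a zone where consensus predictions are expected to be more reliable.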
Nuclear receptors (NRs) are key regulators of human health and constitute a relevant target for medicinal chemistry applications as well as for toxicological risk assessment. Several open databases dedicated to small molecules that modulate NRs exist; however, depending on their final aim (i.e., adverse effect assessment or drug design), these databases contain a different amount and type of annotated molecules, along with a different distribution of experimental bioactivity values. Stemming from these considerations, in this work we aim to provide a unified dataset, NURA (NUclear Receptor Activity) dataset, collecting curated information on small molecules that modulate NRs, to be intended for both pharmacological and toxicological applications. NURA contains bioactivity annotations for 15,247 molecules and 11 selected NRs, and it was obtained by integrating and curating data from toxicological and pharmacological databases (i.e., Tox21, ChEMBL, NR-DBIND and BindingDB). Our results show that NURA dataset is a useful tool to bridge the gap between toxicology- and medicinal-chemistry-related databases, as it is enriched in terms of number of molecules, structural diversity and covered atomic scaffolds compared to the single sources. To the best of our knowledge, NURA dataset is the most exhaustive collection of small molecules annotated for their modulation of the chosen nuclear receptors. NURA dataset is intended to support decision-making in pharmacology and toxicology, as well as to contribute to data-driven applications, such as machine learning. The dataset and the data curation pipeline can be downloaded free of charge on Zenodo at the following DOI: https://doi.org/10.5281/zenodo.3991561.
Hyphenated chromatography is among the most popular analytical techniques in omics-related research. While great advancements have been achieved on the experimental side, the same is not true for the extraction of the relevant information from chromatographic data. Extensive signal preprocessing is required to remove the signal of the baseline, resolve the time shifts of peaks from sample to sample and to properly estimate the spectra and concentrations of co-eluting compounds.
Among several available strategies, curve resolution approaches, such as PARAFAC2, ease the deconvolution and the quantification of chemicals. However, not all resolved profiles are relevant. For example, some take into account the baseline, others the chemical compounds. Thus, it is necessary to distinguish the profiles describing relevant chemistry. With the aim to assist researchers in this selection phase, we have tried three different classification algorithms (convolutional and recurrent neural networks, k-nearest neighbours) for the automatic identification of GC-MS elution profiles resolved by PARAFAC2.
To this end, we have manually labelled more than 170,000 elution profiles in the following four classes: ‘Peak’, ‘Cutoff peak’, ‘Baseline’ and ‘Others’ in order to train, validate and test the classification models.
The results highlight two main points: i) neural networks seem to be the best solution for this specific classification task, as confirmed by the overall quality of the classification; ii) the quality of the input data is crucial to maximize the modelling performances.
• A new approach to automatically label curve-resolved elution profiles has been developed.
• More than 170,000 elution profiles labelled.
• Three different classification methods developed.
• Results significantly improved compared to state-of-the-art.
Ranking and multi-criteria decision-making approaches are useful tools to analyse multivariate data and obtain useful insights into data structure and the relationships between samples and variables. In this study, we present a new ranking approach, named Deep Ranking Analysis by Power Eigenvectors (DRAPE), which is based on the Power-Weakness Ratio analysis and provides a set of sequential rankings. Such a sequential ranking procedure allows deeper insights into the analysed dataset to be gathered. Moreover, by a “retro”-regression procedure, the relevance of each variable in determining the final rankings can be assessed, while a consensus ranking can be obtained by a Principal Component Analysis (PCA). In this study, we present the theory of the novel method, and show three applications to real datasets.
• DRAPE is a ranking method based on the Power-Weakness Ratio.
• Different thresholds are automatically selected from the tournament table.
• For each threshold a ranking is obtained.
• Regression between PWR ranks and variables provides insight to ranking results.
• A consensus analysis can be performed on the whole set of rankings.
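The Power-Weakness Ratio on which DRAPE builds can be sketched as follows, assuming the usual formulation in which the power of each sample is taken from the dominant eigenvector of the tournament table and the weakness from that of its transpose. This Python sketch is not the DRAPE procedure itself: the threshold selection and the sequential rankings are omitted, and the equal criterion weights, the half-point for ties and the function names are assumptions.

```python
import numpy as np


def tournament_table(X):
    """Entry (i, j) is the fraction of criteria on which sample i beats sample j.

    All criteria are assumed to be maximised and equally weighted; a tie on a
    criterion counts as half a win for both samples (an assumption).
    """
    X = np.asarray(X, dtype=float)
    wins = (X[:, None, :] > X[None, :, :]).sum(axis=2)
    ties = (X[:, None, :] == X[None, :, :]).sum(axis=2)
    T = (wins + 0.5 * ties) / X.shape[1]
    np.fill_diagonal(T, 0.0)
    return T


def dominant_eigenvector(M):
    """Eigenvector of the largest real eigenvalue, scaled to sum to one."""
    values, vectors = np.linalg.eig(M)
    k = np.argmax(values.real)
    v = np.abs(vectors[:, k].real)
    return v / v.sum()


def power_weakness_ratio(X):
    """Ratio between the power and the weakness of each sample."""
    T = tournament_table(X)
    power = dominant_eigenvector(T)       # how strongly a sample wins
    weakness = dominant_eigenvector(T.T)  # how strongly a sample loses
    return power / weakness


if __name__ == "__main__":
    # Three samples described by three criteria (illustrative numbers only).
    X = [[3, 3, 1],
         [2, 2, 3],
         [1, 1, 2]]
    scores = power_weakness_ratio(X)
    print(np.argsort(-scores))  # sample indices from best to worst
```

The sketch assumes that every sample wins at least one comparison, so that the weakness values are strictly positive; a sample that loses on every criterion against every other sample would need separate handling.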
• Comprehensive review of ligand-based classifiers to predict molecular taste.
• Machine learning models cover a period between 1980 and 2022.
• Fifty-two reported studies were categorized into six broad categories.
• Classifiers to predict sweetness and bitterness are the most predominant.
• Prediction of umaminess and sourness is an emerging topic.
The capacity to discriminate safe from dangerous compounds has played an important role in the evolution of species, including human beings. Highly evolved senses such as taste allow humans to navigate and survive in the environment through information that reaches the brain as electrical pulses. Specifically, taste receptors provide multiple bits of information about the substances that are introduced orally. These substances could be pleasant or not according to the taste responses that they trigger. Tastes have been classified into basic (sweet, bitter, umami, sour and salty) or non-basic (astringent, chilling, cooling, heating, pungent), while some compounds are considered as multitastes, taste modifiers or tasteless. Classification-based machine learning approaches are useful tools to develop mathematical relationships that predict the taste class of new molecules based on their chemical structure. This work reviews the history of multicriteria quantitative structure-taste relationship modelling, starting from the first ligand-based (LB) classifier proposed in 1980 by Lemont B. Kier and concluding with the most recent studies published in 2022.