Principal Component Analysis is a multivariate method that projects data into a reduced space defined by orthogonal principal components, which are linear combinations of the original variables. In this way, data dimensionality can be reduced and noise can be excluded from the subsequent analysis, which greatly facilitates data interpretation. For these reasons, Principal Component Analysis is nowadays the most common chemometric strategy for unsupervised exploratory data analysis.
In this paper, the PCA toolbox for MATLAB is described. This is a collection of modules for calculating Principal Component Analysis, as well as Cluster Analysis and Multidimensional Scaling, two other well-known multivariate methods for unsupervised data exploration. The toolbox is freely available via the Internet and comprises a graphical user interface (GUI) that allows calculations to be performed in an easy-to-use graphical environment. It aims to be useful for both beginners and advanced users. The use of the toolbox is discussed here with a practical example.
• The PCA toolbox for MATLAB is a collection of modules freely available via the Internet.
• The toolbox calculates PCA, Cluster Analysis and Multidimensional Scaling.
• An easy-to-use graphical user interface (GUI) environment is available.
• Theory of the methods, toolbox features, and an example of application are described.
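The projection described above can be sketched in a few lines of NumPy (a minimal illustration of the PCA idea, not the MATLAB toolbox itself): center the data, eigendecompose the covariance matrix, and project onto the leading eigenvectors.

```python
import numpy as np

def pca(X, n_components=2):
    """Project data onto its first principal components.

    Minimal sketch: the loadings are the eigenvectors of the covariance
    matrix, i.e. the linear combinations of the original variables, and
    the scores are the coordinates of the samples in the reduced space.
    """
    Xc = X - X.mean(axis=0)                       # mean-center each variable
    cov = np.cov(Xc, rowvar=False)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # sort by explained variance
    loadings = eigvecs[:, order[:n_components]]   # orthogonal directions
    scores = Xc @ loadings                        # samples in the reduced space
    explained = eigvals[order[:n_components]] / eigvals.sum()
    return scores, loadings, explained
```

Because the loadings are orthonormal, the components are uncorrelated, and discarding the trailing components is what removes noise from the subsequent analysis.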
The assessment of classification performance can be based on class indices, such as sensitivity, specificity and precision, which describe the classification results achieved on each modelled class. However, in several situations it is useful to represent the global classification performance with a single number. Several measures have therefore been introduced in the literature for this purpose, accuracy being the best known and most widely used. These metrics were generally proposed for binary classification tasks and can behave differently depending on the classification scenario.
In this study, different global measures of classification performance are compared by means of the results achieved on an extended set of real multivariate datasets. The systematic comparison is carried out through multivariate analysis. Specific indices are then investigated further to understand how the presence of unbalanced classes and the number of modelled classes influence their behaviour. Finally, this work introduces a set of benchmark values based on different random classification scenarios. These benchmark thresholds can serve as an initial criterion to accept or reject a classification model on the basis of its performance.
• A systematic comparison of global measures of classification performance is carried out.
• Benchmark values corresponding to random classification are defined for each measure.
• Classification measures are compared on an extended number of real multivariate datasets.
• Biases related to unbalanced class distributions and the number of classes are evaluated.
• Numerical results and MATLAB code for the calculation of the classification measures are provided.
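The class indices and global measures named above can be computed directly from a binary confusion matrix. The sketch below (plain Python, not the paper's MATLAB code) also includes balanced accuracy as one example of a global measure that behaves differently from accuracy under unbalanced classes.

```python
def binary_metrics(tp, fn, fp, tn):
    """Class indices and global measures from a binary confusion matrix.

    Minimal sketch: sensitivity and specificity describe each class,
    accuracy summarizes globally, and balanced accuracy (the average of
    the class recalls) is less biased when class sizes are unbalanced.
    """
    sensitivity = tp / (tp + fn)        # true-positive rate (recall)
    specificity = tn / (tn + fp)        # true-negative rate
    precision = tp / (tp + fp)          # positive predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced = (sensitivity + specificity) / 2
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "balanced": balanced}
```

For example, with 5 true positives, 5 false negatives, 0 false positives and 90 true negatives, accuracy is 0.95 while balanced accuracy is only 0.75, showing how unbalanced classes can inflate accuracy.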
One of the OECD principles for model validation requires defining the Applicability Domain (AD) of QSAR models. This is important because reliable predictions are generally limited to query chemicals structurally similar to the training compounds used to build the model. Characterization of the interpolation space is therefore central to defining the AD, and in this study some existing descriptor-based approaches performing this task are discussed and compared by implementing them on validated datasets from the literature. The algorithms adopted by the different approaches define the interpolation space in several ways, while the chosen thresholds contribute significantly to the extrapolations. For each dataset and approach implemented in this study, the comparison was carried out by considering the model statistics and the relative position of the test set with respect to the training space.
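As one concrete illustration of a descriptor-based interpolation-space check (the leverage approach is a common choice in the QSAR literature, though not necessarily among those compared in the paper), a query compound can be flagged as outside the AD when its leverage exceeds the usual warning threshold:

```python
import numpy as np

def leverage_ad(X_train, X_query):
    """Leverage-based applicability-domain check (illustrative sketch).

    A query compound with descriptor vector x has leverage
    h = x (X'X)^-1 x', computed against the training descriptor matrix X.
    A common warning threshold is h* = 3(p + 1)/n, where p is the number
    of descriptors and n the number of training compounds.
    """
    n, p = X_train.shape
    XtX_inv = np.linalg.pinv(X_train.T @ X_train)
    # leverage of each query row: x_i (X'X)^-1 x_i'
    h = np.einsum("ij,jk,ik->i", X_query, XtX_inv, X_query)
    h_star = 3 * (p + 1) / n
    return h, h <= h_star
```

Compounds far from the training space get large leverages and fall outside the threshold, which is one way the "relative position of the test set with respect to the training space" can be quantified.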
Neural networks are rapidly gaining popularity in chemical modeling and Quantitative Structure–Activity Relationship (QSAR) studies thanks to their ability to handle multitask problems. However, the outcomes of neural networks depend on the tuning of several hyperparameters, whose small variations can often strongly affect their performance. Hence, optimization is a fundamental step in training neural networks, although in many cases it can be computationally very expensive. In this study, we compared four of the most widely used approaches for tuning hyperparameters, namely grid search, random search, the tree-structured Parzen estimator, and genetic algorithms, on three multitask QSAR datasets. We focused on parsimonious optimization, taking into account not only the performance of the neural networks but also the computational time required. Furthermore, since the optimization approaches do not directly provide information about the influence of the hyperparameters, we applied experimental design strategies to determine their effects on neural network performance. We found that genetic algorithms, the tree-structured Parzen estimator, and random search require on average 0.08% of the hours required by grid search; in addition, the tree-structured Parzen estimator and genetic algorithms provide better results than random search.
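The cost gap between grid search and the cheaper strategies comes from the fact that random search evaluates a fixed budget of configurations rather than the full Cartesian product of the grid. A minimal sketch (with a toy objective standing in for cross-validated network performance; the space and objective are illustrative, not the study's):

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Random search over a discrete hyperparameter space.

    Each trial samples one value per hyperparameter and keeps the
    best-scoring configuration. Cost grows with n_trials, not with
    the product of all grid sizes as in grid search.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# illustrative space and toy objective (best at lr=1e-3, hidden=64, dropout=0.0)
space = {"lr": [1e-4, 1e-3, 1e-2, 1e-1],
         "hidden": [16, 32, 64, 128],
         "dropout": [0.0, 0.2, 0.5]}
toy = lambda c: -abs(c["lr"] - 1e-3) - abs(c["hidden"] - 64) / 100 - c["dropout"]
cfg, score = random_search(toy, space, n_trials=100)
```

Grid search over this space would need all 4 × 4 × 3 = 48 evaluations per repeat; with expensive neural network trainings the same budget argument explains the reported savings.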
We present an optimization of the toroidal self-organizing map (SOM) algorithm for the accurate visualization of hyperspectral data. This represents a significant advancement on our previous work, in which we demonstrated the use of toroidal SOMs for the visualization of time-of-flight secondary ion mass spectrometry (ToF-SIMS) imaging data. We have previously shown that the toroidal SOM can be used, unsupervised, to produce a multicolor similarity map of the analysis area, in which pixels with similar mass spectra are assigned a similar color. Here, we use an additional algorithm, relational perspective mapping (RPM), to produce more accurate visualizations of hyperspectral data. The SOM output is used as an input for the RPM algorithm, a nonlinear dimensionality reduction technique designed to produce a two-dimensional map of high-dimensional data. Using the topological information provided by the SOM, RPM provides complementary distance information. The result is a color scheme that more accurately reflects the local spectral distances between pixels in the data. We exemplify SOM-RPM using ToF-SIMS imaging data from a mouse tumor tissue section. The similarity maps produced are compared with those produced by two leading hyperspectral visualization techniques in the field of mass spectrometry imaging: t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP). We evaluate the performance of each technique both qualitatively and quantitatively, investigating the correlations between distances in the models and distances in the data. According to our evaluations, SOM-RPM is highly competitive with both t-SNE and UMAP. Furthermore, the use of a neural network offers distinct advantages in data characterization, which we discuss.
We also show how spectra extracted from regions of interest identified by SOM-RPM can be further analyzed using linear discriminant analysis for the validation and characterization of the surface chemistry.
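The core mechanism behind these similarity maps can be sketched in NumPy (a bare-bones SOM, omitting the toroidal wrapping and the RPM step used in the paper): each sample is assigned to its best-matching unit (BMU), and the BMU and its grid neighbours are pulled toward the sample.

```python
import numpy as np

def train_som(data, rows=8, cols=8, epochs=20, lr0=0.5, seed=0):
    """Minimal (non-toroidal) self-organizing map sketch.

    After training, samples with similar spectra map to nearby units
    on the grid; coloring the units then yields a similarity map in
    which pixel color reflects spectral similarity.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.normal(size=(rows * cols, d))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                     # decaying learning rate
        sigma = max(rows, cols) / 2 * (1 - epoch / epochs) + 0.5
        for x in data[rng.permutation(n)]:
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)   # grid distances to BMU
            nbh = np.exp(-dist2 / (2 * sigma ** 2))         # neighbourhood kernel
            weights += lr * nbh[:, None] * (x - weights)    # pull units toward x
    return weights.reshape(rows, cols, d)
```

In the paper's pipeline, the trained unit weights (codebook vectors) are what RPM then embeds, so that distances on the final map better reflect distances between spectra.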
Mass spectrometry (MS) is widely used for the identification of chemical compounds by matching the experimentally acquired mass spectrum against a database of reference spectra. However, this approach suffers from the limited coverage of existing databases, causing identification to fail when a compound is not present in the database. Among the computational approaches for mining metabolite structures based on MS data, one option is to predict molecular fingerprints from the mass spectra by means of chemometric strategies and then use them to screen compound libraries. This can be carried out by calibrating multi-task artificial neural networks on large datasets of mass spectra, used as inputs, and molecular fingerprints, used as outputs. In this study, we prepared a large LC-MS/MS dataset from an on-line open repository. These data were used to train and evaluate deep-learning-based approaches to predict molecular fingerprints and retrieve the structure of unknown compounds from their LC-MS/MS spectra. The effects of data sparseness and the impact of different data curation and dimensionality reduction strategies on output accuracy were evaluated. Moreover, extensive diagnostics were carried out to evaluate modelling advantages and drawbacks as a function of the explored chemical space.
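The library-screening step described above amounts to ranking candidate structures by the similarity of their reference fingerprints to the fingerprint predicted from the spectrum. A minimal sketch using Tanimoto similarity (the compound names and bit patterns below are toy examples, not real fingerprints):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints (bit lists)."""
    both = sum(x & y for x, y in zip(a, b))    # bits on in both
    either = sum(x | y for x, y in zip(a, b))  # bits on in either
    return both / either if either else 0.0

def screen_library(predicted_fp, library):
    """Rank library compounds against a predicted fingerprint.

    Minimal sketch of the screening step: the fingerprint predicted
    from an MS/MS spectrum is compared with each candidate's reference
    fingerprint, and candidates are ranked by Tanimoto similarity.
    `library` maps compound names to binary bit lists.
    """
    return sorted(library.items(),
                  key=lambda kv: tanimoto(predicted_fp, kv[1]),
                  reverse=True)
```

The top-ranked candidates are then the proposed identities for the unknown compound, which is how fingerprint prediction extends identification beyond the spectra present in the database.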
Kohonen maps (or Self-Organizing Maps, SOMs) and Counterpropagation Artificial Neural Networks (CP-ANNs) are two of the most popular neural networks proposed in the literature, and their use in applications related to multivariate chemical problems is increasing, since they can handle both supervised and unsupervised problems.
This work presents the Kohonen and CP-ANN toolbox, a collection of MATLAB modules freely available via the Internet (http://www.disat.unimib.it/chm) for calculating these models. A graphical user interface (GUI), which allows easy model calculation and analysis of results, is also provided. The toolbox features are presented by reproducing the classification of a real multivariate dataset. This work is not an attempt to summarize the general applications of Self-Organizing Maps, but to inform chemometricians and practitioners who are not skilled programmers of the existence of a user-friendly MATLAB toolbox for developing unsupervised and supervised SOM models.
Combinatorial approaches to materials discovery offer promising potential for the rapid development of novel polymer systems. Polymer microarrays enable the high-throughput comparison of material physical and chemical properties, such as surface chemistry, and performance characteristics, like cell attachment or protein adsorption, in order to identify correlations that can advance materials development. A challenge for this approach is to accurately discriminate between highly similar polymer chemistries or identify heterogeneities within individual polymer spots. Time-of-flight secondary ion mass spectrometry (ToF-SIMS) offers unique potential in this regard, being capable of describing the chemistry of the outermost layer of a sample with high spatial resolution and chemical sensitivity. However, this comes at the cost of generating large-scale, complex hyperspectral imaging data sets. We have demonstrated previously that machine learning is a powerful tool for interpreting ToF-SIMS images, describing a method for color-tagging the output of a self-organizing map (SOM). This reduces the entire hyperspectral data set to a single reconstructed color similarity map, in which spectral similarity between pixels is represented by color similarity in the map. Here, we apply the same methodology to a ToF-SIMS image of a printed polymer microarray for the first time. We report complete, single-pixel molecular discrimination of the 70 unique homopolymer spots on the array while also identifying intraspot heterogeneities thought to be related to intermixing of the polymer and the pHEMA coating. In this way, we show that the SOM can identify layers of similarity and clusters in the data, both with respect to polymer backbone structures and their individual side groups.
Finally, we relate the output of the SOM analysis with fluorescence data from polymer–protein adsorption studies, highlighting how polymer performance can be visualized within the context of the global topology of the data set.
According to the 2021 World Drug Report, around 275 million people use drugs of abuse, and 36 million people suffer from addiction, fostering a thriving market for illicit substances. In Italy, 30,083 people were reported to the Judicial Authority for offenses in violation of Italian Law D.P.R. 309/1990. These offenses are sentenced after a qualitative-quantitative analysis of the seized materials. Given the large quantity of seized drugs and the need to perform accurate analytical determinations, Italian forensic laboratories struggle to complete analyses quickly, delaying the entire reporting process needed to reach sentencing. For this purpose, a UHPLC-MS/MS-based platform was developed at the University of Milano-Bicocca to support law-enforcement authorities. Software was designed to easily manage street-seizure acquisition, documentation registration, and sampling. A sensitive UHPLC-MS/MS method was fully validated for the quantification of the traditional illicit substances (cocaine, heroin, 6-MAM, morphine, amphetamine, methamphetamine, MDMA, ketamine, GHB, GBL, LSD, trans-∆9-THC, and THCA) at the ppb level. The final report is relayed to the Prefecture in 3-4 days, or even within 24 h for urgent requests. The platform allows for semi-automatic data handling to minimize erroneous results and ensure accurate report generation through standardized procedures.