Ever since its invention half a century ago, flow cytometry has been a major tool for single-cell analysis, fueling advances in our understanding of a variety of complex cellular systems, in particular the immune system. The last decade has witnessed significant technical improvements in available cytometry platforms, such that more than 20 parameters can now be analyzed at the single-cell level by fluorescence-based flow cytometry. The advent of mass cytometry has pushed this limit up to, currently, 50 parameters. However, traditional analysis approaches for the resulting high-dimensional datasets, such as gating on bivariate dot plots, have proven inefficient. Although a variety of novel computational approaches for interpreting these datasets are already available, they have not yet entered the mainstream and remain largely unknown to many immunologists. This review therefore aims to provide a practical overview of novel analysis techniques for high-dimensional cytometry data, including SPADE, t-SNE, Wanderlust, Citrus, and PhenoGraph, and to show how these applications can be used advantageously not only for the most complex datasets but also for standard 14-parameter cytometry datasets.
•A new soybean screening method was developed that does not require calibration curves.
•The excitation emission matrix (EEM) of soybean powder samples was measured.
•Principal component analysis (PCA) and t-SNE were adopted for dimensionality reduction.
•Chemical content-based screening was more accurate by t-SNE than by PCA.
•Screening accuracy based on isoflavone content reached 81.4% by t-SNE.
Measuring the chemical composition of soybeans is time-consuming and laborious, and even simple near-infrared sensors generally require the creation of calibration curves before application. In this study, a new screening method for soybeans without calibration curves was investigated by combining the excitation emission matrix (EEM) with dimensionality reduction analysis. The EEMs of 34 soybean samples were measured, and representative chemical contents, including crude protein, crude oil, and isoflavone contents, were determined by chemical analysis. Two dimensionality reduction methods, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), were applied to the EEM data to obtain two-dimensional plots, which were divided into two regions with large or small amounts of each chemical component. To classify the large or small level of each chemical component, machine learning classification models were constructed on the two-dimensional plots after dimensionality reduction. As a result, classification accuracy was higher with t-SNE than with the combination of PC1 and PC2 from PCA. Furthermore, with t-SNE, the classification accuracy reached over 90% for all the chemical components. These results indicate that t-SNE dimensionality reduction on the soybean EEM has the potential for easy and accurate screening of soybeans, especially based on isoflavone content.
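The pipeline this abstract describes — reduce high-dimensional measurements to two dimensions with PCA and with t-SNE, then train a classifier on each 2-D embedding and compare accuracies — can be sketched as follows. The EEM data are not public, so scikit-learn's digits dataset stands in for the spectra, and a binary label stands in for the large/small chemical-content split; all names here are illustrative, not from the paper.

```python
# Sketch: compare a classifier trained on a 2-D PCA embedding vs. a 2-D
# t-SNE embedding, mirroring the screening comparison in the abstract.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
y_binary = (y >= 5).astype(int)  # stand-in for "large" vs "small" content level

# Two-dimensional embeddings of the same data
emb_pca = PCA(n_components=2).fit_transform(X)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Classify on the 2-D plots, as the study does after dimensionality reduction
clf = KNeighborsClassifier(n_neighbors=5)
acc_pca = cross_val_score(clf, emb_pca, y_binary, cv=5).mean()
acc_tsne = cross_val_score(clf, emb_tsne, y_binary, cv=5).mean()
print(f"PCA 2-D accuracy:   {acc_pca:.3f}")
print(f"t-SNE 2-D accuracy: {acc_tsne:.3f}")
```

Note that t-SNE is fit on all points before cross-validation, which matches classifying regions of a single 2-D plot but is not a fully held-out evaluation.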
As the size and complexity of high-dimensional (HD) cytometry data continue to expand, comprehensive, scalable, and methodical computational analysis approaches are essential. Yet contemporary clustering and dimensionality reduction tools alone are insufficient to analyze or reproduce analyses across large numbers of samples, batches, or experiments. Moreover, approaches that allow for the integration of data across batches or experiments are not well incorporated into computational toolkits to allow for streamlined workflows. Here we present Spectre, an R package that enables comprehensive end-to-end integration and analysis of HD cytometry data from different batches or experiments. Spectre streamlines the analytical stages of raw data pre-processing, batch alignment, data integration, clustering, dimensionality reduction, visualization, and population labelling, as well as quantitative and statistical analysis. Critically, the fundamental data structures used within Spectre, along with the implementation of machine learning classifiers, allow for the scalable analysis of very large HD datasets generated by flow cytometry, mass cytometry, or spectral cytometry. Using open and flexible data structures, Spectre can also be used to analyze data generated by single-cell RNA sequencing or HD imaging technologies, such as Imaging Mass Cytometry. The simple, clear, and modular design of its analysis workflows allows these tools to be used by bioinformaticians and laboratory scientists alike. Spectre is available as an R package or Docker container. R code is available on GitHub (https://github.com/immunedynamics/spectre).
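Spectre itself is an R package, so the sketch below is not its API. It is a generic Python illustration of the analytical stages the abstract lists (pre-processing, clustering, dimensionality reduction, and per-sample quantification), using synthetic data as a stand-in for cytometry events and KMeans as a simple stand-in for FlowSOM-style clustering.

```python
# Generic HD-cytometry-style workflow: transform -> cluster -> embed -> quantify.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_events, n_markers = 3000, 10
raw = rng.lognormal(mean=2.0, sigma=1.0, size=(n_events, n_markers))
samples = rng.choice(["sample_A", "sample_B", "sample_C"], size=n_events)

# 1. Pre-processing: arcsinh transform, standard for cytometry intensities
X = np.arcsinh(raw / 5.0)

# 2. Clustering (KMeans here; real pipelines often use FlowSOM-style methods)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# 3. Dimensionality reduction on a subsample, for visualization
idx = rng.choice(n_events, size=1000, replace=False)
embedding = TSNE(n_components=2, random_state=0).fit_transform(X[idx])

# 4. Quantification: cluster frequencies per sample
df = pd.DataFrame({"sample": samples, "cluster": clusters})
freq = df.groupby(["sample", "cluster"]).size().unstack(fill_value=0)
freq = freq.div(freq.sum(axis=1), axis=0)  # proportions per sample
print(freq.round(3))
```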
Given the intricate, multifaceted nature of financial data in e-commerce enterprises, this article presents a T-DPC algorithm for analyzing financial management in these businesses. The algorithm utilizes the t-SNE method to reduce the dimensionality of financial data, while also implementing an enhanced DPC algorithm based on the K-nearest neighbor concept to cluster the financial data. The results show that the F-measure metrics of the DPC algorithm optimized by t-SNE improve by 16.7% and 3.07% over the baseline DPC algorithm on the PID and Wine datasets, respectively, and its running time is faster than that of the DPC algorithm on the Aggregation, D31, and R15 datasets by 16.2. The algorithm therefore has reference significance for the financial analysis of e-commerce enterprises.
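The combination described above — t-SNE for dimensionality reduction followed by density-peaks clustering (DPC) with a K-nearest-neighbor density estimate — can be sketched as below. This is an illustrative reimplementation of the DPC idea, not the authors' T-DPC code, run here on the Wine dataset mentioned in the abstract.

```python
# Minimal t-SNE + density-peaks clustering (DPC) sketch with KNN-based density.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

X, y = load_wine(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    StandardScaler().fit_transform(X))

k, n_clusters = 10, 3
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(emb).kneighbors(emb)
rho = 1.0 / dists[:, 1:].mean(axis=1)  # KNN-based local density

# delta: distance to the nearest point of strictly higher density
pairwise = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)
order = np.argsort(-rho)                       # descending density
delta = np.full(len(emb), pairwise.max())      # DPC convention for the density peak
nearest_higher = np.full(len(emb), -1)
for rank, i in enumerate(order[1:], start=1):
    higher = order[:rank]
    j = higher[np.argmin(pairwise[i, higher])]
    delta[i], nearest_higher[i] = pairwise[i, j], j

# Cluster centers maximize rho * delta; other points follow the density gradient
centers = np.argsort(-(rho * delta))[:n_clusters]
labels = np.full(len(emb), -1)
labels[centers] = np.arange(n_clusters)
for i in order:
    if labels[i] == -1:
        labels[i] = labels[nearest_higher[i]]
print(np.bincount(labels))  # cluster sizes
```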
The explainability of manifold learning is rarely investigated, despite an urgent need from both AI theory and practice. In this study, we propose a novel degree of locality preservation (DLP) approach to study the interpretability of manifold learning. We estimate the DLPs of the state-of-the-art manifold learning methods t-SNE and UMAP, the related methods LLE, HLLE, and LTSA, and the widely used PCA across benchmark datasets classified as low-dimensional and high-dimensional data.
Our study provides well-founded explanations of the manifold learning methods in terms of their DLPs. The order of their DLPs follows t-SNE > UMAP > LLE > HLLE/PCA/LTSA, though there are some exceptions for certain high-dimensional data. Both t-SNE and UMAP demonstrate an embedding distance amplification mechanism under the Euclidean distance that forces the latent local data geometry to stand out in dimension reduction. This not only explains why t-SNE and UMAP have higher DLPs than their peers, but also indicates that they are not locally isometric under the Euclidean distance. Furthermore, it reveals that t-SNE and UMAP embeddings demonstrate a similar nonlinear nature in dimension reduction, besides larger (smaller) data variances for low (high)-dimensional data. To the best of our knowledge, this study is the first work on the explainability of manifold learning. The proposed methods and corresponding results can also be extended to other dimension reduction techniques.
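The DLP measure is the authors' own construction, but a rough, readily available proxy for locality preservation is scikit-learn's `trustworthiness` score, which measures how many of a point's embedded neighbors were also its neighbors in the original space. The sketch below compares t-SNE and PCA under this proxy (UMAP, LLE, etc. are omitted to keep dependencies minimal); it illustrates the ordering claim, not the paper's exact method.

```python
# Compare locality preservation of t-SNE vs. PCA via trustworthiness.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_digits(return_X_y=True)
X = X[:600]  # subsample to keep t-SNE fast

emb_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
emb_pca = PCA(n_components=2).fit_transform(X)

# Fraction of embedded k-nearest neighbors that are true high-D neighbors
t_tsne = trustworthiness(X, emb_tsne, n_neighbors=10)
t_pca = trustworthiness(X, emb_pca, n_neighbors=10)
print(f"trustworthiness  t-SNE: {t_tsne:.3f}   PCA: {t_pca:.3f}")
```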
t-Distributed Stochastic Neighbor Embedding (t-SNE) has proven to be a popular approach for the visualization of multidimensional data, with successful applications in a wide range of domains. Despite their usefulness, t-SNE projections can be hard to interpret or even misleading, which hurts the trustworthiness of the results. Understanding the details of t-SNE itself and the reasons behind specific patterns in its output may be a daunting task, especially for non-experts in dimensionality reduction. In this article, we present t-viSNE, an interactive tool for the visual exploration of t-SNE projections that enables analysts to inspect different aspects of their accuracy and meaning, such as the effects of hyper-parameters, distance and neighborhood preservation, densities and costs of specific neighborhoods, and the correlations between dimensions and visual patterns. We propose a coherent, accessible, and well-integrated collection of different views for the visualization of t-SNE projections. The applicability and usability of t-viSNE are demonstrated through hypothetical usage scenarios with real data sets. Finally, we present the results of a user study in which the tool's effectiveness was evaluated. By bringing to light information that would normally be lost after running t-SNE, we hope to support analysts in using t-SNE and making its results better understandable.
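t-viSNE is an interactive tool, but the kind of per-point diagnostic it exposes can be approximated directly. The sketch below computes, for each point, the fraction of its high-dimensional k-nearest neighbors that are preserved among its k-nearest neighbors in the t-SNE embedding — a simple per-point quality score one could color a projection by. This is an independent illustration of neighborhood-preservation diagnostics, not t-viSNE's implementation.

```python
# Per-point neighborhood preservation of a t-SNE projection.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

X, _ = load_iris(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

k = 10
# k-nearest neighbors in high-dimensional space and in the embedding
# (drop column 0, which is each point itself)
nn_hd = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(
    X, return_distance=False)[:, 1:]
nn_ld = NearestNeighbors(n_neighbors=k + 1).fit(emb).kneighbors(
    emb, return_distance=False)[:, 1:]

# Fraction of high-D neighbors retained in the embedding, per point
preserved = np.array([len(set(a) & set(b)) / k for a, b in zip(nn_hd, nn_ld)])
print(f"mean neighborhood preservation: {preserved.mean():.3f}")
```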
Coronaviruses infect many animals, including humans, due to interspecies transmission. Three of the known human coronaviruses, MERS-CoV, SARS-CoV-1, and SARS-CoV-2 (the pathogen of the COVID-19 pandemic), cause severe disease. Improved methods to predict the host specificity of coronaviruses will be valuable for identifying and controlling future outbreaks. The coronavirus spike (S) protein plays a key role in host specificity by attaching the virus to receptors on the cell membrane. We analyzed 1238 spike sequences for their host specificity. Spike sequences readily segregate in t-SNE embeddings into clusters of similar hosts and/or virus species. Machine learning with SVM, logistic regression, decision tree, and random forest classifiers gave high average accuracies, F1 scores, sensitivities, and specificities of 0.95–0.99. Importantly, sites identified by the decision tree correspond to protein regions of known biological importance. These results demonstrate that spike sequences alone can be used to predict host specificity.
•Coronavirus spike protein sequences can predict host specificity.
•Machine learning had accuracies of >0.98 for classification of human vs non-human hosts.
•Decision Tree classifier identified spike regions with important biological roles.
•Clustering with t-SNE embeddings correctly segregated sequences by virus genus and host.
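The approach above — classify protein sequences by host using standard machine learning — can be sketched with k-mer count features, a common way to vectorize sequences (the abstract does not state the authors' exact featurization). The toy sequences, the motif, and the labels below are synthetic stand-ins; the real 1238-sequence dataset is not reproduced here.

```python
# Sketch: k-mer featurization of amino-acid sequences + decision tree classifier.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
aa = list("ACDEFGHIKLMNPQRSTVWY")

def toy_seq(has_motif):
    # Synthetic 200-residue sequence; one class carries a hypothetical motif
    s = rng.choice(aa, size=200).tolist()
    if has_motif:
        s[50:55] = list("RRARS")  # illustrative host-associated motif
    return "".join(s)

seqs = [toy_seq(i % 2 == 0) for i in range(100)]
hosts = np.array([i % 2 for i in range(100)])  # 0 = one host class, 1 = the other

# 3-mer count features over the sequences
vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vec.fit_transform(seqs)

acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, hosts, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")
```

A decision tree on such features is interpretable: the split features are specific k-mers, which is analogous to how the paper's decision tree surfaces biologically important spike regions.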
Novel non-parametric dimensionality reduction techniques such as t-distributed stochastic neighbor embedding (t-SNE) enable powerful and flexible visualization of high-dimensional data. One drawback of non-parametric techniques is their lack of an explicit out-of-sample extension. In this contribution, we propose an efficient extension of t-SNE to a parametric framework, kernel t-SNE, which preserves the flexibility of basic t-SNE but enables explicit out-of-sample extensions. We test the ability of kernel t-SNE in comparison to standard t-SNE on benchmark data sets, in particular addressing the generalization ability of the mapping to novel data. In the context of large data sets, this procedure enables us to train a mapping for a fixed-size subset only, mapping all remaining data afterwards in linear time. We demonstrate that this technique yields satisfactory results also for large data sets, provided the information missing due to the small size of the subset is accounted for by auxiliary information such as class labels, which can be integrated into kernel t-SNE based on the Fisher information.
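The kernel t-SNE idea — learn a parametric kernel map from inputs to t-SNE coordinates so that unseen points can be projected without re-running t-SNE — can be sketched as follows. RBF kernel ridge regression is used here as a simplification of the paper's normalized-kernel mapping, so this is an illustration of the out-of-sample principle, not the authors' exact formulation.

```python
# Sketch: out-of-sample extension for t-SNE via a learned kernel mapping.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

X, _ = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, train_size=500, random_state=0)

# 1. Non-parametric t-SNE on the fixed-size training subset only
emb_train = TSNE(n_components=2, random_state=0).fit_transform(X_train)

# 2. Parametric kernel map x -> y, trained to reproduce the t-SNE layout
mapper = KernelRidge(kernel="rbf", gamma=1e-3, alpha=1.0)
mapper.fit(X_train, emb_train)

# 3. Out-of-sample extension: project unseen points in linear time
emb_test = mapper.predict(X_test)
print(emb_train.shape, emb_test.shape)
```

Once the mapper is fit, projecting each new point costs only one kernel evaluation per training sample, which is what makes the subset-then-map strategy practical for large data sets.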