Advances in single cell genomics provide a way of routinely generating transcriptomics data at the single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical statistical analysis tools have been routinely applied, but there is considerable scope for the development of novel statistical approaches that are better adapted to the challenges of inferring cellular hierarchies.
We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state hierarchy where each cluster branch is associated with a principal component of variation that can be used to differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more consistent clustering structures when compared to broad and detailed cell type labels.
Our novel integration of principal components analysis and hierarchical clustering establishes a connection between the representation of the expression data and the number of cell types that can be discovered. In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising putative cell states. Our methodology is complementary to other single cell clustering techniques and adds to a growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.
The impact of urban morphology on greenhouse gas emission is one of the key issues in global climate change. Since the urban form is directly related to the spatial distribution of urban land use, it is necessary to investigate the relation between carbon emission and different land use categories. In this paper, the city of Eindhoven (230,000 inhabitants) was used as a case study. According to the main road network, the entire city is divided into 6754 irregular patterns. Agglomerative cluster analysis was conducted to classify the patterns into 14 valid land use categories based on their land use function and land cover composition, namely: agriculture, transport, retail trade, green space (with 3 sub-categories), residential (with 7 sub-categories), and others. The random forest algorithm was applied to select the significant features and to measure the relation between land use and carbon emission. The results show the importance of various landscape metrics for the carbon emission in each land use category. The most significant landscape metric is selected to describe the impact of spatial attributes on carbon emission. The outcomes show the carbon emission distribution of each land use category in the city. The retail trade and residential land use categories contribute a large proportion of carbon emission, and terrace houses produce more carbon emission than other residential building categories. The combination of mid-rise buildings and low-rise buildings is more likely to produce more carbon emission. The assessment results can provide important support for low carbon city spatial planning.
•Statistics of landscape metrics shows high reliability on land use classification.
•Random forest and hierarchical clustering improve classification performance.
•CO2 emission in the residential area is significantly influenced by building layout.
•Knowledge of the relation of land use and CO2 emission supports fine scale plan.
The number of partitions identified in a cluster analysis is traditionally a critical point of the procedure. There are many solutions available in the literature that researchers can exploit to guide how they determine the number of clusters. However, when a statistical analysis requires repeated cluster analyses, such as when tracking the changing composition of clusters over time, an automated approach can be beneficial. We propose a method to automatically cut dendrograms generated by a hierarchical clustering technique using a novel algorithm called Model-Based Recursive Partitioning. As a case study, the method is applied to dynamically analyze the interdependencies between industry sectors during the pandemic period.
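To make the idea of cutting a dendrogram automatically concrete, here is a minimal sketch. It does not implement the Model-Based Recursive Partitioning algorithm named in the abstract, whose details are not given here. Instead it uses a much simpler stand-in heuristic: cut just below the largest jump between consecutive merge heights. The function names `single_linkage_merges` and `auto_cut` and the toy data are hypothetical.

```python
from itertools import combinations


def single_linkage_merges(points):
    """Agglomerate 1-D points with single linkage.

    Returns the merge heights and the partition after each step
    (partitions[0] is the all-singletons partition).
    """
    clusters = [[p] for p in points]
    heights, partitions = [], [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        (i, j), d = min(
            (((a, b), min(abs(x - y) for x in clusters[a] for y in clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        heights.append(d)
        partitions.append([c[:] for c in clusters])
    return heights, partitions


def auto_cut(points):
    """Stand-in heuristic: cut below the largest gap in merge heights."""
    heights, partitions = single_linkage_merges(points)
    if len(heights) < 2:
        return partitions[-1]
    gaps = [heights[k + 1] - heights[k] for k in range(len(heights) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    # partitions[k + 1] is the state after merge k, just before the big jump.
    return partitions[k + 1]


data = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]  # toy values, not from the paper
clusters = sorted(sorted(c) for c in auto_cut(data))
```

Because no cluster count is passed in, the same call could be repeated on each time window of a longitudinal analysis, which is the use case the abstract motivates.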
Splitting Methods for Convex Clustering
Chi, Eric C.; Lange, Kenneth
Journal of Computational and Graphical Statistics, 10/2015, Volume 24, Issue 4
Journal Article, peer reviewed, open access
Clustering is a fundamental problem in many scientific applications. Standard methods such as k-means, Gaussian mixture models, and hierarchical clustering, however, are beset by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations of k-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique global minimizer. In this work, we present two splitting methods for solving the convex clustering problem. The first is an instance of the alternating direction method of multipliers (ADMM); the second is an instance of the alternating minimization algorithm (AMA). In contrast to previously considered algorithms, our ADMM and AMA formulations provide simple and unified frameworks for solving the convex clustering problem under the previously studied norms and open the door to potentially novel norms. We demonstrate the performance of our algorithm on both simulated and real data examples. While the differences between the two algorithms appear to be minor on the surface, complexity analysis and numerical experiments show AMA to be significantly more efficient. This article has supplementary materials available online.
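For orientation, the convex clustering problem referred to above is usually written as the following objective. The notation is the common one for this problem and is an assumption here, since the abstract itself does not fix symbols: $x_i$ are the $n$ data points, $u_i$ is the centroid assigned to point $i$, $w_{ij} \ge 0$ are pairwise weights, and $\gamma \ge 0$ is the regularization parameter.

```latex
\min_{u_1,\dots,u_n}\;
  \frac{1}{2}\sum_{i=1}^{n}\lVert x_i - u_i\rVert_2^2
  \;+\; \gamma \sum_{i<j} w_{ij}\,\lVert u_i - u_j\rVert
```

The first term keeps each centroid close to its point, while the penalty shrinks centroids toward one another; points whose centroids coincide at a given $\gamma$ are assigned to the same cluster. The choice of norm in the penalty is what the abstract means by "previously studied norms".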
•Particular conditions guarantee separability in low dimensional representations.
•One-dimensional representation is sufficient for constructing splitting boundaries.
•Divisive Hierarchical Clustering appropriate for non-linearly separable clusters.
We introduce an approach to divisive hierarchical clustering that is capable of identifying clusters in nonlinear manifolds. This approach uses the isometric mapping (Isomap) to recursively embed (subsets of) the data in one dimension, and then performs a binary partition designed to avoid the splitting of clusters. We provide a theoretical analysis of the conditions under which contiguous and high-density clusters in the original space are guaranteed to be separable in the one-dimensional embedding. To the best of our knowledge there is little prior work that studies this problem. Extensive experiments on simulated and real data sets show that hierarchical divisive clustering algorithms derived from this approach are effective.
•Overstory, midstory, understory/regeneration and groundcover data were collected in over 37,000 nested plots from 73 Florida state parks and 15 ecoregions.
•Most communities that were supposed to have pine in the overstory did.
•Midstory was primarily dominated by non-pines; mainly deciduous hardwoods.
•Pine seedlings and saplings were scarce.
•Overstory hierarchical clustering grouped some of the similar community types within ecoregions in the same group/subgroup but midstory and regeneration clustering produced mixed results.
Plot based forest/vegetation data is used to establish current conditions and inform natural resources management decisions. The Florida Department of Environmental Protection – Division of Recreation and Parks (DRP) manages 175 state parks. As part of its mission, DRP strives to restore landscapes and natural communities by reintroducing dynamic natural processes such as fire. DRP also uses other methods of active management to achieve desired future conditions (DFCs) in multiple communities including those characterized by longleaf pine (Pinus palustris Mill.) and native groundcover (GC) species. To meet the challenge of managing natural resources across the State, DRP initiated an objective and repeatable forest/vegetation inventory system. The primary objective of this paper was to analyze and summarize the resultant data and compare current community type (ComType) vegetation conditions. Current conditions were quantified using nested plots distributed within sample areas across 73 state parks and 15 ecoregions via the line-plot method. Over 37,000 plots were inventoried; 36% were in pine flatwoods. Ten actively managed and dominant ComTypes were central to this paper and when aggregated by ecoregion, 76 ComType by ecoregion (CTER) groups were examined. Descriptive statistics for various measurements, e.g., diameter at breast height, height, and density, were calculated for overstory, midstory and understory (vegetation layer) separately per CTER. Mean importance value index (IVI) was calculated by species or species-group per vegetation layer and CTER. Community classification was conducted via hierarchical clustering of CTERs per vegetation layer using mean IVI scores. Across CTERs, pine overstory and midstory abundance/stocking levels were generally low, non-pine overstory and midstory stocking levels were high, pine regeneration was sparse, and GC was dominated by leaf litter, pine straw, and non-pine woody seedlings. 
Results strongly suggest that some ComTypes were similar within certain ecoregions. Various pine species are considered dominant or one of the prominent overstory species for eight of the ten ComTypes. The ability to manage upland pine ComTypes using natural regeneration systems will become increasingly challenging in the near- to medium-term given relatively low overstory pine stocking for most CTERs, and the virtual lack of young pines within midstory and understory layers. Further investigations could help identify potential causal agents concerning low young pine stocking levels, e.g., timing and frequency of prescribed fires, frequency of logging, and/or increased competition from high densities of midstory non-pines and appropriate natural regeneration systems, e.g., seed tree, shelterwood, or group selection, or underplanting longleaf pine, that would accelerate understory pine recruitment.
•We introduce a fast initialisation algorithm for hierarchical clustering.
•It significantly reduces the number of iterations in the Ward clustering method.
•We also introduce a variant of Ward more capable of dealing with noise in data sets.
•We carry out several experiments with different noise models to demonstrate it.
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to start from this partition rather than from a trivial partition composed solely of singletons.
Our second contribution is an extension of the Ward and Wardp algorithms to the situation where the feature weight exponent can differ from the exponent of the Minkowski distance. This new method, called A-Wardpβ, is able to generate a much wider variety of clustering solutions. We also demonstrate that its parameters can be estimated reasonably well by using a cluster validity index.
We perform numerous experiments using data sets with two types of noise: insertion of noise features, and blurring of within-cluster values of some features. These experiments allow us to conclude that: (i) our anomalous pattern initialisation method does indeed reduce the time a hierarchical clustering algorithm takes to complete, without negatively impacting its cluster recovery ability; (ii) A-Wardpβ provides better cluster recovery than both Ward and Wardp.
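The benefit of starting from a non-trivial partition can be illustrated with a minimal sketch of Ward-style merging. This is an illustration under stated assumptions, not the authors' A-Ward implementation: the initial partition below is hand-made rather than produced by the anomalous pattern method, and the merge cost is the standard Ward criterion with squared Euclidean distance, not the Minkowski or feature-weighted variants. The names `ward_cost` and `ward_merge` are hypothetical.

```python
def centroid(cluster):
    """Component-wise mean of a list of equal-length tuples."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_merge(partition, n_clusters):
    """Merge the cheapest pair repeatedly until n_clusters remain.

    Starting from a partition with M clusters needs only M - n_clusters
    merges, instead of N - n_clusters when starting from N singletons.
    """
    clusters = [list(c) for c in partition]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hand-made initial partition of six toy points into four clusters (assumed data).
initial = [[(0.0, 0.0), (0.1, 0.0)], [(0.0, 0.2)], [(5.0, 5.0)], [(5.1, 5.2), (4.9, 5.1)]]
final = ward_merge(initial, 2)
```

Here two merges are enough to reach two clusters, whereas starting from the six singletons would take four.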
•Hierarchical Cluster Analysis (HCA) is applied for isolation of coal bands.
•Multiple regression models are proposed for prediction of coal proximate parameters.
•The proximate results are validated with the laboratory data.
Coal core samples and well log data from five exploratory wells of Korba Coalfield (CF), India have been used for prediction of coal facies. The Indian non-coking coal lithofacies are generally classified by analyzing the variation of the geophysical log parameters or by defining the ranges of various proximate parameters (mainly ash % and moisture %) obtained from coal core samples. The objective is to classify each layer as coal, shaly coal or shale depending on the ash % and moisture % of the corresponding layer in the coaly horizon. Hierarchical Cluster Analysis (HCA) is applied to classify the non-coal horizons and bands of identified coal seams of each well in the study area based on geophysical log responses: natural gamma ray (NG), high resolution density (HRD) and single point resistance (SPR). Hierarchical clustering separates the zones in a particular coal seam from five wells using the nature of the curve. These zones/clusters are further identified as coal, shaly coal or shale in three wells using regression and a multilayer feed-forward neural network (MLFN). The log responses and the proximate parameters from coal core analysis of these isolated bands/zones in two wells are used for establishing linear regression and neural network models. The observations show a very satisfactory fit (R² = 0.84) between ash content and HRD and a poor fit (R² < 0.41) between moisture content and log responses. The MLFN model is based on the study of two wells, using NG, HRD and SPR log responses as inputs and coal proximate parameters, namely ash and moisture content, as outputs to classify the coal lithofacies. The bands within a coal seam are classified on the basis of the ash and moisture content during both training and validation of the model. These linear and MLFN models are used to determine the ash % and moisture % in the remaining three testing wells.
The MLFN-predicted results are closer to the laboratory-analyzed proximate parameters than the results obtained from regression modelling.
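The goodness-of-fit figures quoted above are coefficients of determination. As a minimal sketch of how such a value is obtained for a single-predictor linear model (for example ash content against a density log), the code below fits a least-squares line and computes R² as one minus the ratio of residual to total sum of squares. The numbers are invented toy values, not data from the study, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Toy values only (assumed for illustration): density log reading vs. ash %.
density = [1.3, 1.5, 1.7, 1.9, 2.1]
ash = [12.0, 24.0, 33.0, 47.0, 56.0]

slope, intercept = fit_line(density, ash)
r2 = r_squared(density, ash, slope, intercept)
```

An R² close to 1 means the line explains most of the variance in the response; a value below 0.41, as reported for moisture content, means that most of the variance is left unexplained by the log responses.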
Artemisia sieberi is widely distributed in the desert and semi-desert regions of Iran. We collected samples from different parts of Iran and proceeded to extract and analyse their essential oils using hydrodistillation and GC-MS, respectively. Among seventy-two compounds identified within the oils, the hydrocarbon and oxygenated monoterpenoids, trans-thujone (0.0–22.9%), cis-thujane (0.0–47.3%), 1,8-cineole (0.7–37.1%), camphor (0.0–46.4%), santolinyl acetate (0–33.8%) and cis-chrysanthenyl acetate (0.0–16.4%) and the sesquiterpenoid, davanone (0.0–59.6%), were reported as the predominant components across the 17 accessions. The above-mentioned GC-MS analytical results, in conjunction with chemometric calculations including Principal Component Analysis (PCA) and Hierarchical Clustering Analysis (HCA), suggested 6 chemical groups of A. sieberi collected from the Northern to the Southern parts of Iran. The chemical classification of the essential oils (EOs) was based on the sum of concentrations of terpenoids with distinct C-skeletons, not on individual constituents. These distinct groups include the species predominant in, I: thujane, II: davanone, III: davanone and bornane, IV: p-menthane, V: bornane and VI: irregular monoterpenoids. The trace or less distributed phytochemicals are also suggested to divide A. sieberi into two main groups of sesquiterpene and monoterpene producers.
•Seventeen essential oil samples of Artemisia sieberi from Iran have been analysed by GC-MS.
•Among seventy-two identified compounds mono- and sesquiterpenoids were the main components.
•Using PCA and HCA six chemotypes suggested for A. sieberi in different area of Iran.
•Trace and less distributed phytochemicals were used to classify plant's essential oils.
The self-organizing map (SOM) is emerging as an alternative to traditional clustering methods for the hydrochemical analysis of groundwater because it can visualize high-dimensional data. In this study, a combined method of the SOM and hierarchical clustering was applied to analyze the hydrochemical characteristics of groundwater in the phreatic aquifer of the Yinchuan basin, China. 154 groundwater samples classified by SOM were projected on 65 neurons and grouped into 6 clusters with hierarchical clustering. The results showed that there exist three principal types of groundwater in the study area, namely high HCO₃⁻ type (Cluster-1, 2, and 6), high SO₄²⁻ type (Cluster-3 and 4), and high Na⁺ type (Cluster-5). The Chadha diagram indicated that the phreatic water in the Yinchuan basin mainly belongs to the group in which alkaline earths exceed alkali metals (n = 107, 69%). Rock weathering and evaporation-crystallization are the predominant mechanisms in the hydrogeochemical evolution of phreatic groundwater. The present study suggests that the combined method of the SOM and hierarchical clustering provides a reliable approach for interpreting the hydrochemical characteristics of groundwater with high-dimensional data.
•Hydrochemistry of phreatic groundwater was assessed using the SOM and hierarchical clustering.
•Hydrochemical facies and evolution mechanism of groundwater were interpreted by Chadha diagram.
•Phreatic water in Yinchuan basin belongs to group of alkaline earths that exceed alkali metals.
•Groundwater evolution is dominated by rock weathering and evaporation-crystallization.