We analyze the drawbacks of DBSCAN and its variants and find that the grid technique, which is used in Fast-DBSCAN and ρ-approximate DBSCAN, is almost useless in high-dimensional data space. Because it ...usually yields considerable redundant distance computations. To tame these problems, two techniques are proposed: one uses an ϵ/2-norm ball to identify Inner Core Blocks, within which all points are core points, and is more efficient than the grid technique because it finds more core points at a time; the other is a fast approximate algorithm for judging whether two Inner Core Blocks are density-reachable from each other. In addition, a cover tree is used to accelerate the density computations. Based on these three techniques, an approximate approach, namely BLOCK-DBSCAN, is proposed for large-scale data; it runs in about O(n log n) expected time and obtains almost the same result as DBSCAN. BLOCK-DBSCAN has two versions: the L2 version works well for relatively high-dimensional data, and the L∞ version is suitable for high-dimensional data. Experimental results show that BLOCK-DBSCAN is promising and outperforms NQDBSCAN, ρ-approximate DBSCAN and AnyDBC.
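The Inner Core Block idea can be illustrated with a minimal sketch. It assumes Euclidean distance, reads the ball radius as ϵ/2, and counts a point as its own neighbor; the function names and the brute-force scan are illustrative only and are not the authors' implementation, which relies on a cover tree. If the ball of radius ϵ/2 around a point holds at least MinPts points, any two of them are at most ϵ apart by the triangle inequality, so every point in the ball is a core point.

```python
import math


def inner_core_block(points, center, eps, min_pts):
    """Return the indices of the points within eps/2 of points[center]
    if there are at least min_pts of them, otherwise None.

    Hypothetical helper: a brute-force scan stands in for a cover tree query.
    """
    block = [i for i, p in enumerate(points)
             if math.dist(points[center], p) <= eps / 2]
    return block if len(block) >= min_pts else None


def is_core(points, i, eps, min_pts):
    """Plain DBSCAN core test: at least min_pts points (self included) within eps."""
    return sum(1 for p in points if math.dist(points[i], p) <= eps) >= min_pts
```

One neighborhood query thus certifies a whole block of core points, whereas plain DBSCAN would run `is_core` once per point.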
KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data Chen, Yewang; Zhou, Lida; Pei, Songwen ...
IEEE Transactions on Systems, Man, and Cybernetics: Systems,
June 2021, Volume 51, Issue 6
Journal Article
Peer reviewed
Large-scale data clustering is essential to big data problems. However, no existing approach is "optimal" for big data because of high complexity, so it remains a great challenge. In ...this article, a simple but fast approximate DBSCAN, namely, KNN-BLOCK DBSCAN, is proposed based on two findings: 1) identifying whether a point is a core point is, in fact, a kNN problem and 2) a point has a density distribution similar to that of its neighbors, and neighboring points are highly likely to be of the same type (core point, border point, or noise). KNN-BLOCK DBSCAN uses a fast approximate kNN algorithm, namely, FLANN, to detect core-blocks (CBs), noncore-blocks, and noise-blocks, within which all points have the same type; a fast algorithm for merging CBs and assigning noncore points to the proper clusters is then used to speed up the clustering process. The experimental results show that KNN-BLOCK DBSCAN is an effective approximate DBSCAN algorithm with high accuracy, and outperforms other current variants of DBSCAN, including ρ-approximate DBSCAN and AnyDBC.
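The first finding can be sketched in a few lines: a point is a core point exactly when the distance to its MinPts-th nearest neighbor is at most ϵ. The sketch below assumes Euclidean distance and counts the point itself as its first neighbor; the helper names are hypothetical, and an exact brute-force search stands in for the approximate FLANN search named in the abstract.

```python
import heapq
import math


def kth_neighbor_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor,
    counting the point itself as the first neighbor (distance 0)."""
    dists = (math.dist(points[i], q) for q in points)
    return heapq.nsmallest(k, dists)[-1]


def is_core_point(points, i, eps, min_pts):
    """Core test phrased as a kNN problem (assumed convention: self included)."""
    if min_pts > len(points):
        return False  # not enough points for any neighborhood to qualify
    return kth_neighbor_distance(points, i, min_pts) <= eps
```

Phrasing the test this way means any kNN index, exact or approximate, can answer the core-point question with a single query per point.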
Centroid-based clustering approaches fail to recognize extremely complex patterns that are non-isotropic. We analyze the underlying causes and find some inherent flaws in these approaches, including ...Shape Loss, False Distances and False Peaks, which typically cause centroid-based approaches to fail when applied to complex patterns. As an alternative to current methods, we propose a hybrid decentralized approach named DCore, which is based on finding density cores instead of centroids, to overcome these flaws. The underlying idea is that we consider each cluster to have a shrunken density core that roughly retains the shape of the cluster. Each such core consists of a set of loosely connected local density peaks of higher density than their surroundings. Borders, edges and outliers are distributed around the outsides of these cores in a hierarchical structure. Experiments demonstrate that the promise of DCore lies in its power to recognize extremely complex patterns and its high performance in real applications, for example, image segmentation and face clustering, regardless of the dimensionality of the space in which the data are embedded.
We present two new neighbor query algorithms, a range query (RNN) and a nearest neighbor (NN) query, based on a revised k-d tree that uses two techniques. The first technique is proposed for ...decreasing unnecessary distance computations by checking whether the cell of a node is inside or outside the specified neighborhood of the query point, and the other reduces redundant node visits by saving the indices of descendant points. We also implement the proposed algorithms in Matlab and C. The Matlab version improves the original RNN and NN, which are based on the k-d tree; the C version improves the k-nearest neighbor query (kNN), which is based on the buffer k-d tree. Theoretical and experimental analyses show that the proposed algorithms significantly improve the original RNN, NN and kNN, respectively, in low dimensions. The tradeoff is that the additional space cost of the revised k-d tree is approximately O(αn log n).
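A rough sketch of how the two techniques might combine in a range query follows. It is not the authors' Matlab or C code: the `Node` class, the use of a tight bounding box as the node's cell, the median split, and the `leaf_size` parameter are all assumptions made for illustration. Every node keeps the indices of all its descendant points, so a cell that lies entirely inside the query ball is reported without visiting its subtree or computing any point distance.

```python
import math


class Node:
    """k-d tree node that stores the indices of all its descendant points."""

    def __init__(self, idx, points, depth=0, leaf_size=2):
        self.idx = idx  # descendant indices, kept at every level
        dim = len(points[0])
        # Tight axis-aligned bounding box of the node's points (assumed "cell").
        self.lo = [min(points[i][d] for i in idx) for d in range(dim)]
        self.hi = [max(points[i][d] for i in idx) for d in range(dim)]
        self.children = []
        if len(idx) > leaf_size:
            axis = depth % dim
            order = sorted(idx, key=lambda i: points[i][axis])
            mid = len(order) // 2
            self.children = [Node(order[:mid], points, depth + 1, leaf_size),
                             Node(order[mid:], points, depth + 1, leaf_size)]


def min_dist(q, lo, hi):
    """Smallest Euclidean distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))


def max_dist(q, lo, hi):
    """Largest Euclidean distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(q, lo, hi)))


def range_query(node, points, q, r, out):
    """Append to out the indices of all points within distance r of q."""
    if min_dist(q, node.lo, node.hi) > r:
        return                      # cell entirely outside: prune
    if max_dist(q, node.lo, node.hi) <= r:
        out.extend(node.idx)        # cell entirely inside: report stored indices
        return
    if not node.children:           # leaf that straddles the boundary
        out.extend(i for i in node.idx if math.dist(q, points[i]) <= r)
        return
    for child in node.children:
        range_query(child, points, q, r, out)
```

Storing index lists at every level costs extra memory on the order of n per tree level, which is in line with the O(αn log n) space overhead the abstract reports.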
This study aimed to find coronary artery disease (CAD) related apolipoprotein A1 (ApoA1) monoclonal antibody (mAb) and to evaluate the diagnostic value of the assay based on it.
Patients with CAD ...diagnosed by coronary angiography (disease group, n = 180) and healthy subjects (control group, n = 199) were recruited. Correlations between the methods and CAD were evaluated by Spearman's rank correlation coefficients. Receiver operating characteristic (ROC) curve analysis was used to evaluate the auxiliary diagnostic value of the methods for CAD. Odds ratios (ORs) of the test results in CAD were estimated using logistic regression analysis.
Measurements from an ApoA1 mAb were significantly positively correlated with CAD (r = 0.243, P < 0.01), whereas measurements from the ApoA1 polyclonal antibody (pAb) were negatively correlated with CAD (r = -0.341, P < 0.001). The areas under the ROC curve of the ApoA1 mAb and pAb measurements were 0.704 and 0.563, respectively, in patients with normal HDL-C levels. ApoA1 values from the mAb assay had a significant positive impact on CAD risk.
An ApoA1 mAb-based assay can distinguish a high-density lipoprotein (HDL) subclass positively related to CAD, which can be used to improve and reappraise CAD risk assessment.
We present a fast range search algorithm that greatly reduces unnecessary distance computations by pruning redundant ones. Theoretical and experimental analyses ...have shown that the proposed algorithm significantly improves the original k-D tree based algorithm, which runs in O(log(n)) time when either the dimension is low or the search range is small. When the search range is large enough or does not intersect the data space, the proposed algorithm runs in O(1) time. The tradeoff is that it incurs extra distance computations to find the maximum or minimum distance from a point to a hyper-rectangle.
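The abstract does not spell out the two point-to-rectangle distances. For Euclidean distance and an axis-aligned hyper-rectangle, the standard expressions are given below; the symbols are assumptions introduced here, with q the query point, R = [l_1, u_1] × ... × [l_d, u_d] the hyper-rectangle, and r the search radius.

```latex
d_{\min}(q, R) = \sqrt{\sum_{i=1}^{d} \max(l_i - q_i,\; 0,\; q_i - u_i)^2},
\qquad
d_{\max}(q, R) = \sqrt{\sum_{i=1}^{d} \max(|q_i - l_i|,\; |q_i - u_i|)^2}
```

If d_max(q, R) ≤ r, every point in R lies within the range and can be reported without further distance computations; if d_min(q, R) > r, R can be skipped entirely. Each bound takes O(d) time to evaluate.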
Although B lymphocytes are widely known to participate in the immune response, the conclusive roles of B lymphocyte subsets in the antitumor immune response have not yet been determined. Single-cell ...data from GEO datasets were first analyzed, and then a B cell flow cytometry panel was used to analyze the peripheral blood of 89 HCC patients and 33 healthy controls recruited to participate in our research. Patients with HCC had a higher frequency of B10 cells and a lower percentage of MZB cells than healthy controls, and the changes in B cell subsets might occur at an early stage. Moreover, the frequency of B10 cells decreased after surgery. The elevated IL-10 level in HCC serum, which was positively correlated with B10 cells, may be a new biomarker for HCC identification. For the first time, our results suggest that altered B cell subsets are associated with the development and prognosis of HCC. The increased B10 cell percentage and IL-10 level in HCC patients suggest that they might augment the development of liver tumors. Hence, B cell subsets and related cytokines may have predictive value in HCC patients and could be potential targets for immunotherapy in HCC.
A fast exact nearest neighbor search algorithm over large-scale data is proposed based on a semi-convex hull tree, where each node represents a semi-convex hull made of a set of hyperplanes. ...When performing the task of nearest neighbor queries, unnecessary distance computations can be greatly reduced by quadratic programming. GPUs are also used to accelerate the query process. Experiments conducted on both Intel(R) HD Graphics 4400 and Nvidia Geforce GTX1050 TI, as well as theoretical analysis, show that the proposed algorithm yields significant improvements and outperforms current k-d tree based nearest neighbor query algorithms and others.
Dear Editor, Plants have evolved great plasticity to adapt to external environments. A huge number of structurally diverse metabolites generated through the glycosylation process is one potential ...mechanism that contributes to this plasticity (Bowles et al., 2005).