The data stream problem has been studied extensively in recent years because of the great ease of collecting stream data. The nature of stream data makes it essential to use algorithms that require only one pass over the data. Recently, single-scan stream analysis methods have been proposed in this context. However, a lot of stream data is high-dimensional in nature. High-dimensional data is inherently more complex to cluster, classify, and search for similarity. Recent research discusses methods for projected clustering over high-dimensional data sets. These methods are, however, difficult to generalize to data streams because of their complexity and the large volume of the data streams. In this paper, we propose a new, high-dimensional, projected data stream clustering method, called HPStream. The method incorporates a fading cluster structure and the projection-based methodology. It is incrementally updatable, is highly scalable in both the number of dimensions and the size of the data streams, and achieves better clustering quality than previous stream clustering methods. Our performance study with both real and synthetic data sets demonstrates the efficiency and effectiveness of our proposed framework and implementation methods.
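The fading cluster structure mentioned above can be sketched minimally as follows. This is an illustration, not the paper's implementation: the class name, the decay form f(t) = 2^(-lambda*t), and the per-dimension statistics are assumptions chosen to show how faded counts and sums support one-pass, incrementally updatable clustering.

```python
import math


class FadingClusterStructure:
    """Sketch of a fading cluster structure: per-dimension weighted sums
    whose contributions decay exponentially with age, so stale stream
    points gradually lose influence on the cluster statistics."""

    def __init__(self, dim, decay=0.5):
        self.dim = dim
        self.decay = decay          # lambda in f(t) = 2^(-lambda * t), assumed form
        self.last_t = 0.0
        self.weight = 0.0           # faded count of absorbed points
        self.sum1 = [0.0] * dim     # faded linear sums per dimension
        self.sum2 = [0.0] * dim     # faded squared sums (for a radius estimate)

    def _fade(self, t):
        # Decay all statistics from the last update time to the current time t.
        factor = 2.0 ** (-self.decay * (t - self.last_t))
        self.weight *= factor
        self.sum1 = [s * factor for s in self.sum1]
        self.sum2 = [s * factor for s in self.sum2]
        self.last_t = t

    def add(self, point, t):
        # Single-pass incremental update: fade old statistics, absorb new point.
        self._fade(t)
        self.weight += 1.0
        for i, x in enumerate(point):
            self.sum1[i] += x
            self.sum2[i] += x * x

    def centroid(self):
        # Recent points dominate because older contributions are faded.
        return [s / self.weight for s in self.sum1]
```

With decay 1.0, a point seen one time unit ago counts only half as much as a fresh one, so the centroid is pulled toward the newer point.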
Thermodynamic Analysis of Nylon Nucleic Acids Liu, Yu; Wang, Risheng; Ding, Liang ...
Chembiochem: a European journal of chemical biology,
July 2, 2008, Volume:
9, Issue:
10
Journal Article
Peer-reviewed
Open access
The stability and structure of nylon nucleic acid duplexes with complementary DNA and RNA strands were examined. Thermal denaturing studies of a series of oligonucleotides that contained nylon nucleic acids (1-5 amide linkages) revealed that the amide linkage significantly enhanced the binding affinity of nylon nucleic acids towards both complementary DNA (up to a 26 °C increase in the thermal transition temperature (Tm) for five linkages) and RNA (around a 15 °C increase in Tm for five linkages) compared with non-amide-linked precursor strands. For both DNA and RNA complements, increasing derivatization lowered the melting temperatures of uncoupled molecules relative to unmodified strands; by contrast, increasing lengths of coupled copolymer raised Tm from below to slightly above that of unmodified strands. Thermodynamic data extracted from melting curves and CD spectra of nylon nucleic acid duplexes were consistent with a loss of stability due to incorporation of pendent groups at the 2'-position of ribose and a recovery of stability upon linkage of the side chains.
Most existing large-scale multiview clustering algorithms attempt to capture the data distribution in multiple views by selecting view-wise anchor representations beforehand with k-means, or by direct matrix factorization on the original observations. Despite impressive performance, few of them have paid attention to the semantic correlations between anchor bases and cluster centroids, or even the underlying relations between clusters and data samples. In view of this, we propose a Concept Factorization based Multiview Clustering for Large-scale Data (CFMC) method with nearly linear complexity. Anchor basis learning, coefficient expression with clear semantic cues, and partitioning are integrated in this unified model. Meanwhile, explicit connections among multiview data, anchor bases, and clusters are modeled via coefficient representations with semantic meanings. A four-step alternating minimization algorithm is designed to handle the optimization problem, which is proven to have linear time complexity w.r.t. the sample size. Extensive experiments conducted on several challenging large-scale datasets confirm the superiority of the method compared with state-of-the-art methods.
Next-basket recommendation methods focus on inferring the next basket from the corresponding basket sequence. Although many methods have been developed for the task, they usually suffer from data sparsity. The number of interactions between entities is relatively small compared to their huge bases, so it is crucial to mine as much hidden information as possible from the limited historical interactions for prediction. However, existing methods mainly treat next-basket recommendation as a single-view sequential prediction problem. This leads to inadequate mining of the information hidden in multiple views and neglects other patterns in the historical interactions, making it difficult to learn high-quality representations and limiting the recommendation effect. To alleviate these issues, we propose a novel method named HapCL for next-basket recommendation, which mines information from multiple views and patterns with the help of polar contrastive learning. A hierarchical module is designed to mine multiple patterns of historical interactions from different views at two levels. To mine self-supervised signals, we design a polar contrastive learning module with a novel graph-based augmentation approach. Experiments on three real-world datasets validate the effectiveness of HapCL.
Group Reassignment for Dynamic Edge Partitioning Li, He; Yuan, Hang; Huang, Jianbin ...
IEEE transactions on parallel and distributed systems,
October 1, 2021, Volume:
32, Issue:
10
Journal Article
Peer-reviewed
Open access
Graph partitioning is a mandatory step in large-scale distributed graph processing. When partitioning real-world power-law graphs, edge partitioning algorithms perform better than traditional vertex partitioning algorithms, because they can cut a single vertex into multiple replicas to apportion the computation. Many advanced edge partitioning methods are designed for partitioning a static graph from scratch. However, real-world graph structures change continuously, which degrades partition quality and affects the performance of graph applications. Some studies are devoted to offline repartitioning or batch incremental partitioning, but how to deal with dynamics in real time is still worthy of in-depth study. In this article, we discuss the impact of dynamic change on partitioning and find that both insertions and deletions lead to locally suboptimal partitioning, which is the reason for the degradation of partition quality. As a solution, a dynamic edge partitioning algorithm is proposed to partition dynamics in real time. Specifically, we handle dynamics with a distributed stream and improve partition quality by reassigning some closely connected edges. Experiments show that the method is robust to the initial partition quality, the scale and type of dynamics, and the distributed scale. Compared with the state-of-the-art dynamic partitioner, it reduces vertex-cuts by 29.5 percent. Compared with repartitioning algorithms, it saves 91.0 percent of the partitioning time. Applied to graph tasks, it reduces the increase in communication cost and in total task time by 41.5 and 71.4 percent, respectively.
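The vertex-cut quality that edge partitioning minimizes, and that reassignment improves, can be made concrete with a small sketch. The function name and the dictionary-based edge-assignment representation are illustrative assumptions, not the paper's code:

```python
from collections import defaultdict


def replication_factor(edge_assignment):
    """Given {edge: partition} where each edge is a (u, v) pair, return the
    average number of partitions in which each vertex is replicated -- the
    standard quality metric for edge (vertex-cut) partitioning; lower is
    better, and 1.0 means no vertex is cut at all."""
    replicas = defaultdict(set)
    for (u, v), part in edge_assignment.items():
        # A vertex is replicated in every partition that owns one of its edges.
        replicas[u].add(part)
        replicas[v].add(part)
    return sum(len(parts) for parts in replicas.values()) / len(replicas)
```

For example, assigning edges (1,2) and (1,3) to partition 0 but (2,3) to partition 1 replicates vertices 2 and 3 in two partitions each, giving a replication factor of (1 + 2 + 2) / 3 ≈ 1.67; moving (2,3) to partition 0 would bring it down to 1.0.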
Approximate Query Processing in Cube Streams Ming-Jyh Hsieh; Ming-Syan Chen; Yu, P.S.
IEEE transactions on knowledge and data engineering,
11/2007, Volume:
19, Issue:
11
Journal Article
Peer-reviewed
Data cubes have become important components in most data warehouse systems and decision support systems. In such systems, users usually pose very complex queries to the online analytical processing (OLAP) system, and systems usually have to deal with a huge amount of data because of the large dimensionality of the data sets; thus, approximate query processing has emerged as a viable solution. Specifically, cube stream applications handle multidimensional data sets in a continuous manner, in contrast to traditional cube approximation. Such an application collects data events for cube streams online, generates snapshots with limited resources, and keeps the approximated information in a synopsis memory for further analysis. Compared to OLAP applications, cube stream applications are subject to many more resource constraints on both processing time and memory, and cannot be handled by existing methods due to the limited resources. In this paper, we propose the DAWA algorithm, a hybrid of the discrete cosine transform (DCT) for data and the discrete wavelet transform (DWT), to approximate cube streams. Our algorithm combines the advantages of the high compression rate of DWT and the low memory cost of DCT. Consequently, DAWA requires a much smaller working buffer and outperforms both DWT-based and DCT-based methods in execution efficiency. It is also shown that DAWA provides a good solution for approximate query processing of cube streams with a small working buffer and a short execution time. The optimality of the DAWA algorithm is theoretically proved and empirically demonstrated by our experiments.
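To illustrate the transform-based approximation underlying such synopses, here is a minimal sketch of DCT-style compression on a single vector of cube-cell values: transform, keep only the largest coefficients as the synopsis, and reconstruct an approximation. The function names and the keep-top-k truncation are illustrative assumptions, not the DAWA algorithm itself.

```python
import math


def dct(x):
    # Type-II DCT: concentrates a smooth signal's energy in few coefficients.
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]


def idct(coeffs):
    # Inverse of dct() above (type-III DCT with the matching scaling).
    n = len(coeffs)
    return [coeffs[0] / n + 2.0 / n *
            sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                for k in range(1, n))
            for i in range(n)]


def compress(x, keep):
    # Keep only the `keep` largest-magnitude coefficients (the synopsis),
    # zero the rest, and reconstruct an approximation of the cell values.
    c = dct(x)
    top = set(sorted(range(len(c)), key=lambda k: abs(c[k]), reverse=True)[:keep])
    sparse = [c[k] if k in top else 0.0 for k in range(len(c))]
    return idct(sparse)
```

Keeping all coefficients reproduces the input exactly; for a smooth run of cell values, even two coefficients already give a close approximation, which is the energy-compaction property the hybrid scheme exploits.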
Mining Surprising Periodic Patterns Yang, Jiong; Wang, Wei; Yu, Philip S.
Data mining and knowledge discovery,
09/2004, Volume:
9, Issue:
2
Journal Article
Peer-reviewed
In this paper, we focus on mining surprising periodic patterns in a sequence of events. In many applications, e.g., computational biology, an infrequent pattern is still considered very significant if its actual occurrence frequency exceeds the prior expectation by a large margin. Traditional metrics, such as support, are not necessarily ideal for measuring this kind of surprising pattern, because they treat all patterns equally: every occurrence carries the same weight towards the assessment of a pattern's significance, regardless of its probability of occurrence. A more suitable measurement, information, is introduced to naturally value the degree of surprise of each occurrence of a pattern as a continuous and monotonically decreasing function of its probability of occurrence. This allows patterns with vastly different occurrence probabilities to be handled seamlessly. As the accumulated degree of surprise of all repetitions of a pattern, the concept of information gain is proposed to measure the overall degree of surprise of the pattern within a data sequence. The bounded information gain property is identified to tackle the predicament caused by the information gain measure's violation of the downward closure property, and in turn provides an efficient solution to this problem. Furthermore, the user can choose between specifying a minimum information gain threshold and specifying the number of surprising patterns wanted. Empirical tests demonstrate the efficiency and the usefulness of the proposed model.
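The information measure described above can be sketched for the simplest case of single-event patterns. The choice of -log2(p) as the information function and the frequency-based probability estimate are illustrative assumptions; the paper's full measure also accounts for periodicity, which this sketch omits.

```python
import math
from collections import Counter


def information(prob):
    """Degree of surprise of one occurrence: a continuous, monotonically
    decreasing function of the occurrence probability (here -log2 p)."""
    return -math.log2(prob)


def information_gain(sequence, event):
    """Accumulated surprise of all repetitions of a single-event pattern
    in the sequence -- rare events earn more surprise per occurrence, so
    an infrequent pattern can outscore a frequent one."""
    probs = {e: c / len(sequence) for e, c in Counter(sequence).items()}
    return sequence.count(event) * information(probs[event])
```

In the sequence "aaaaaaab", the single 'b' has probability 1/8 and contributes 3 bits of surprise, while the seven occurrences of 'a' together accumulate only about 1.35 bits, so the rare event scores higher despite its lower support, which is exactly the behavior support cannot capture.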