Data streams are potentially unbounded sequences of data objects, and clustering such data is an effective way of identifying their underlying patterns. Existing data stream clustering algorithms face two critical issues: 1) evaluating the relationship among data objects with individual landmark windows of fixed size and 2) passing useful knowledge from previous landmark windows to the current landmark window. Based on sparse representation techniques, this article proposes a two-stage sparse representation clustering (TSSRC) method. The novelty of the proposed TSSRC algorithm comes from evaluating the effective relationship among data objects in the landmark windows with an accurate number of clusters. First, the proposed algorithm evaluates the relationship among data objects using sparse representation techniques. The dictionary and sparse representations are iteratively updated by solving a convex optimization problem. Second, the proposed TSSRC algorithm presents a dictionary initialization strategy that seeks representative data objects by making full use of the sparse representation results. This efficiently passes previously learned knowledge to the current landmark window over time. Moreover, the convergence and sparse stability of TSSRC can be theoretically guaranteed in continuous landmark windows under certain conditions. Experimental results on benchmark datasets demonstrate the effectiveness and robustness of TSSRC.
Privacy-preserving trajectory stream publishing
Al-Hussaeni, Khalil; Fung, Benjamin C.M.; Cheung, William K.
Data & Knowledge Engineering, November 2014, Volume 94
Journal Article, Peer reviewed
Recent advances in mobile computing and sensory technology have made it possible to continuously update, monitor, and detect the latest location and status of moving individuals. Spatio-temporal data generated and collected on the fly are described as trajectory streams. This work is motivated by the concern that publishing individuals' trajectories on the fly may jeopardize their privacy. In this paper, we illustrate and formalize two types of privacy attacks against moving individuals. We devise a novel algorithm, called Incremental Trajectory Stream Anonymizer (ITSA), for incrementally anonymizing a sequence of sliding windows on a trajectory stream. The sliding windows are dynamically updated with joining and leaving individuals, using an efficient data structure that accommodates massive volumes of data. We conducted extensive experiments on simulated and real-life data sets to evaluate the performance of our method. Empirical results demonstrate that our method significantly lowers runtime compared to existing methods and scales efficiently when handling massive data sets. To the best of our knowledge, this is the first work to anonymize high-dimensional trajectory streams.
Influence maximization (IM), which selects a set of k seed users (a.k.a. a seed set) to maximize the influence spread over a social network, is a fundamental problem in a wide range of applications. However, most existing IM algorithms are static and location-unaware. They fail to provide high-quality seed sets efficiently when the social network evolves rapidly and IM queries are location-aware. In this article, we first define two IM queries, namely Stream Influence Maximization (SIM) and Location-aware SIM (LSIM), to track influential users over social streams. Technically, SIM adopts the sliding window model and maintains a seed set with the maximum influence value collectively over the most recent social actions. LSIM further considers social actions that are associated with geo-tags and identifies a seed set that maximizes the influence value in a query region over a location-aware social stream. Then, we propose the Sparse Influential Checkpoints (SIC) framework for efficient SIM query processing. SIC maintains a sequence of influential checkpoints over the sliding window, and each checkpoint maintains a partial solution for SIM in an append-only substream of social actions. Theoretically, SIC keeps a logarithmic number of checkpoints w.r.t. the size of the sliding window and always returns an approximate solution from one of the checkpoints for the SIM query at any time. Furthermore, we propose the Location-based SIC (LSIC) framework and its improved version LSIC+, both of which process LSIM queries by integrating the SIC framework with a Quadtree spatial index. LSIC can provide approximate solutions for both ad hoc and continuous LSIM queries in real time, while LSIC+ further improves the solution quality of LSIC. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the proposed frameworks against the state-of-the-art IM algorithms.
The IoT-enabled smart grid system provides smart meter data for electricity consumers to record their energy consumption behaviors, the typical features of which can be represented by the load patterns extracted from load data clustering. The changeability of consumption behaviors requires load pattern updates to achieve accurate consumer segmentation and effective demand response. In order to save training time and reduce computation scale, we propose a novel incremental clustering algorithm with probability strategy, ICluster-PS, which updates load patterns without re-clustering the overall load data. ICluster-PS first conducts new load pattern extraction based on the existing load patterns and new data. Then, it integrates the new load patterns with the existing ones. Finally, it optimizes the integrated load pattern sets by a further modification. Moreover, ICluster-PS can be performed continuously on newly arriving data due to parameter updating and generalization. Extensive experiments are conducted on a real-world dataset containing diverse consumer types in various districts. The experimental results are evaluated by both clustering validity indices and accuracy measures, which indicate that ICluster-PS outperforms other related incremental clustering algorithms. Additionally, according to further case studies on pattern evolution analysis, ICluster-PS is able to present any pattern drifts through its incremental clustering results.
The diversity of a voting committee is one of the key characteristics of ensemble systems. It determines the benefits that can be obtained through classifier fusion. There are many measures of diversity that can be used in classical decision-making systems which operate in stationary environments. A plethora of algorithms have also been proposed to ensure ensemble diversity. Bagging and boosting are a few of the most popular examples. Unfortunately, these measures and algorithms cannot be applied in systems that process streaming data. Not only must a different implementation be designed for processing fast moving samples in a stream, but the notion of diversity must also be redefined. In this paper it is proposed to assess diversity based on analysis of classifier reactions to changes in data streams. Therefore, two novel error trend diversity measures are introduced that compare the error trends of classifiers while processing subsequent samples. A practical application of these measures is also proposed in the form of a novel error trend diversity driven ensemble algorithm, where our measures are incorporated into the training procedure. The performance of the proposed algorithm is evaluated through a series of experiments and compared to several competing methods. The results demonstrate that our measures accurately evaluate diversity and that their application facilitates the creation of small and effective ensemble classifier systems.
• A new diversity measure is proposed for ensembles that classify data streams.
• The measure evaluates error trends of the classifiers in the ensemble.
• A new ensemble training algorithm is proposed with hybrid target function.
• It aims to reduce classification error and maximise ensemble diversity.
With the development of the avionics industry, traditional methods for evaluating combat equipment nodes struggle to meet requirements in complex combat systems. This paper presents a node importance evaluation method that is suitable for the modern avionics field and can serve as a reference in other combat fields. In order to make better use of the different features of each node and the different connections between nodes, we use the TOPSIS algorithm to model the characteristics of the node itself, and PageRank to measure the interdependence of all nodes. Therefore, a novel node contribution evaluation algorithm based on TOPSIS and PageRank is proposed in this paper. In addition, after evaluating node contribution, we found a functional relationship between the operational information entropy of the whole graph and the contribution of these nodes. On this basis, an information entropy evaluation algorithm for the overall combat map is further proposed. Through extensive experiments, we evaluate the reliability of our algorithm in terms of the nodes' destruction-resistant performance and information transfer efficiency. Compared with the traditional universal algorithm, our proposed algorithm shows more interpretable and robust results in the field of avionics.
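To make the combination of TOPSIS and PageRank more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how the two scores are combined, so the convex blend in `node_contribution`, the parameter `alpha`, the treatment of every feature as a benefit criterion, and all function names are assumptions made purely for illustration. The graph is assumed to be a dictionary in which every node appears as a key.

```python
import math
from typing import Dict, List


def topsis(matrix: List[List[float]], weights: List[float]) -> List[float]:
    """Closeness to the ideal solution; all criteria are treated as benefits (assumption)."""
    cols = len(weights)
    # Vector-normalise each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(cols)]
    v = [[weights[j] * row[j] / norms[j] for j in range(cols)] for row in matrix]
    best = [max(r[j] for r in v) for j in range(cols)]
    worst = [min(r[j] for r in v) for j in range(cols)]
    scores = []
    for r in v:
        d_best, d_worst = math.dist(r, best), math.dist(r, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iters: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not graph[u])
        nxt = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for w in graph[u]:
                nxt[w] += damping * rank[u] / len(graph[u])
        rank = nxt
    return rank


def node_contribution(features: Dict[str, List[float]], weights: List[float],
                      graph: Dict[str, List[str]], alpha: float = 0.5) -> Dict[str, float]:
    """Hypothetical blend: alpha * TOPSIS score + (1 - alpha) * scaled PageRank."""
    nodes = list(features)
    t = topsis([features[u] for u in nodes], weights)
    pr = pagerank(graph)
    pr_max = max(pr.values())
    return {u: alpha * t[i] + (1 - alpha) * pr[u] / pr_max for i, u in enumerate(nodes)}
```

Calling `node_contribution(features, weights, graph)` returns one score per node; a node scores highly only if it has both strong intrinsic features and many well-ranked neighbours pointing at it.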
Many daily applications, such as medical monitoring, click streams, internet records, and banking transactions, are generating massive amounts of data in the form of streams at ever higher speeds. In contrast to traditional static data, data streams have some inherent properties, such as infinite length, concept drift, multiple labels, and concept evolution. Among all data mining tasks, classification is one of the basic topics in data stream mining and has gained more and more attention among different research communities. Extreme Learning Machine (ELM) has drawn much interest in data classification due to its high efficiency, universal approximation capability, generalization ability, and simplicity, which have greatly inspired the development of many ELM-based algorithms and their applications during the past decades. In this paper, we provide a comprehensive review of ELM theoretical research and its variants in data stream classification, and categorize these algorithms from different perspectives. Firstly, we briefly introduce the basic principles of ELM and its characteristics. Secondly, we give an overview of different ELM variants that address the particular issues of data stream classification. Thirdly, we present an overview of different strategies to optimize the ELM, which have further improved its stability, accuracy, and generalization ability, and briefly introduce some practical applications of ELM in data stream classification. Finally, we conduct several groups of experiments to compare the performance of ELM-based models addressing the focused issues. The open issues and prospects of ELM models used for stream classification, which merit further study, are also discussed.
Data streams are unbounded, sequential data instances that are generated with high velocity. Classifying sequential data instances is a very challenging problem in machine learning with applications in network intrusion detection, financial markets and applications requiring real-time sensor-networks-based situation assessment. Data stream classification is concerned with the automatic labelling of unseen instances from the stream in real-time. For this, the classifier needs to adapt to concept drifts and can only have a single pass through the data if the stream is fast moving. This research paper presents work on a real-time pre-processing technique, in particular feature tracking. The feature tracking technique is designed to improve Data Stream Mining (DSM) classification algorithms by enabling and optimising real-time feature selection. The technique is based on tracking adaptive statistical summaries of the data and class label distributions, known as Micro-Clusters. Currently the technique is able to detect concept drifts and identify which features have been influential in the drift.
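As a rough illustration of the kind of statistical summary the abstract calls a Micro-Cluster, the Python sketch below keeps a count, a linear sum and a squared sum per feature, so that means and variances can be maintained in a single pass. The class `MicroCluster`, its methods, and the `feature_shift` score are hypothetical names chosen for this example; the paper's actual summaries and drift test may well differ.

```python
from typing import List


class MicroCluster:
    """Hypothetical one-pass summary: count, linear sum and squared sum per feature."""

    def __init__(self, dims: int) -> None:
        self.dims = dims
        self.n = 0
        self.ls = [0.0] * dims  # linear sum of each feature
        self.ss = [0.0] * dims  # sum of squares of each feature

    def add(self, x: List[float]) -> None:
        """Absorb one instance without storing it."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def mean(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def variance(self) -> List[float]:
        # E[x^2] - E[x]^2, clamped at zero to absorb rounding error.
        return [max(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2, 0.0)
                for i in range(self.dims)]


def feature_shift(old: MicroCluster, new: MicroCluster) -> List[float]:
    """Illustrative drift signal: absolute change of each feature's mean."""
    return [abs(a - b) for a, b in zip(old.mean(), new.mean())]
```

Comparing a summary built from older instances of a class with one built from recent instances gives a per-feature score; features with the largest shift are candidates for having driven a drift, under the assumptions stated above.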
A sum–product network (SPN) is a deep probabilistic representation that allows for exact and tractable inference. There has been a trend of online SPN structure learning from massive and continuous data streams. However, online structure learning of SPNs has been introduced only for the generative setting so far. In this paper, we present an online discriminative approach for SPNs for learning both the structure and parameters. The basic idea is to keep track of informative and representative examples to capture the trend of time-changing class distributions. Specifically, by estimating the goodness of model fitting of data points and dynamically maintaining a certain amount of informative examples over time, we generate new sub-SPNs in a recursive and top-down manner. Meanwhile, an outlier-robust margin-based log-likelihood loss is applied locally to each data point and the parameters of the SPN are updated continuously using most probable explanation (MPE) inference. This leads to a fast yet powerful optimization procedure and improved discrimination capability between the genuine class and rival classes. Empirical results show that the proposed approach achieves better prediction performance than the state-of-the-art online structure learner for SPNs, while promising order-of-magnitude speedup. Comparison with state-of-the-art stream classifiers further proves the superiority of our approach.
Abstract
The traditional models for mining frequent itemsets mainly focus on the frequency of the items listed in the respective dataset. However, market basket analysis and other domains generally prefer the utility obtained from items regardless of their frequencies in the transactions. One of the main measures of utility in these domains is profit. Therefore, it is important to extract items that generate more profit rather than items that occur more frequently in the dataset. Thus, mining high utility itemsets has emerged recently as a prominent research topic in the field of data mining. Many existing approaches have been proposed for mining high utility itemsets from static data. However, with recent advances in technology, streaming data has become a good source of data in many applications. Mining high utility itemsets over data streams is a more challenging task because of the uncertainty in data streams, processing time, and other factors. Although some works have been proposed for mining high utility itemsets over data streams, many of them require multiple database scans and long processing times. To address this, we propose a single-pass fast-search model in which we introduce a utility factor known as utility stream level for tracing the utility value of itemsets from data streams. The simulation study shows that the proposed model outperforms the contemporary method. The comparison has been performed based on metrics such as process-completion time and utilized search space.
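The difference between frequency and utility can be shown with a short Python sketch. It uses the standard textbook definition of itemset utility (purchased quantity times unit profit, summed over the transactions that contain the itemset) and maintains it in a single pass over a fixed-size window of recent transactions. It does not implement the paper's utility stream level, which the abstract does not define; the class `WindowedUtility`, the window model, and the `max_len` cap on itemset size are all assumptions for illustration.

```python
from collections import deque
from itertools import combinations
from typing import Deque, Dict, FrozenSet, Iterator, Tuple


class WindowedUtility:
    """Hypothetical single-pass utility counter over the most recent transactions."""

    def __init__(self, profit: Dict[str, int], window_size: int, max_len: int = 2) -> None:
        self.profit = profit            # unit profit per item
        self.window_size = window_size  # number of transactions kept
        self.max_len = max_len          # largest itemset size tracked
        self.window: Deque[Dict[str, int]] = deque()
        self.utility: Dict[FrozenSet[str], int] = {}

    def _contributions(self, tx: Dict[str, int]) -> Iterator[Tuple[FrozenSet[str], int]]:
        # Utility of an itemset in one transaction: sum of quantity * unit profit.
        for k in range(1, self.max_len + 1):
            for combo in combinations(sorted(tx), k):
                yield frozenset(combo), sum(tx[i] * self.profit[i] for i in combo)

    def add(self, tx: Dict[str, int]) -> None:
        """Process one transaction (item -> quantity) exactly once."""
        self.window.append(tx)
        for itemset, u in self._contributions(tx):
            self.utility[itemset] = self.utility.get(itemset, 0) + u
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            for itemset, u in self._contributions(expired):
                self.utility[itemset] -= u
                if self.utility[itemset] <= 0:
                    del self.utility[itemset]

    def high_utility(self, min_util: int) -> Dict[FrozenSet[str], int]:
        """Itemsets whose utility in the current window reaches the threshold."""
        return {s: u for s, u in self.utility.items() if u >= min_util}
```

In this sketch an item bought rarely but at a high unit profit can pass the threshold while a frequently bought low-profit item does not, which is the distinction the abstract draws. Enumerating all subsets up to `max_len` is exponential in general, so practical miners rely on pruning that is omitted here.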