• The MVRL method learns a fused sparse affinity matrix across multiple views.
• The MVRL method captures the global and local structures of data objects.
• The complementary information is explored by ...exploiting affinity matrices.
• The upper bound of computational cost is determined by closed-form solutions.
• The dynamic set transfers previously learned knowledge to the arriving data objects.
Data stream clustering provides valuable insights into the evolving patterns of long sequences of continuously generated data objects. Most existing clustering methods focus on single-view data streams. In this paper, we propose a multi-view representation learning (MVRL) method for multi-view clustering of data streams. We first introduce an integrated representation learning model to learn a fused sparse affinity matrix across multiple views for spectral clustering. Motivated by the optimization procedure of the integrated representation learning model, we propose three consecutive stages: collaborative representation, the construction of individual global affinity matrices using a mapping function, and the calculation of a fused sparse affinity matrix using Euclidean projection. These stages allow the effective capture of the global and local structures of high-dimensional data objects. Moreover, each stage has a closed-form solution, which determines the upper bound of the computational cost and memory consumption. We then employ the construction residuals of the collaborative representation to adaptively update a dynamic set, which is used to preserve the representative data objects. The dynamic set efficiently transfers previously learned useful knowledge to the arriving data objects. Extensive experimental results on multi-view data stream datasets demonstrate the effectiveness of the proposed MVRL method.
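As a rough illustration of how a closed-form Euclidean projection can produce sparse affinities, the sketch below projects one row of affinity scores onto the probability simplex with the standard sort-and-threshold procedure. The abstract does not state which constraint set MVRL projects onto, so the simplex constraint and the name `project_to_simplex` are assumptions made for this example only.

```python
def project_to_simplex(v, total=1.0):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = total}.

    Sort-and-threshold procedure: find a shift theta such that
    max(v_i - theta, 0) sums to `total`. Entries below theta become
    exactly zero, which is where the sparsity comes from.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - total) / j
        if u_j - candidate > 0:
            # Passing indices form a prefix; the last one gives the shift.
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


# Hypothetical row of raw affinity scores for one data object.
row = project_to_simplex([0.5, 0.3, -0.2])
# row is approximately [0.6, 0.4, 0.0]
```

The cost is dominated by the sort, so each row takes O(n log n) time, which is in keeping with the abstract's point that closed-form stages bound the computational cost.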
Multi-label data streams are sequences of multi-label instances arriving over time to a multi-label classifier. The properties of the stream may continuously change due to concept drift. Therefore, ...algorithms must constantly adapt to the new data distributions. In this paper we propose a novel ensemble method for multi-label drifting streams named Adaptive Ensemble of Self-Adjusting Nearest Neighbor Subspaces (AESAKNNS). It combines a self-adjusting kNN base classifier with the advantages of ensembles to adapt to concept drift in the multi-label environment. To promote diverse knowledge within the ensemble, each base classifier is given a unique subset of features and samples to train on. These samples are distributed to classifiers in a probabilistic manner that follows a Poisson distribution, as in online bagging. Accompanying these mechanisms, a collection of ADWIN detectors monitors each classifier for the occurrence of a concept drift on its subspace. Upon detection, the algorithm automatically trains additional classifiers in the background to attempt to capture new concepts on new subspaces of features. Dynamic classifier selection then chooses the most accurate classifiers from the active and background ensembles to replace the current ensemble. Our experimental study compares the proposed approach with 30 other classifiers, including problem transformation, algorithm adaptation, kNNs, and ensembles, on 30 diverse multi-label datasets and 12 performance metrics. Results, validated using non-parametric statistical analysis, support the better performance of AESAKNNS and highlight the contribution of its components in improving the performance of the ensemble.
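The sample and feature distribution described above can be sketched as follows. This is a minimal illustration of Poisson-weighted online bagging over random feature subspaces, not the AESAKNNS implementation: `CountingLearner` is a hypothetical stand-in for the self-adjusting kNN, the rate `lam=1.0` follows the usual online-bagging choice and is an assumption here, and drift detection and classifier replacement are left out.

```python
import math
import random


def poisson(lam, rng):
    """Draw one value from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class CountingLearner:
    """Hypothetical base learner: it only counts the updates it receives."""

    def __init__(self, feature_idx):
        self.feature_idx = feature_idx
        self.updates = 0

    def learn_one(self, x, y):
        # A real learner would store or fit on the projected instance.
        _projected = [x[i] for i in self.feature_idx]
        self.updates += 1


def build_ensemble(n_learners, n_features, subspace_size, rng):
    """Give every learner its own random subset of feature indices."""
    return [
        CountingLearner(sorted(rng.sample(range(n_features), subspace_size)))
        for _ in range(n_learners)
    ]


def train_on_instance(ensemble, x, y, rng, lam=1.0):
    """Each learner sees the instance k ~ Poisson(lam) times (possibly zero)."""
    for learner in ensemble:
        for _ in range(poisson(lam, rng)):
            learner.learn_one(x, y)
```

Because each learner draws its own count for every instance, some learners skip an instance while others see it several times, which is what makes the ensemble members differ even when they share features.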
This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework ...employing stratified bagging for training base classifiers to integrate data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated in computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift, as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, applied at both the bagging-classifier and base-estimator levels, were considered. Experimental results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing could outperform online and chunk-based state-of-the-art methods.
• Dynamic classifier selection for non-stationary imbalanced data stream.
• Forming classifier ensemble based on stratified bagging.
• Employing oversampling and undersampling techniques to prepare DSEL.
• Experiments show the effectiveness of preprocessed DES for difficult data streams.
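One plausible reading of stratified bagging is a bootstrap drawn separately within each class, so that every bag keeps the class proportions of the current chunk and the minority class is never lost. The sketch below follows that reading; the function name `stratified_bootstrap` and the choice to keep each class at its original size are assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict


def stratified_bootstrap(labels, rng):
    """Return indices of a bootstrap sample drawn class by class.

    Sampling with replacement inside each class keeps the per-class
    counts of the bag equal to those of the original chunk.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=len(indices)))
    return sample


# Hypothetical chunk with a 95:5 imbalance ratio.
chunk_labels = [0] * 95 + [1] * 5
bag = stratified_bootstrap(chunk_labels, random.Random(1))
```

A plain bootstrap over such a chunk can leave a bag with very few or no minority instances; drawing within each class avoids that before any oversampling or undersampling is applied.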
Wilcoxon Rank Sum Test Drift Detector
Barros, Roberto Souto Maior de; Hidalgo, Juan Isidro González; Cabral, Danilo Rafael de Lima
Neurocomputing (Amsterdam), 01/2018, Volume 275
Journal Article, Peer reviewed
Online learning concerns extracting information from large quantities of data (streams) that are usually affected by changes in the distribution (concept drift). Drift detectors are software components that estimate the ...positions of these changes so that the base learner can be replaced, ultimately improving accuracy. Statistical Test of Equal Proportions (STEPD) is a simple, well-known, efficient detector that uses a hypothesis test between two proportions to signal concept drifts. However, despite identifying the existing drifts close to their correct positions, STEPD tends to produce many false positives. This article examines the application of the Wilcoxon rank sum statistical test to concept drift detection, proposing WSTD. Experiments run in the MOA framework using four artificial dataset generators, with abrupt and gradual drift versions of three sizes, as well as seven real-world datasets, suggest that WSTD improves on the detections of STEPD and other methods, as well as on their accuracies, in many scenarios.
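To make the underlying test concrete, the sketch below computes a two-sided Wilcoxon rank sum test between an older and a recent window of per-instance prediction outcomes (1 for correct, 0 for wrong), using the normal approximation with the standard tie correction. It illustrates the statistical test only: the window contents, the function name `rank_sum_test`, and the idea of comparing the p-value against a significance level are assumptions for this example and do not reproduce WSTD's actual windowing or thresholds.

```python
import math


def rank_sum_test(older, recent):
    """Two-sided Wilcoxon rank sum test, normal approximation with ties.

    Returns (z, p_value), where z is the standardized rank sum of `recent`.
    """
    n1, n2 = len(older), len(recent)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in older] + [(v, 1) for v in recent])

    rank_sum_recent = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values and share the average rank.
        avg_rank = (i + 1 + j) / 2.0
        in_recent = sum(1 for k in range(i, j) if pooled[k][1] == 1)
        rank_sum_recent += avg_rank * in_recent
        t = j - i
        tie_term += t ** 3 - t
        i = j

    mean = n2 * (n + 1) / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 0.0, 1.0  # all values identical: no evidence of change
    z = (rank_sum_recent - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical windows: 90% accuracy earlier, 50% accuracy recently.
z, p = rank_sum_test([1] * 90 + [0] * 10, [1] * 15 + [0] * 15)
drift_suspected = z < 0 and p < 0.05  # 0.05 is an illustrative level
```

With binary outcomes almost every value is tied, so the tie correction matters: without it the variance is overestimated and the test becomes needlessly conservative.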
Recently, skyline query processing over data streams has gained a lot of attention, especially from the database community, owing to its unique challenges. Skyline queries aim at pruning a search ...space of a potentially large multi-dimensional set of objects by keeping only those objects that are not worse than any other. Although an abundance of skyline query processing techniques have been proposed, there is a lack of a Systematic Literature Review (SLR) on current research works pertinent to skyline query processing over data streams. In this regard, this paper provides a comparative study of the state-of-the-art approaches over the period between 2000 and 2022, with the main aim of helping readers understand the key issues that are essential to consider when processing skyline queries over streaming data. Seven digital databases were reviewed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) procedures. After applying both the inclusion and exclusion criteria, 23 primary papers were further examined. The results show that the identified skyline approaches are driven by the need to expedite skyline query processing, mainly because data streams are time varying (time sensitive), continuous, real time, volatile, and unrepeatable. Although these skyline approaches are tailor-made for data streams with a common aim, their solutions vary to suit the various aspects being considered, which include the type of skyline query, type of streaming data, type of sliding window, query processing technique, indexing technique, and the data stream environment employed. In this paper, a comprehensive taxonomy is developed along with the key aspects of each reported approach, while several open issues and challenges related to the topic are highlighted as recommendations for future research directions.
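The skyline definition above can be made concrete with a small sketch. It assumes that smaller values are better in every dimension and recomputes the skyline from scratch over a count-based sliding window, which is the naive baseline that the surveyed approaches try to speed up; the class name `SlidingWindowSkyline` and the hotel-style data are illustrative only.

```python
from collections import deque


def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def skyline(points):
    """Keep the points that no other point dominates (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


class SlidingWindowSkyline:
    """Naive skyline over the most recent `size` stream items."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest item expires automatically

    def insert(self, point):
        self.window.append(point)
        return skyline(list(self.window))


# Hypothetical (price, distance) pairs; lower is better for both.
offers = [(50, 8), (60, 3), (80, 2), (70, 5), (55, 9)]
result = skyline(offers)
# result == [(50, 8), (60, 3), (80, 2)]
```

Here (70, 5) is dominated by (60, 3) and (55, 9) by (50, 8). In a stream, a dominated point can re-enter the skyline once the point that dominated it expires from the window, which is why streaming approaches cannot simply discard dominated points.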
With the development of Internet of Things technology, the amount of data generated by Internet of Things systems is increasing, and these data are continuously transmitted to the data ...center. The data processing and analysis of traditional Internet of Things systems are inefficient and cannot handle such a large number of data streams. In addition, IoT smart devices are resource-limited, which cannot be ignored when analyzing data. This paper proposes a new architecture, ApproxECIoT (Approximate Edge Computing Internet of Things), suitable for real-time data stream processing in the Internet of Things. It implements a self-adjusting stratified sampling algorithm to process real-time data streams. The algorithm adjusts the sample size of each stratum according to the variance of that stratum while staying within the given memory budget. This helps improve the accuracy of the calculation results when resources are limited. Finally, an experimental analysis was performed using synthetic and real-world datasets. The results show that ApproxECIoT can still obtain high-accuracy calculation results when using memory resources similar to those of simple random sampling (SRS). In the case of synthetic data streams, when the sampling ratio is 10%, the accuracy loss of ApproxECIoT is reduced by 89.6% compared with CalculIoT and by 99.8% compared with SRS. In the case of the real data stream from a wireless sensor network, the performance of ApproxECIoT is not the best, but as the sampling ratio increases, the accuracy loss of ApproxECIoT decreases more than that of the other frameworks.
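A variance-driven allocation under a fixed budget can be sketched as follows. The sketch tracks each stratum's variance online with Welford's method and splits the budget in proportion to stratum size times standard deviation (Neyman allocation). The abstract does not say which allocation rule ApproxECIoT uses, so Neyman allocation, the names `RunningStats` and `allocate`, and the stratum labels are all assumptions for illustration.

```python
import math


class RunningStats:
    """Welford's online mean and variance for one stratum."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def allocate(stats_by_stratum, budget):
    """Split `budget` sample slots across a non-empty set of strata.

    Weights are n_h * std_h; flooring keeps the total within the budget.
    """
    weights = {name: s.n * s.std for name, s in stats_by_stratum.items()}
    total = sum(weights.values())
    if total == 0:
        share = budget // len(stats_by_stratum)
        return {name: share for name in stats_by_stratum}
    return {name: int(budget * w / total) for name, w in weights.items()}


# Hypothetical strata: a steady sensor and a bursty sensor.
stats = {"steady": RunningStats(), "bursty": RunningStats()}
for i in range(100):
    stats["steady"].update(9.0 if i % 2 == 0 else 11.0)
    stats["bursty"].update(0.0 if i % 2 == 0 else 20.0)
slots = allocate(stats, budget=110)
```

The bursty stratum receives most of the slots because its values vary more, so each of its samples reduces the estimation error more than a sample from the steady stratum would.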
Network security has always been a concern because it remains an unresolved problem. Unlike signature-based methods, anomaly-based methods can detect novel attacks and thus have gained ...increasing attention over the past decades. However, because huge and unbounded network data samples arrive continuously at an unprecedented rate and constantly evolve, building a precise pattern of normal network behavior has become extremely difficult. In this study, an evolving anomaly detection method for network streaming data is proposed. In the incremental updating phase, clusters are incrementally updated as new network samples arrive. In the anomaly detection phase, outliers, which include both global and local outliers, are detected using local density and global density thresholds. Meanwhile, a buffer is used to store temporary outliers, which may subsequently become normal samples, to avoid normal network samples being deleted as outliers.
Three prominent streaming datasets (packet-based KDDCUP’99, NSL_KDD, and flow-based CIDDS-001) are used to validate the proposed algorithm. The proposed algorithm achieves the best detection rate, which is nearly 100% on KDDCUP’99 and CIDDS-001. The false positive rate and accuracy are 0.0125 and 0.9886 on CIDDS-001, respectively. Experimental results indicate that the proposed algorithm can perform real-time network anomaly detection with much lower time and memory costs, and that it outperforms other unsupervised anomaly detection methods and most supervised anomaly detection methods reported in the literature in terms of detection rate, false-positive rate, and detection accuracy.
Learning from non-stationary data streams is challenging because the streams are of infinite length and evolve over time. Existing works often concentrate on the concept-drift problem ...in data streams. Concept evolution, in which novel classes emerge in data streams, has gained growing attention recently due to its practical value in many real-world applications. Therefore, designing a robust learning model for data streams that handles concept drift, concept evolution, and outliers simultaneously is of significant importance. To this end, we propose a new data stream classification approach, called EMC, which dynamically learns Evolving Micro-Clusters to examine both concept drift and concept evolution. Specifically, to capture time-changing concepts, EMC dynamically maintains a set of online micro-clusters and learns their importance with error-based representative learning. Building upon the evolving micro-clusters, a novel class detector based on a local density perspective is introduced, which allows handling data streams with complex class distributions. Moreover, EMC can distinguish concept drift and evolution from noisy instances. Extensive experiments on both synthetic and real-world data sets show that our method achieves good classification and novel class detection performance compared to state-of-the-art algorithms.