•Enhancing drift adaptation in sparsely labeled data streams at no additional cost.
•Instance exploitation techniques to empower active learning and avoid underfitting.
•Ensemble architectures adaptively switching between risky and standard adaptation.
•Flexible framework that can be used to enhance any online active learning algorithm.
•In-depth analysis of enhanced drift adaptation via extensive experimental study.
Continual learning from streaming data sources is becoming increasingly popular due to the growing number of online tools and systems. Dealing with dynamic and everlasting problems poses new challenges for which traditional batch-based offline algorithms turn out to be insufficient in terms of computational time and predictive performance. One of the most crucial limitations is that we cannot assume access to a finite and complete data set; we always have to be ready for new data that may complement our model. This poses a critical problem of providing labels for potentially unbounded streams. In the real world, we are forced to deal with very strict budget limitations; therefore, we will most likely face a scarcity of annotated instances, which are essential in supervised learning. In our work, we emphasize this problem and propose a novel instance exploitation technique. We show that when (i) data is characterized by temporary non-stationary concepts, and (ii) there are very few labels spread across a long time horizon, it is actually better to risk overfitting and adapt models more aggressively by exploiting the only labeled instances we have, instead of sticking to a standard learning mode and suffering from severe underfitting. We present different strategies and configurations for our methods, as well as an ensemble algorithm that attempts to maintain a sweet spot between risky and normal adaptation. Finally, we conduct an in-depth comparative analysis of our methods, using state-of-the-art streaming algorithms relevant to the given problem.
Online learning and real-time data processing are becoming increasingly vital across various domains such as sensor networks, banking, and telecommunications. A significant challenge in this context is concept drift, wherein the statistical properties of the data change over time. Traditional drift detectors often struggle with high memory usage, long detection delays, prolonged runtime, and inconsistent accuracy. This paper introduces a novel Online Drift Detector that balances these four aspects. By processing data instance-by-instance, our proposed detector optimizes the trade-offs between detection delay, runtime, memory consumption, and accuracy. We incorporate a unique diversity calculation tailored for multi-label problems, ensuring swift drift detection with minimized memory usage and enhanced runtime efficiency. Comparative analyses show that our approach outperforms contemporary drift detection techniques, particularly in memory efficiency, detection speed, and accuracy. This work contributes to the field of online data stream processing by offering a refined strategy for timely and efficient concept drift detection across a wide range of applications.
In industrial settings, querying data streams from Internet of Things (IoT) devices benefits from utilizing elastic criteria to enhance the interpretability of the current state of the monitored environment. Fuzzy sets provide this elasticity, enabling the aggregation and representation of similar values in a human-comprehensible manner. However, many sensor signals exhibit temporal oscillations, leading to varying interpretations of the signal based on its current trend (rising or falling). This hysteresis in signal (and subsequently of the production device) interpretation inspired us to introduce this phenomenon into data stream processing, resulting in the novel concept of hysteretic fuzzy sets. This paper demonstrates how fuzzy searching and grouping can be applied to IoT sensor signals in flexible Big Data stream processing on Apache Kafka. We illustrate the impact of data stream querying with KSQL queries involving fuzzy sets (encompassing fuzzy filtering of data stream events, fuzzy transformation of data stream attributes, fuzzy grouping, and joining) on the flexibility of executed operations and computational resources utilized by the Kafka processing engine. Finally, our experiments with hysteretic fuzzy sets while analyzing sensor signals in power plants demonstrate that this novel approach effectively reduces the number of alarms while monitoring the state of the production machine.
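To make the idea of trend-dependent interpretation concrete, the following is a minimal Python sketch and not the paper's definition of a hysteretic fuzzy set. It assumes a linear ramp membership for a hypothetical "high" label whose breakpoints differ for rising and falling readings; the breakpoints, the 0.5 alarm cut, and the sample signal are all illustrative assumptions.

```python
def ramp(x, a, b):
    """Linear membership that rises from 0 at a to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def plain_membership(signal, bounds=(75.0, 85.0)):
    """Ordinary fuzzy set: the same ramp regardless of the trend."""
    return [ramp(v, *bounds) for v in signal]


def hysteretic_membership(signal, rising=(75.0, 85.0), falling=(65.0, 75.0)):
    """Assumed hysteretic variant: a lower ramp is used while the signal falls,
    so a reading keeps its "high" interpretation longer on the way down."""
    degrees = []
    prev = signal[0]
    for v in signal:
        a, b = rising if v >= prev else falling
        degrees.append(ramp(v, a, b))
        prev = v
    return degrees


def count_alarm_onsets(degrees, cut=0.5):
    """Count transitions from 'no alarm' to 'alarm' at the given membership cut."""
    onsets, active = 0, False
    for d in degrees:
        now = d >= cut
        if now and not active:
            onsets += 1
        active = now
    return onsets


# Illustrative oscillating sensor readings (made-up values).
signal = [75, 81, 79, 82, 78, 83, 69, 81]
plain_onsets = count_alarm_onsets(plain_membership(signal))
hysteretic_onsets = count_alarm_onsets(hysteretic_membership(signal))
print(plain_onsets, hysteretic_onsets)
```

On this made-up signal the plain set raises a new alarm on every upward swing, while the trend-dependent set merges the small oscillations into fewer alarm episodes, which is the kind of effect the abstract reports.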
With the proliferation of sensors and IoT technologies, stream data are increasingly stored and analyzed, but rarely combined, due to the heterogeneity of sources and technologies. Semantics are increasingly used to share sensory data, but not so much for annotating stream data. Semantic models for stream annotation are scarce, as semantics are generally heavy to process and not ideal for Internet of Things (IoT) environments, where the data are frequently updated. We present a lightweight model to semantically annotate streams, IoT-Stream. It takes advantage of the common knowledge sharing that semantics provide, while keeping the inferences and queries simple. Furthermore, we present a system architecture to demonstrate the adoption of the semantic model, and provide examples of instantiation of the system for different use cases. The system architecture is based on commonly used architectures in the field of IoT, such as web services, microservices and middleware. Our system approach includes the semantic annotations that take place in the pipeline of IoT services and sensory data analytics. It includes modules needed to annotate, consume, and query data annotated with IoT-Stream. In addition, we present tools that could be used in conjunction with the IoT-Stream model to facilitate the use of semantics in IoT.
Data stream mining has gained increasing attention in recent years due to its wide range of applications. In this paper, we propose a new selective prototype-based learning (SPL) method on evolving data streams, which dynamically maintains representative instances to capture the time-changing concepts, and makes predictions in a local fashion. As an instance-based learning model, SPL only maintains some important prototypes (i.e., ISet) via error-driven representativeness learning. The fast condensed nearest neighbor (FCNN) rule is further introduced to compress these prototypes, making the algorithm also applicable under memory constraints. To better distinguish noise from the instances associated with a newly emerging concept, a potential concept instance set (i.e., PSet) is used to store all misclassified instances. Relying on the potential concept instance set, a local-aware distribution-based concept drift detection approach is proposed. SPL has several attractive benefits: (a) it can fit evolving data streams very well while maintaining a small instance set; (b) it is capable of capturing both gradual and sudden concept drifts effectively; (c) it is well able to distinguish noise/outliers from drifting instances. Experimental results show that SPL has better classification performance than many other state-of-the-art algorithms.
Sketching Data Distribution by Rotation
Lei, Runze; Wang, Pinghui; Li, Rundong ...
IEEE Transactions on Knowledge and Data Engineering, 2023
Journal Article
Peer reviewed
Kernel density estimation is a useful method for estimating the probability distribution of data. It is a challenge to achieve efficient kernel density estimation, especially for large-scale and high-dimensional stream data. We propose the rotation kernel, a novel kernel function for density estimation. The rotation kernel density can be estimated quickly by a data structure named Rotation Kernel Density Sketch (RKDS). RKDS is a time- and memory-efficient method for kernel density estimation, even over data streams and distributed systems. RKDS is applicable for estimating density at specific points and also for representing data distribution. We provide theoretical analysis for the rotation kernel and RKDS. Furthermore, we apply RKDS to outlier detection, concept drift detection, and personalized federated learning. Experiments show that our method improves time efficiency by up to $3\times 10^{3}$ times compared with baselines. RKDS also provides comparable detection precision and better delay on outlier detection and concept drift detection tasks.
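For context, the baseline that such sketches aim to speed up can be written in a few lines. The Python sketch below is ordinary one-dimensional kernel density estimation with a Gaussian kernel; it is not the rotation kernel or RKDS, and the sample values and bandwidth are illustrative assumptions.

```python
import math


def gaussian_kde(samples, x, bandwidth):
    """Estimate the density at x as the average of Gaussian bumps centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


# Made-up one-dimensional data: a cluster near 1.0 and one far-away point.
samples = [1.0, 1.2, 0.9, 1.1, 5.0]
dense = gaussian_kde(samples, 1.0, bandwidth=0.5)   # inside the cluster
sparse = gaussian_kde(samples, 3.0, bandwidth=0.5)  # between cluster and outlier
print(dense > sparse)
```

Each query touches every stored sample, so the cost grows linearly with the amount of data seen. That per-query cost is what makes exact estimation hard over large streams and motivates compact summaries.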
•We discover the probabilistic frequent itemsets over uncertain data streams.
•Two algorithms, PFIMoS and PFIMoS+, are proposed to efficiently discover the results.
•Our methods can achieve substantial speedups over the state-of-the-art algorithms.
This paper considers the problem of mining probabilistic frequent itemsets in the sliding window of an uncertain data stream. We design an effective in-memory index named PFIT to store the data synopsis, so the current probabilistic frequent itemsets can be output in real time. We also propose a depth-first algorithm, PFIMoS, to build and maintain the PFIT dynamically in a bottom-up manner. Because computing the probabilistic support is time consuming, we propose a method to estimate the range of probabilistic support by using the support and expected support, which can greatly reduce the runtime and memory usage. Nevertheless, a massive number of probabilistic supports have to be computed when the minimum support is low over dense data, which may result in a drastic reduction of computing speed. We further address this problem with a heuristic rule-based algorithm, PFIMoS+, in which an error parameter is introduced to decrease the number of probabilistic support computations. Theoretical analysis and experimental studies demonstrate that our proposed algorithms can efficiently reduce computing time and memory, ensure fast and exact mining of probabilistic data streams, and markedly outperform the state-of-the-art algorithms TODIS-Stream (Sun et al., 2010) and FEMP (Akbarinia & Masseglia, 2013).
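The gap in cost between expected support and a probability over the support can be illustrated with a small Python sketch. It is not PFIT, PFIMoS, or PFIMoS+; it assumes that each transaction maps items to independent existence probabilities, and it computes the frequentness probability P(support >= min_support), a quantity commonly used in this setting that may differ from the paper's exact definition of probabilistic support. The window contents are made up.

```python
from collections import deque


def itemset_probability(transaction, itemset):
    """Probability that every item of `itemset` occurs in an uncertain transaction.
    Items are assumed independent; a missing item has probability 0."""
    p = 1.0
    for item in itemset:
        p *= transaction.get(item, 0.0)
    return p


def expected_support(window, itemset):
    """Sum of per-transaction occurrence probabilities: one pass over the window."""
    return sum(itemset_probability(t, itemset) for t in window)


def frequentness_probability(window, itemset, min_support):
    """P(support >= min_support), computed by dynamic programming over the
    distribution of the support count (quadratic in the window size)."""
    dist = [1.0]  # dist[k] = P(support == k) over the transactions seen so far
    for t in window:
        p = itemset_probability(t, itemset)
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += mass * p       # itemset present in this transaction
        dist = nxt
    return sum(dist[min_support:])


# Sliding window holding the three most recent uncertain transactions.
window = deque(maxlen=3)
stream = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.8},
    {"a": 0.5, "b": 1.0},
    {"a": 1.0, "b": 0.5},
]
for transaction in stream:
    window.append(transaction)

esup_a = expected_support(window, ["a"])
freq_a = frequentness_probability(window, ["a"], min_support=2)
freq_ab = frequentness_probability(window, ["a", "b"], min_support=2)
print(esup_a, freq_a, freq_ab)
```

Expected support needs a single pass, whereas the distribution over the support count needs a table update per transaction. That difference suggests why bounding the expensive quantity with cheaper ones can save time.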
In the era of the Internet of Everything, various wireless devices and sensors use spectrum, which is a precious and non-renewable resource, to communicate. Because the generated multi-source data streams are massive and heterogeneous, they make spectrum cognition difficult. As a result, unreasonable spectrum allocation strategies lead to low utilization of spectrum resources, and optimizing the allocation strategy can effectively improve spectrum utilization. To address the tendency of the genetic algorithm (GA) and the particle swarm optimization (PSO) algorithm to become trapped in local optima, an improved monarch butterfly algorithm is proposed. First, this paper employs the simulated annealing algorithm to select the migration rate, which increases the diversity of the monarch butterfly population. Second, a chaos mapping algorithm is used to improve the optimization ability and convergence speed. Finally, because the monarch butterfly algorithm itself easily falls into a local optimum and has no built-in mechanism for escaping it, the wolf pack updating operator is selected to generate new monarch butterfly individuals, which increases the diversity of the population. The experimental results show that the improved monarch butterfly algorithm outperforms the other two algorithms in terms of convergence speed and system revenue.
Data has become an integral part of our society in recent years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements cannot be met in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.