• Lightweight and flexible extension of online ensembles with abstaining classifiers.
• Dynamic selection of base classifiers to exploit their underlying diversity.
• Improved recovery due to promoting classifiers correctly anticipating concept drifts.
• Increased robustness to the presence of noise in data streams.
• Thorough experimental study with analysis on 12 canonical and 120 noisy streams.
Mining data streams is among the most vital contemporary topics in machine learning. Such a scenario requires adaptive algorithms that can process constantly arriving instances, adapt to potential changes in the data, use limited computational resources, and remain robust to any atypical events that may appear. Ensemble learning has proven itself to be an effective solution, as combining learners leads to improved predictive power, more flexible drift handling, and ease of implementation in high-performance computing environments. In this paper, we propose an enhancement of popular online ensembles by augmenting them with an abstaining option. Instead of relying on traditional voting, classifiers are allowed to abstain from contributing to the final decision. Their confidence level is monitored for each incoming instance, and only learners that exceed a certain threshold are selected. We introduce a dynamic and self-adapting threshold that adapts to changes in the data stream by monitoring the outputs of the ensemble, which allows the underlying diversity to be exploited in order to anticipate drifts efficiently. Additionally, we show that forcing uncertain classifiers to abstain from making a prediction is especially useful for noisy data streams. Our proposal is a lightweight enhancement that can be applied to any online ensemble method, improving its robustness to drifts and noise. A thorough experimental analysis, validated through statistical tests, demonstrates the usefulness of the proposed approach.
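The abstaining scheme described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the confidence-weighted vote, the fallback when every member abstains, and the threshold update rule (raise after an ensemble error, lower after a correct prediction) are all assumptions made for illustration.

```python
from collections import defaultdict


def abstaining_vote(predictions, threshold):
    """Combine (label, confidence) pairs, ignoring members below the threshold.

    Assumption: votes are weighted by confidence; the abstract does not
    specify the combination rule.
    """
    votes = defaultdict(float)
    for label, confidence in predictions:
        if confidence >= threshold:  # members below the threshold abstain
            votes[label] += confidence
    if not votes:
        # Assumed fallback: if every member abstains, trust the most confident one.
        return max(predictions, key=lambda p: p[1])[0]
    return max(votes, key=votes.get)


def update_threshold(threshold, ensemble_correct, step=0.05, low=0.5, high=0.95):
    """Hypothetical self-adapting rule, clamped to [low, high].

    After an ensemble error the threshold is raised (demand more confidence);
    after a correct prediction it is lowered (let more members vote).
    """
    if ensemble_correct:
        threshold -= step
    else:
        threshold += step
    return min(high, max(low, threshold))


# One incoming instance, three ensemble members.
member_outputs = [("a", 0.9), ("b", 0.6), ("b", 0.55)]
decision = abstaining_vote(member_outputs, threshold=0.7)  # only the first member votes
```

With a threshold of 0.7 the two uncertain members abstain and the single confident member decides; with a threshold of 0.5 all three vote and the combined weight of the two weaker members wins.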
Mode tracking using multiple data streams
Bouguelia, Mohamed-Rafik; Karlsson, Alexander; Pashami, Sepideh ...
Information fusion, 09/2018, Volume 43
Journal Article, Peer reviewed
• A method for tracking high-level conceptual modes from multiple streams is proposed.
• It is based on aggregating signals into features, clustering, and Bayesian tracking.
• Results show an accurate detection of mode transitions, enabling real-time tracking.
Most existing work in information fusion focuses on combining information with well-defined meaning towards a concrete, pre-specified goal. In contrast, we aim for autonomous discovery of high-level knowledge from ubiquitous data streams. This paper introduces a method for recognition and tracking of hidden conceptual modes, which are essential to fully understand the operation of complex environments, and an important step towards building truly intelligent aware systems. We consider a scenario of analyzing usage of a fleet of city buses, where the objective is to automatically discover and track modes such as highway route, heavy traffic, or aggressive driver, based on available on-board signals. The method we propose is based on aggregating the data over time, since the high-level modes are only apparent over a longer time span. We search through different features and subsets of the data, and identify those that lead to good clusterings, interpreting those clusters as initial, rough models of the prospective modes. We use Bayesian tracking to continuously improve the parameters of those models based on new data, while at the same time following how the modes evolve over time. Experiments with artificial data of varying degrees of complexity, as well as on real-world datasets, demonstrate the effectiveness of the proposed method in accurately discovering the modes and in identifying which one best explains the current observations from multiple data streams.
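The tracking step, deciding which mode best explains the current observations, can be sketched as a discrete Bayes filter. This is only an illustration under strong assumptions, not the method from the paper: each mode is modelled as a fixed 1-D Gaussian over a single aggregated feature (a hypothetical mean speed), modes persist with an assumed probability `stay_prob`, and the mode parameters are not refined over time as the abstract describes.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def track_modes(features, modes, stay_prob=0.9):
    """Return the most probable mode after each aggregated feature value.

    modes: dict mapping a mode name to (mean, std); hypothetical cluster models.
    stay_prob: assumed probability that the mode does not change between steps.
    """
    names = list(modes)
    n = len(names)
    belief = {m: 1.0 / n for m in names}  # uniform prior over modes
    switch = (1.0 - stay_prob) / (n - 1) if n > 1 else 0.0
    path = []
    for x in features:
        # Prediction step: a mode persists or switches uniformly to another one.
        predicted = {m: stay_prob * belief[m] + switch * (1.0 - belief[m]) for m in names}
        # Update step: weight each mode by how well it explains the observation.
        posterior = {m: predicted[m] * gaussian_pdf(x, *modes[m]) for m in names}
        total = sum(posterior.values())
        # If every likelihood underflows to zero, keep the predicted belief.
        belief = {m: p / total for m, p in posterior.items()} if total > 0 else predicted
        path.append(max(belief, key=belief.get))
    return path, belief


# Hypothetical mode models over mean speed in km/h.
mode_models = {"highway": (90.0, 10.0), "heavy_traffic": (20.0, 8.0)}
path, final_belief = track_modes([88, 92, 85, 25, 18, 22], mode_models)
```

The filter follows the highway mode for the first three values and switches to heavy traffic once the aggregated speed drops, which is the kind of mode transition the abstract refers to.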
Twitter is among the fastest‐growing microblogging and online social networking services. Messages posted on Twitter (tweets) have been reporting everything from daily life stories to the latest local and global news and events. Monitoring and analyzing this rich and continuous user‐generated content can yield unprecedentedly valuable information, enabling users and organizations to acquire actionable knowledge. This article provides a survey of techniques for event detection from Twitter streams. These techniques aim at finding real‐world occurrences that unfold over space and time. In contrast to conventional media, event detection from Twitter streams poses new challenges. Twitter streams contain large amounts of meaningless messages and polluted content, which negatively affect the detection performance. In addition, traditional text mining techniques are not suitable, because of the short length of tweets, the large number of spelling and grammatical errors, and the frequent use of informal and mixed language. Event detection techniques presented in the literature address these issues by adapting techniques from various fields to the uniqueness of Twitter. This article classifies these techniques according to the event type, detection task, and detection method and discusses commonly used features. Finally, it highlights the need for public benchmarks to evaluate the performance of different detection approaches and various features.
Most stream classifiers are designed to process data incrementally, run in resource-aware environments, and react to concept drifts, i.e., unforeseen changes of the stream’s underlying data distribution. Ensemble classifiers have become an established research line in this field, mainly due to their modularity, which offers a natural way of adapting to changes. However, in environments where class labels are available after each example, ensembles which process instances in blocks do not react to sudden changes sufficiently quickly. On the other hand, ensembles which process streams incrementally do not take advantage of periodical adaptation mechanisms known from block-based ensembles, which offer accurate reactions to gradual and incremental changes. In this paper, we analyze if and how the characteristics of block and incremental processing can be combined to produce new types of ensemble classifiers. We consider and experimentally evaluate three general strategies for transforming a block ensemble into an incremental learner: online component evaluation, the introduction of an incremental learner, and the use of a drift detector. Based on the results of this analysis, we put forward a new incremental ensemble classifier, called Online Accuracy Updated Ensemble, which weights component classifiers based on their error in constant time and memory. The proposed algorithm was experimentally compared with four state-of-the-art online ensembles and provided the best average classification accuracy on real and synthetic datasets simulating different drift scenarios.
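Weighting components by their error in constant time and memory can be sketched as follows. This is a simplified illustration, not the Online Accuracy Updated Ensemble itself: it assumes an exponentially decayed squared error per component (one float of state per member) and an inverse-error weight, and the class name `WeightedMember`, the `decay` parameter, and the `eps` constant are hypothetical.

```python
class WeightedMember:
    """Wraps a component classifier and tracks its error with O(1) state."""

    def __init__(self, predict, decay=0.9):
        self.predict = predict  # callable: x -> dict of label -> probability
        self.decay = decay      # assumed forgetting factor for old errors
        self.error = 0.0        # exponentially decayed squared error

    def update_error(self, x, true_label):
        # Squared error of the probability assigned to the true class.
        p_true = self.predict(x).get(true_label, 0.0)
        self.error = self.decay * self.error + (1.0 - self.decay) * (1.0 - p_true) ** 2

    def weight(self, eps=1e-6):
        # Assumed inverse-error weighting; eps avoids division by zero.
        return 1.0 / (self.error + eps)


def weighted_predict(members, x):
    """Return the label with the highest error-weighted probability mass."""
    scores = {}
    for member in members:
        w = member.weight()
        for label, prob in member.predict(x).items():
            scores[label] = scores.get(label, 0.0) + w * prob
    return max(scores, key=scores.get)


# One reliable member and two unreliable ones (fixed outputs for illustration).
good = WeightedMember(lambda x: {"pos": 0.9, "neg": 0.1})
bad_1 = WeightedMember(lambda x: {"pos": 0.2, "neg": 0.8})
bad_2 = WeightedMember(lambda x: {"pos": 0.2, "neg": 0.8})
ensemble = [good, bad_1, bad_2]

before = weighted_predict(ensemble, x=None)  # equal weights: the majority wins
for _ in range(20):                          # labels arrive after each example
    for member in ensemble:
        member.update_error(None, "pos")
after = weighted_predict(ensemble, x=None)   # the reliable member now dominates
```

Because each member keeps only a single running error value, the per-example update cost does not grow with the length of the stream.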
Data stream mining has been receiving increased attention due to its presence in a wide range of applications, such as sensor networks, banking, and telecommunication. One of the most important challenges in learning from data streams is reacting to concept drift, i.e., unforeseen changes of the stream's underlying data distribution. Several classification algorithms that cope with concept drift have been put forward; however, most of them specialize in one type of change. In this paper, we propose a new data stream classifier, called the Accuracy Updated Ensemble (AUE2), which aims at reacting equally well to different types of drift. AUE2 combines accuracy-based weighting mechanisms known from block-based ensembles with the incremental nature of Hoeffding Trees. The proposed algorithm is experimentally compared with 11 state-of-the-art stream methods, including single classifiers, block-based and online ensembles, and hybrid approaches in different drift scenarios. Out of all the compared algorithms, AUE2 provided the best average classification accuracy while proving to be less memory consuming than other ensemble approaches. Experimental results show that AUE2 can be considered suitable for scenarios involving many types of drift as well as static environments.
Many fields in our society now produce large numbers of data streams. How to mine interesting knowledge and patterns from a continuous data stream has become a problem that must be solved. Unlike conventional classification algorithms, data stream classification algorithms have to adjust their classification models as the data stream changes, because of concept drift. Conventional classification models, however, remain fixed once they are trained. To solve this problem, a dynamic extreme learning machine for data stream classification (DELM) is proposed. DELM uses an online learning mechanism to train an ELM as the basic classifier and trains a double hidden layer structure to improve the performance of the ELM. When a concept drift alert is raised, more hidden layer nodes are added to the ELM to improve the generalization ability of the classifier. If the value measuring concept drift reaches the upper limit or the accuracy of the ELM is low, the current classifier is deleted and the algorithm uses new data to train a new classifier so as to learn the new concept. The experimental results show that DELM improves classification accuracy and can adapt to a new concept in a short time.
The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning algorithms used in these frameworks still rely on an offline model, which cannot cope with data stream problems. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over the worker nodes in the cloud to learn large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. The AL strategy accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate an initial model to generate the final model. The final model represents the update of current large-scale data knowledge, which can be used to infer future data. Extensive experiments on this framework are validated by measuring the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost two times faster than Scalable PANFIS without AL. The results also show that both the rule merging and the voting mechanisms yield similar accuracy in general among Scalable PANFIS algorithms, and that they are generally better than Spark-based algorithms. In terms of running time, the Scalable PANFIS training time outperforms all Spark-based algorithms when classifying a multi-class label dataset.
• Adaptive iterations (AdIter) method helps model adapt to different drift severities.
• A simple bound analysis shows how the concept drift severity influences model error.
• Experiments on synthetic and real-world datasets show efficiency of AdIter method.
As an excellent ensemble algorithm, Gradient Boosting Decision Tree (GBDT) has been tested extensively with static data. However, real-world applications often involve dynamic data streams, which suffer from concept drift problems where the data distribution changes over time. The performance of a GBDT model degrades when it is applied to predict data streams with concept drift. Although incremental learning can help to alleviate such degradation, finding a perfect learning rate (i.e., the iteration in GBDT) that suits all time periods with all their different drift severity levels can be difficult. In this paper, we convert the issue of determining an optimal learning rate into the issue of choosing the best adaptive iterations when tuning GBDT. We theoretically prove that drift severity is closely related to the convergence rate of the model. Accordingly, we propose a novel drift adaptation method, called adaptive iterations (AdIter), that automatically chooses the number of iterations for different drift severities to improve the prediction accuracy for data streams under concept drift. In a series of comprehensive tests with seven state-of-the-art drift adaptation methods on both synthetic and real-world data, AdIter yielded superior accuracy levels.