Multivariate time series are widely used in industrial equipment monitoring and maintenance, health monitoring, weather forecasting, and other fields. Due to sensor malfunctions, equipment failures, environmental interference, and human error, collected multivariate time series usually contain missing values. Missing values obscure the regularities in the data and seriously hinder further analysis and application of multivariate time series. Conventional imputation methods, such as statistical imputation and machine learning-based imputation, cannot learn the latent relationships in the data and are therefore ill-suited to imputing missing values in multivariate time series. This paper proposes a novel Time and Location Gated Recurrent Unit (TLGRU), which takes into account the non-fixed time intervals and location intervals in multivariate time series and effectively handles missing values. We made the necessary modifications to the architecture of the end-to-end imputation model E²GAN and replaced the Gated Recurrent Unit for Imputation (GRUI) with TLGRU to make the generated fake samples closer to the original samples. Experiments on a public meteorological dataset show that our method outperforms the baselines in imputation accuracy and achieves a new state-of-the-art result.
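TLGRU itself is not specified in this abstract, but the general mechanism such cells build on — discounting a recurrent hidden state according to irregular time gaps, as in GRUI-style units — can be sketched as follows. The decay parameters `w` and `b` and the function names are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def time_decay(delta, w, b):
    """Decay factor from an elapsed-time interval: larger gaps -> stronger decay."""
    return np.exp(-np.maximum(0.0, w * delta + b))

def decayed_hidden(h_prev, delta, w=1.0, b=0.0):
    """Shrink the previous hidden state toward zero according to the time gap
    before feeding it into the next recurrent step (toy illustration)."""
    return time_decay(delta, w, b) * h_prev

h = np.array([0.8, -0.5])
print(decayed_hidden(h, delta=0.0))  # no gap: state passes through unchanged
print(decayed_hidden(h, delta=2.0))  # gap of 2: state shrunk by exp(-2)
```

The intuition is that the longer the gap since the last observation, the less the past hidden state should influence the imputation of the current value.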
•We apply random feature functions to improve data-stream learners.
•We improve Hoeffding tree, nearest neighbor, and gradient descent methods.
•We further extend to GPUs and neural networks using a random projection layer.
•We obtain encouraging positive results on real-world datasets.
•We highlight important issues and promising future work in streaming classification.
Big Data streams are being generated faster, at larger scale, and in more commonplace settings. In this scenario, Hoeffding Trees are an established method for classification. Several extensions exist, including high-performing ensemble setups such as online and leveraging bagging. k-nearest neighbors is also a popular choice, with most extensions addressing its inherent performance limitations over a potentially infinite stream.
At the same time, gradient descent methods are becoming increasingly popular, owing in part to the successes of deep learning. Although deep neural networks can learn incrementally, they have so far proved too sensitive to hyper-parameter options and initial conditions to be considered an effective ‘off-the-shelf’ data-streams solution.
In this work, we look at combinations of Hoeffding-trees, nearest neighbor, and gradient descent methods with a streaming preprocessing approach in the form of a random feature functions filter for additional predictive power.
We further extend the investigation to implementing methods on GPUs, which we test on some large real-world datasets, and show the benefits of using GPUs for data-stream learning due to their high scalability.
Our empirical evaluation yields positive results for the novel approaches that we experiment with, highlights important issues, and sheds light on promising future directions in approaches to data-stream classification.
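As a rough illustration of the random feature functions filter described above, the sketch below projects each incoming instance through fixed random Fourier-style basis functions before it would reach any streaming learner. The class name, feature count, and bandwidth parameter `gamma` are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomFeatureFilter:
    """Streaming preprocessing filter: map each incoming instance through
    fixed random basis functions (random Fourier features) so that a simple
    downstream learner gains non-linear predictive power."""
    def __init__(self, n_in, n_features=50, gamma=1.0):
        # Weights are drawn once and frozen; no fitting is required,
        # which makes the filter suitable for one-pass stream processing.
        self.W = rng.normal(scale=np.sqrt(2 * gamma), size=(n_in, n_features))
        self.b = rng.uniform(0, 2 * np.pi, n_features)

    def transform(self, x):
        return np.cos(x @ self.W + self.b)

f = RandomFeatureFilter(n_in=3)
z = f.transform(np.array([0.1, -0.2, 0.3]))
print(z.shape)  # (50,)
```

Because the projection is fixed, the filter adds constant per-instance cost and composes naturally with any incremental learner placed after it.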
The Internet of Things was born from the proliferation of connected objects and is known as the third era of information technology. It results in the availability of a huge amount of continuously acquired data which need to be processed to become more valuable. This leads to a real paradigm shift: instead of processing fixed data like classical databases or files, the new algorithms have to deal with data streams, which bring their own set of requirements. Researchers face new challenges in storing, querying, and processing these data, which are always in motion. In many decision-making scenarios, fuzzy expert systems have been useful for deducing more conceptual knowledge from data. With the emergence of the Internet of Things and the growing presence of cloud-based architectures, it is necessary to improve fuzzy expert systems to support higher-level operators, large rule bases, and an abundant flow of inputs. In this paper, we introduce a modular fuzzy expert system which takes data or event streams as input and outputs decisions on the fly. Its architecture relies on both a graph-based representation of the rule base and the cooperation of four customizable modules. Stress tests regarding the number of rules have been carried out to characterize its efficiency.
The adoption of weighted regularized extreme learning machines (WR-ELMs) has been recognized as an effective approach to addressing class imbalance by differentially weighting sample classes. Traditional batch learning methodologies, however, falter due to their inefficiency in adapting to network restructuring and their inability to process streaming data. By introducing incremental and sequential learning, the incremental weighted regularized extreme learning machine (IWR-ELM) and the online weighted regularized extreme learning machine (OWR-ELM) are proposed in this paper to enhance WR-ELM’s flexibility and responsiveness. Specifically, the IWR-ELM facilitates optimal hidden layer node selection, thereby enhancing model adaptability without necessitating full retraining. Conversely, the OWR-ELM is engineered for real-time data stream processing, enabling continuous learning from new data segments without retaining outdated information. We also address the concurrent challenges of concept drift and class imbalance by presenting an enhanced online weighted regularized extreme learning machine, which incorporates enhancement factors to elevate the significance of recent data. Finally, extensive experiments on diverse class-imbalanced datasets affirm the competitiveness of our proposed algorithms in terms of training speed and accuracy.
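The sequential learning idea behind OWR-ELM can be illustrated with the standard (unweighted) online sequential ELM recursion: fit a regularized least-squares output layer on an initial chunk, then fold in each new chunk with a recursive update rather than retraining. This is a minimal sketch of that recursion, not the weighted variant proposed in the paper; all function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_output(X, Win, bias):
    """Fixed random hidden layer of an ELM (sigmoid activation)."""
    return 1.0 / (1.0 + np.exp(-(X @ Win + bias)))

def elm_init(H0, T0, C=1e3):
    """Regularized least-squares fit of the output weights on the initial chunk."""
    P = np.linalg.inv(H0.T @ H0 + np.eye(H0.shape[1]) / C)
    return P, P @ H0.T @ T0

def elm_update(P, beta, Hk, Tk):
    """Recursive update on a new chunk (Hk, Tk); old data are never revisited."""
    K = np.linalg.inv(np.eye(Hk.shape[0]) + Hk @ P @ Hk.T)
    P = P - P @ Hk.T @ K @ Hk @ P
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)
    return P, beta
```

A useful property of this recursion is that after processing all chunks, `beta` coincides exactly with the batch regularized least-squares solution on the full data, so nothing is lost by never revisiting old chunks.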
Heterogeneous mobile, sensor, IoT, smart environment, and social networking applications have recently started to produce unbounded, fast, and massive-scale streams of data that have to be processed “on the fly”. Systems that process such data have to be enhanced with detection for operational exceptions and with triggers for both automated and manual operator actions. In this paper, we illustrate how tracing in distributed data processing systems can be applied to detecting changes in data and operational environment to maintain the efficiency of heterogeneous data stream processing systems under potentially changing data quality and distribution. By the tracing of individual input records, we can (1) identify outliers in a web crawling and document processing system and use the insights to define URL filtering rules; (2) identify heavy keys, such as NULL, that should be filtered before processing; (3) give hints to improve the key-based partitioning mechanisms; and (4) measure the limits of overpartitioning if heavy thread-unsafe libraries are imported.
By using Apache Spark as illustration, we show how various data stream processing efficiency issues can be mitigated or optimized by our distributed tracing engine. We describe and qualitatively compare two different designs, one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing in multiple systems across boundaries and to provide detailed performance measurements suitable for automated optimization, not just debugging.
•We consider holistic tracing of record lineages. Our goal is to detect inefficiencies to increase the performance of the compute topology, reduce tail latency, and better utilize the underlying platform. In this paper we focus on the tracing design and practical problems that can be solved using our framework. We present a generic tracing framework design for batch and streaming DDPS.
•We provide two different prototype implementations, both built with minimal code impact on Apache Spark.
•We experiment with traced Spark applications to obtain low-level UDF metrics and a detailed representation of the causality of individual records.
•We measure and compare the overhead of our tracing frameworks. We identify direct reporting as a low-impact solution for monitoring system efficiency.
•Using the tracing framework, we show that common complex, inter- or intra-system data pipelines can be optimized by identifying issues that are hard to detect otherwise.
Secure Provenance Transmission for Streaming Data. Sultana, S.; Shehab, M.; Bertino, E. IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, Aug. 2013. Journal article, peer reviewed.
Many application domains, such as real-time financial analysis, e-healthcare systems, and sensor networks, are characterized by continuous data streaming from multiple sources and through intermediate processing by multiple aggregators. Keeping track of data provenance in such a highly dynamic context is an important requirement, since data provenance is a key factor in assessing data trustworthiness, which is crucial for many applications. Provenance management for streaming data requires addressing several challenges, including the assurance of high processing throughput, low bandwidth consumption, storage efficiency, and secure transmission. In this paper, we propose a novel approach to securely transmit provenance for streaming data (focusing on sensor networks) by embedding provenance into the inter-packet timing domain while addressing the above-mentioned issues. As the provenance is hidden in another host medium, our solution can be conceptualized as a watermarking technique. However, unlike traditional watermarking approaches, we embed provenance over the inter-packet delays (IPDs) rather than in the sensor data themselves, hence avoiding the problem of data degradation due to watermarking. Provenance is extracted by the data receiver utilizing an optimal threshold-based mechanism which minimizes the probability of provenance decoding errors. The resiliency of the scheme against outside and inside attackers is established through an extensive security analysis. Experiments show that our technique can recover provenance up to a certain level against perturbations to inter-packet timing characteristics.
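The embedding idea — hiding provenance bits in inter-packet delays and recovering them with a threshold — can be caricatured in a few lines. The delay constants, Gaussian jitter model, and midpoint threshold below are toy assumptions; the paper derives an optimal threshold that minimizes the decoding-error probability:

```python
import numpy as np

BASE, MARK = 0.01, 0.05  # hypothetical short/long inter-packet delays (seconds)

def embed(bits):
    """Encode provenance bits as inter-packet delays: 0 -> BASE, 1 -> MARK."""
    return [MARK if b else BASE for b in bits]

def extract(delays, threshold=(BASE + MARK) / 2):
    """Recover bits by thresholding each observed delay at the midpoint."""
    return [1 if d > threshold else 0 for d in delays]

bits = [1, 0, 1, 1, 0]
rng = np.random.default_rng(0)
noisy = [d + rng.normal(0, 0.002) for d in embed(bits)]  # network timing jitter
print(extract(noisy))  # decoding survives small timing perturbations
```

Because the provenance rides on timing rather than on the payload, the sensor readings themselves are never modified, which is the degradation-avoidance point made in the abstract.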
Tourism crowdsourcing platforms accumulate and use large volumes of feedback data on tourism-related services to provide personalized recommendations with high impact on future tourist behavior. Typically, these recommendation engines build individual tourist profiles and suggest hotels, restaurants, attractions, or routes based on the shared ratings, reviews, photos, videos, or likes. Due to the dynamic nature of this scenario, where the crowd produces a continuous stream of events, we have been exploring stream-based recommendation methods, using stochastic gradient descent (SGD) to incrementally update the prediction models and post-filters to reduce the search space and improve the recommendation accuracy. In this context, we offer an update and comment on our previous article (Veloso et al., 2019a) by providing a recent literature review and identifying the challenges lying ahead concerning the online recommendation of tourism resources supported by crowdsourced data.
•A method for detecting the source of abnormal data is proposed.
•An optimized clustering methodology is adopted.
•Results show that the accuracy of data anomaly detection is improved.
When detecting abnormal data in a sensor network data stream, it is necessary to accurately identify the source of the abnormal data. Traditional data stream clustering algorithms suffer from large losses of clustering information and low accuracy. Therefore, this paper proposes a sensor network data stream anomaly detection method based on optimized clustering. First, the proposed sampling algorithm is used to sample the data stream, and the sampling result is used as the sample set. A dynamic data histogram is used to divide the data dimensions into different dimension groups, the maximum-entropy partition of the dimension space is computed for each dimension, and data belonging to the same dimension cluster are aggregated into micro-clusters. Anomaly detection on the data stream is realized by comparing the information entropy of each micro-cluster and its distribution characteristics. The experimental results show that the proposed algorithm improves the accuracy and effectiveness of data stream anomaly detection.
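The final detection step — comparing the information entropy of a micro-cluster against expected distribution characteristics — can be sketched as follows. The reference distribution and the 0.5-bit deviation threshold are illustrative assumptions, not the paper's criterion:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a discrete distribution given raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def is_anomalous(cluster_counts, reference_counts, threshold=0.5):
    """Flag a micro-cluster whose entropy deviates from a reference
    micro-cluster's entropy by more than `threshold` bits (toy criterion)."""
    return abs(entropy(cluster_counts) - entropy(reference_counts)) > threshold

print(entropy([1, 1, 1, 1]))  # 2.0 bits: uniform over four symbols
print(is_anomalous([8, 0, 0, 0], [1, 1, 1, 1]))  # concentrated mass -> flagged
```

A micro-cluster whose value distribution collapses onto a single symbol (entropy near zero) or spreads far beyond the reference spread stands out under such an entropy comparison.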
Novelty detection and concept drift detection are essential for a plethora of machine learning applications. The statistical properties of application-generated data change over time in the streaming environment, a phenomenon known as concept drift. These changes have a profound influence on the learning model’s performance. Along with concept drift, the emergence of new classes (i.e., novel class/novelty detection) is also challenging in the non-stationary distribution of data. Novel class detection determines whether incoming data points of a data stream are unknown or unusual. The paper presents a survey focusing on the challenges encountered while dealing with real-time data. In addition, a chronological discussion of the various existing novelty detectors, with their advantages, limitations, critical points, different research prospects, and future directions, is also incorporated in the paper.
Continual learning from data streams is among the most important topics in contemporary machine learning. One of the biggest challenges in this domain lies in creating algorithms that can continuously adapt to arriving data. However, previously learned knowledge may become outdated, as streams evolve over time. This phenomenon is known as concept drift and must be detected to facilitate efficient adaptation of the learning model. While there exists a plethora of drift detectors, all of them assume that we are dealing with roughly balanced classes. In the case of imbalanced data streams, those detectors will be biased towards the majority classes, ignoring changes happening in the minority ones. Furthermore, class imbalance may evolve over time and classes may change their roles (majority becoming minority and vice versa). This is especially challenging in the multi-class setting, where relationships among classes become complex. In this paper, we propose a detailed taxonomy of challenges posed by concept drift in multi-class imbalanced data streams, as well as a novel trainable concept drift detector based on a Restricted Boltzmann Machine. It is capable of monitoring multiple classes at once and using reconstruction error to detect changes in each of them independently. Our detector utilizes a skew-insensitive loss function that allows it to handle multiple imbalanced distributions. Due to its trainable nature, it is capable of following changes in a stream and evolving class roles, and it can deal with local concept drift occurring in minority classes. An extensive experimental study on multi-class drifting data streams, enriched with a detailed analysis of the impact of local drifts and changing imbalance ratios, confirms the high efficacy of our approach.
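The monitoring logic — tracking a reconstruction-error statistic per class and flagging a change in any class independently — can be sketched without the Restricted Boltzmann Machine itself. The warm-up length and the mean-plus-k-sigma test below are illustrative assumptions standing in for the paper's trainable detector:

```python
import math
from collections import defaultdict

class PerClassDriftMonitor:
    """Track a running mean/variance of reconstruction error per class
    (Welford's algorithm) and flag drift for a class when a new error
    exceeds mean + k * std for that class alone."""
    def __init__(self, k=3.0):
        self.k = k
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])  # n, mean, M2

    def update(self, label, error):
        n, mean, M2 = self.stats[label]
        drift = False
        if n >= 30:  # assumed warm-up before testing for drift
            std = math.sqrt(M2 / (n - 1))
            drift = error > mean + self.k * std
        # Fold the new observation into the running statistics (Welford).
        n += 1
        d = error - mean
        mean += d / n
        M2 += d * (error - mean)
        self.stats[label] = [n, mean, M2]
        return drift
```

Keeping separate statistics per class is what lets a drift confined to a minority class surface even while the majority classes, and hence any global error average, remain stable.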