Inducing adaptive predictive models in real time from high-throughput data streams is one of the most challenging areas of Big Data Analytics. The fact that data streams are unbounded and may contain concept drifts (changes over time of the pattern encoded in the stream) imposes unique challenges in comparison with predictive data mining from batch data. Several real-time predictive data stream algorithms exist; however, most approaches are not naturally parallel and are thus limited in their scalability. This paper highlights the Micro-Cluster Nearest Neighbour (MC-NN) data stream classifier. MC-NN is based on statistical summaries of the data stream and a nearest neighbour approach, which makes it naturally parallel. In its serial version, MC-NN handles data streams incrementally: the data does not need to reside in memory. MC-NN is also able to adapt to concept drifts. This paper provides an empirical study of the serial algorithm's speed, adaptivity and accuracy. Furthermore, it discusses the new parallel implementation of MC-NN and its parallel properties, and provides an empirical scalability study.
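The summary-plus-nearest-neighbour principle behind MC-NN can be illustrated with a minimal sketch: each micro-cluster keeps only an instance count and per-dimension linear sums, so prediction reduces to a nearest-centroid lookup and learning to an incremental update that never stores raw instances. This is only an illustration of the idea, not the paper's algorithm; MC-NN's error counters and split/removal mechanisms for drift adaptation are omitted, and all names below are invented.

```python
from dataclasses import dataclass, field

@dataclass
class MicroCluster:
    # Statistical summary: class label, instance count, per-dimension linear sums.
    label: int
    n: int = 0
    ls: list = field(default_factory=list)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, x):
        # Incremental update: no raw instances are retained.
        if not self.ls:
            self.ls = [0.0] * len(x)
        self.n += 1
        self.ls = [s + xi for s, xi in zip(self.ls, x)]

def nearest(clusters, x):
    # Micro-cluster whose centroid has the smallest squared Euclidean distance to x.
    return min(clusters, key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(c.centroid(), x)))

def classify_and_learn(clusters, x, y=None):
    """Predict with the nearest micro-cluster; if the true label y
    arrives, absorb x into the nearest micro-cluster of that class
    (creating one when the class is new)."""
    pred = nearest(clusters, x).label if clusters else y
    if y is not None:
        same_class = [c for c in clusters if c.label == y]
        if same_class:
            nearest(same_class, x).absorb(x)
        else:
            mc = MicroCluster(label=y)
            mc.absorb(x)
            clusters.append(mc)
    return pred
```

Because each update touches only one summary, independent micro-clusters can be maintained and queried concurrently, which is the property that makes the approach naturally parallel.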
• A real-time data stream classifier adaptive to concept drift and robust to noise.
• A parallel implementation of the real-time data stream classifier.
• A discussion about using open source Big Data technologies for data stream mining.
• We propose treating spam detection on Twitter as an anomaly detection problem.
• 95 one-gram features are introduced to represent tweet content.
• Due to the nature of Twitter, stream mining techniques are utilized to detect spam.
• DenStream and StreamKM++ had roughly 98% detection with under 10% false positives.
• A combination of the algorithms above produced 100% detection with under 3% false positives.
The rapid growth of Twitter has triggered a dramatic increase in spam volume and sophistication. The abuse of certain Twitter components such as “hashtags”, “mentions”, and shortened URLs enables spammers to operate efficiently. These same features, however, may be a key factor in identifying new spam accounts, as shown in previous studies. Our study provides three novel contributions. Firstly, previous studies have approached spam detection as a classification problem, whereas we view it as an anomaly detection problem. Secondly, 95 one-gram features from tweet text were introduced alongside the user information analyzed in previous studies. Finally, to effectively handle the streaming nature of tweets, two stream clustering algorithms, StreamKM++ and DenStream, were modified to facilitate spam identification. Both algorithms clustered normal Twitter users, treating outliers as spammers. Each of these algorithms performed well individually, with StreamKM++ achieving 99% recall and a 6.4% false positive rate, and DenStream producing 99% recall and a 2.8% false positive rate. When used in conjunction, these algorithms reached 100% recall and a 2.2% false positive rate, meaning that our system was able to identify 100% of the spammers in our test while incorrectly detecting only 2.2% of normal users as spammers.
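The outliers-as-spammers idea can be sketched independently of DenStream and StreamKM++ themselves: maintain centroids over feature vectors of "normal" users and flag any point far from every centroid. The detector below is a toy stand-in, not either of those algorithms; the radius threshold and the cluster representation are invented purely for illustration.

```python
import math

class StreamOutlierDetector:
    """Toy illustration of outlier-based spam flagging on a stream:
    points near an existing centroid are absorbed as 'normal', points
    far from every centroid are flagged as anomalies (candidate
    spammers). Each cluster is a pair (count, per-dimension sums)."""

    def __init__(self, radius=1.0):
        self.radius = radius     # invented threshold, not from the paper
        self.clusters = []       # list of (n, linear_sums)

    def _dist(self, cluster, x):
        n, ls = cluster
        return math.dist([s / n for s in ls], x)

    def observe(self, x):
        # Returns True when x is an outlier w.r.t. existing clusters
        # (a new cluster is then started so repeats are not re-flagged).
        if self.clusters:
            best = min(self.clusters, key=lambda c: self._dist(c, x))
            if self._dist(best, x) <= self.radius:
                n, ls = best
                idx = self.clusters.index(best)
                self.clusters[idx] = (n + 1,
                                      [s + xi for s, xi in zip(ls, x)])
                return False
        self.clusters.append((1, list(x)))
        return True
```

A real deployment would, as in the paper, feed per-user features (tweet one-grams plus account statistics) into proper stream clustering algorithms rather than this fixed-radius heuristic.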
The critical need for classifying streaming data arises from its widespread use in real-world industries, where analyzing continuous, dynamic, and evolving data streams accurately and promptly is essential for informed decision-making and gaining predictive insights. Existing research mainly focuses on abundant supervision, overlooking the scarcity and delayed availability of labels, which can vary in timing. Addressing this, the article introduces a new learning method that employs a synchronization-based core support extraction technique. This technique is designed to manage changing concepts and delayed partial labeling by extracting key data points that act as pseudo-labels. Thanks to the concept of synchronization, these extracted key data points accurately represent the inherent local cluster structure in an intuitive manner and better maintain the class structure. Consequently, these pseudo-labels are utilized for classifying future incoming data batches. Furthermore, the method incorporates a knowledge base to summarize and represent all incoming streaming data. Building upon this knowledge base, an ensemble model for classification and an efficient new class detector are proposed. Both operate in a local fashion to ensure robust learning, even in complex class distributions. Evaluations on benchmark datasets reveal a consistent performance lead, surpassing established algorithms by up to 10% and achieving state-of-the-art results.
• A new learning method to handle concept drift and delayed labeling.
• A novel method for detecting emerging labels.
• An ensemble classifier for robust classification.
• The proposed method uses a knowledge base to summarize data.
• The proposed method is robust and yields state-of-the-art performance.
Nowadays, cyber-attacks have become a common and persistent issue affecting various human activities in modern societies. Due to the continuously evolving landscape of cyber-attacks and the growing concerns around “black box” models, there has been a strong demand for novel explainable and interpretable intrusion detection systems with online learning abilities. In this paper, a novel soft prototype-based autonomous fuzzy inference system (SPAFIS) is proposed for network intrusion detection. SPAFIS learns from network traffic data streams online on a chunk-by-chunk basis and autonomously identifies a set of meaningful, human-interpretable soft prototypes to build an IF-THEN fuzzy rule base for classification. Thanks to the utilization of soft prototypes, SPAFIS can precisely capture the underlying data structure and local patterns, and perform internal reasoning and decision-making in a human-interpretable manner based on the ensemble properties and mutual distances of data. To maintain a healthy and compact knowledge base, a pruning scheme is further introduced to SPAFIS, allowing it to periodically examine the learned solution and remove redundant soft prototypes from its knowledge base. Numerical examples on public network intrusion detection datasets demonstrate the efficacy of the proposed SPAFIS in both offline and online application scenarios, outperforming state-of-the-art alternatives.
In real-world applications, the geometric median is a natural quantity to consider for robust inference of location or central tendency, particularly when dealing with non-standard or irregular data distributions. An innovative online bootstrap inference algorithm, using the averaged nonlinear stochastic gradient algorithm, is proposed to make statistical inference about the geometric median from massive datasets. The method is computationally fast and memory-friendly, and it is easy to update as new data is received sequentially. The validity of the proposed online bootstrap inference is theoretically justified. Simulation studies under a variety of scenarios are conducted to demonstrate its effectiveness and efficiency in terms of computation speed and memory usage. Additionally, the online inference procedure is applied to a large publicly available dataset for skin segmentation.
The fast development of Internet of Things (IoT) computing and technologies has prompted a decentralization of Cloud-based systems. Indeed, sending all the information from IoT devices directly to the Cloud is not a feasible option for many applications with demanding requirements on real-time response, low latency, energy-aware processing and security. Such decentralization has led in a few years to the proliferation of new computing layers between Cloud and IoT, known as the Edge computing layer, which comprises devices ranging from small computing devices (e.g. Raspberry Pi) to larger computing nodes such as Gateways, Road Side Units, Mini Clouds, MEC Servers, Fog nodes, etc. In this paper, we study the challenges of processing an IoT data stream in an Edge computing layer. Using a real-life car data stream as well as a real infrastructure built on a Raspberry Pi and a Node-Red server, we highlight the complexities of achieving the real-time requirements of applications based on IoT stream processing.
• Investigate challenges of IoT data stream processing at the edge computing layer.
• Analyze semantic data enrichment techniques.
• Use a real-life car data stream for semantic data enrichment and anomaly detection.
• Use a real infrastructure of a Raspberry Pi and a Node-Red server for system deployment.
• Highlight the complexity of meeting real-time requirements on IoT stream processing.
This paper investigates a non-orthogonal multiple access (NOMA)-aided mobile edge computing (MEC) network with multiple sources and one computing access point (CAP), in which NOMA technology is applied to transmit multi-source data streams to the CAP for computing. To measure the performance of the considered NOMA-aided MEC network, we first design the system cost as a linear weighting function of energy consumption and delay under the NOMA-aided MEC network. Moreover, we propose a deep Q network (DQN)-based offloading strategy to minimize the system cost by jointly optimizing the offloading ratio and transmission power allocation. Finally, we design experiments to demonstrate the effectiveness of the proposed strategy. Specifically, the designed strategy can decrease the system cost by about 15% compared with local computing when the number of sources is 5.
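The cost design named in the abstract, a linear weighting of energy consumption and delay, can be written down directly for a single source. All physical constants and the channel model below are illustrative placeholders, not values from the paper; a DQN agent would then search over the offloading ratio and transmit power to minimize this function.

```python
import math

def system_cost(ratio, p_tx, w=0.5):
    """Linearly weighted energy/delay cost for one source offloading a
    fraction `ratio` of its task at transmit power `p_tx` (watts, > 0
    whenever ratio > 0). Every constant here is an invented placeholder.
    """
    D = 1e6                       # task size in bits
    cpb = 1000.0                  # CPU cycles per bit
    f_local, f_mec = 1e9, 5e9     # local and MEC CPU frequencies (Hz)
    kappa = 1e-27                 # effective switched-capacitance coefficient
    bandwidth, noise = 1e6, 1e-9  # channel bandwidth (Hz) and noise power (W)
    rate = bandwidth * math.log2(1 + p_tx / noise)   # Shannon rate (bit/s)

    t_local = (1 - ratio) * D * cpb / f_local        # local computing time
    t_tx = ratio * D / rate                          # upload time
    t_mec = ratio * D * cpb / f_mec                  # MEC computing time
    delay = max(t_local, t_tx + t_mec)  # local and offloaded parts in parallel

    e_local = kappa * f_local ** 2 * (1 - ratio) * D * cpb  # CPU energy
    e_tx = p_tx * t_tx                                      # radio energy
    return w * (e_local + e_tx) + (1 - w) * delay
```

With these placeholder numbers, offloading half the task at modest power already undercuts pure local computing, which is the kind of gap the learned offloading policy exploits.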
Due to its low latency and energy consumption, edge computing technology is essential in processing multi-source data streams from intelligent devices. This article investigates a mobile edge computing network aided by wireless power transfer (WPT) for multi-source data streams, where the wireless channel parameters and the characteristics of the data stream vary. Moreover, we consider a practical communication scenario in which devices with limited battery capacity cannot support the execution and transmission of computational data streams under a given latency. Thus, WPT technology is adopted for the considered network to enable the devices to harvest energy from the power beacon. Further, by considering the device’s energy consumption and latency constraints, we formulate an optimization problem under energy constraints. To solve this problem, we design a customized particle swarm optimization-based algorithm, which aims at minimizing the latency of the device processing the computational data stream by jointly optimizing the charging and offloading strategies. Simulation results illustrate that the proposed method outperforms other benchmark schemes in minimizing latency, which shows the proposed method’s superiority in processing multi-source data streams.
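The particle swarm component can be illustrated with the textbook update rule over a unit hypercube of decision variables (e.g. a normalized charging duration and an offloading ratio). The paper's customizations and its actual latency model are not reproduced here; the objective passed in below is a stand-in supplied by the caller.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=60, seed=0,
                 w=0.7, c1=1.5, c2=1.5):
    """Textbook particle swarm optimization over [0, 1]^dim.
    Each particle's velocity blends inertia (w), attraction to its own
    best position (c1), and attraction to the swarm's best (c2).
    Returns the best position found and its objective value."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # keep decisions (charging/offloading fractions) feasible
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Because PSO only evaluates the objective, the same loop works whether the latency model is a closed-form expression or a full network simulation, which is why it suits the joint charging/offloading search.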
To support multi-source data streams generated by Internet of Things devices, edge computing emerges as a promising computing paradigm with low latency and high bandwidth compared to cloud computing. To enhance the performance of edge computing within limited communication and computation resources, we study a cloud-edge-end computing architecture, where one cloud server and multiple computational access points can collaboratively process the compute-intensive data streams that come from multiple sources. Moreover, a multi-source environment is considered, in which the wireless channel and the characteristics of the data stream are time-varying. To adapt to the dynamic network environment, we first formulate the optimization problem as a Markov decision process and then decompose it into a data stream offloading ratio assignment sub-problem and a resource allocation sub-problem. Meanwhile, in order to reduce the action space, we further design a novel approach that combines the proximal policy optimization (PPO) scheme with convex optimization, where PPO is used for the data stream offloading assignment, while the convex optimization is employed for the resource allocation. The simulation results in this work can support the development of multi-source data stream applications.