Recently developed imbalanced data classification models mainly focus on the majority class samples. In addition, several whale optimization-based feature reduction models are inefficient for high-dimensional data, readily fall into local optima, and struggle to reach a globally optimal feature subset because of high computational cost. To overcome these drawbacks, in this study, a two-stage feature reduction model using fuzzy neighborhood rough sets (FNRS) and the binary whale optimization algorithm (BWOA) was developed for imbalanced data classification. First, to characterize the sample fuzziness of mixed data, a similarity measure between samples based on fuzzy neighborhoods was defined to construct the similarity matrix and fuzzy neighborhood granules, and a new FNRS model was presented by constructing lower and upper approximations. Considering the uneven class distribution, a boundary-based feature significance measure was developed to minimize the influence of boundary-region uncertainty for imbalanced data. Second, fuzzy neighborhood roughness and decision entropy were investigated based on FNRS, and by integrating these measures, fuzzy neighborhood decision entropy was proposed to evaluate the fuzziness and roughness of fuzzy neighborhoods for imbalanced data. External and internal significance metrics were proposed to obtain the preselected feature subset in the first stage. Third, in the second stage, a new control factor was defined to control the positions of whales, and a novel fitness function was developed to evaluate the selected feature subset on imbalanced datasets. Thereafter, the immune regulation strategy of artificial immune systems was introduced into the BWOA to design a mixed selection probability for dividing the whale population. Two local interference strategies were applied to adjust the whale positions and prevent the BWOA from being trapped in local optima. Thus, an optimal feature subset was achieved by iterating the BWOA. Finally, a two-stage feature reduction algorithm was designed to handle imbalanced and high-dimensional data, where the particle swarm optimization (PSO) algorithm was employed to determine the optimized parameters of the two-stage algorithm. Experiments conducted on 22 datasets revealed that the proposed algorithm is efficient for both two-class and multiclass datasets.
•The boundary-based feature significance measure was developed to minimize the influence of the uncertainty in the boundary region.
•Fuzzy neighborhood decision entropy was proposed to evaluate the fuzziness and roughness of imbalanced data.
•A fitness function was developed to evaluate the selected feature subset from imbalanced data.
•A two-stage feature reduction algorithm was designed to handle imbalanced and high-dimensional data.
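For readers unfamiliar with fuzzy neighborhood rough sets, the sketch below illustrates only the general idea of the first stage: building a sample similarity matrix with a neighborhood radius and deriving crude lower/upper approximations of a decision class from the resulting granules. The linear similarity decay, the radius parameter delta, and the function names are assumptions for illustration; they are not the paper's exact similarity measure, granule definition, decision entropy, or BWOA operators.

import numpy as np

def fuzzy_similarity_matrix(X, delta=0.3):
    """Pairwise fuzzy-neighborhood similarity for numeric features.
    Similarity decays linearly with normalized distance and is truncated
    to 0 outside the neighborhood radius `delta` (an assumed scheme)."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span                       # features scaled to [0, 1]
    dist = np.abs(Xn[:, None, :] - Xn[None, :, :]).mean(axis=2)
    return np.where(dist <= delta, 1.0 - dist, 0.0)

def lower_upper_approximation(sim, labels, target_class):
    """Crude lower/upper approximations of one decision class built from
    the neighborhood granules (rows of `sim` with nonzero similarity)."""
    in_class = labels == target_class
    granules = sim > 0
    # Lower: the whole granule lies inside the class; upper: it touches the class.
    lower = np.array([g[~in_class].sum() == 0 for g in granules]) & in_class
    upper = np.array([g[in_class].any() for g in granules])
    return lower, upper

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((10, 4))
    y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
    low, up = lower_upper_approximation(fuzzy_similarity_matrix(X), y, 1)
    print("boundary region size:", int(up.sum() - low.sum()))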
The quality of decisions in the renewable energy sector is only as good as the quality of the available data. This makes data quality a cornerstone of renewable energy system planning, design, operation, and assessment. Unfortunately, such data is not always available and is usually cost-prohibitive. One solution to this issue is using the data of a few representative days (RDs) instead of the full year, which reduces the cost of both the data itself and the system simulations. A new framework is proposed in this study to identify these RDs based on meteorological features. The framework represents an end-to-end pipeline, starting with measurements and proceeding through data curation, feature extraction, clustering, and representative year construction. The analysis showed that increasing the number of RDs indeed improves the representativeness of the reconstructed year, with disagreement indices as low as 1.041. Including system-irrelevant meteorological parameters was found to increase the disagreement index between the original data and the reconstructed year from 0.206 to 0.989. The proposed autoencoder feature extraction approach outperformed the conventional statistical one, especially for shallow autoencoders, where the disagreement index was reduced from 1.564 to 1.001. Finally, a brief case study of a standard solar water heating system was performed using TRNSYS v18 software to verify the proposed approach, where the absolute percentage deviation in the annual solar fraction was found to be only 0.278%. This study takes the first steps towards offering decision-makers, designers, and modelers a framework that provides high-quality and high-resolution data compatible with rising measurement and simulation costs.
•A framework is proposed for selecting representative meteorological days (RDs).
•Shallow autoencoders outperform statistical feature extraction.
•Disagreement indices between recorded and constructed years are as low as 1.041.
•Using six RDs is a good tradeoff between cost and representativeness.
•Only meteorological parameters relevant to the targeted systems should be considered.
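As a sketch of the clustering stage of such a pipeline, the code below selects representative days by k-means clustering of daily feature vectors and returns the most central day of each cluster together with the share of the year it represents. The feature extraction (statistical or autoencoder-based), the disagreement index, and the representative-year reconstruction are not reproduced; the function name and the choice of six clusters are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def select_representative_days(daily_features, n_days=6, seed=0):
    """Cluster the (n_days_in_year, n_features) matrix of daily feature
    vectors and return, per cluster, the index of the day closest to the
    centroid plus the cluster's weight (fraction of the year)."""
    Xs = StandardScaler().fit_transform(daily_features)
    km = KMeans(n_clusters=n_days, n_init=10, random_state=seed).fit(Xs)
    rds, weights = [], []
    for k in range(n_days):
        members = np.where(km.labels_ == k)[0]
        dists = np.linalg.norm(Xs[members] - km.cluster_centers_[k], axis=1)
        rds.append(int(members[np.argmin(dists)]))       # most central day
        weights.append(len(members) / len(Xs))           # share of the year
    return rds, weights

if __name__ == "__main__":
    daily = np.random.rand(365, 8)       # e.g., 8 meteorological features per day
    days, w = select_representative_days(daily, n_days=6)
    print("representative days:", days)
    print("weights:", [round(x, 3) for x in w])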
Among the various tasks of feature reduction, how to search for qualified features and then construct a corresponding reduct that satisfies a given constraint is a hot topic. A constraint is a pre-defined condition in feature reduction; it governs two crucial aspects: 1) when the search process can be terminated to construct a reduct; 2) whether a feature or set of features should be added to the reduct pool. Obviously, such a task of feature reduction is constraint-dependent. However, some limitations naturally emerge: 1) constraint-independent features are ignored; 2) a fixed constraint hinders diverse evaluations of features; 3) features selected based on a single constraint are not robust to data perturbation. Therefore, an Ensembler Mixed with Pareto Optimality (EmPo) is developed in this study. Firstly, the principle of Restricted Pareto Optimality is proposed to identify constraint-independent features before performing constraint-related searching, which yields a two-stage strategy for feature reduction. Naturally, diverse evaluations of features are achieved in such a process. Secondly, data perturbation with respect to either samples or features is employed to obtain multiple reducts. The objective is to further improve the stability of the classification results based on those multiple reducts. It should also be emphasized that EmPo is a general framework, and most existing approaches to feature reduction can be embedded into it to further improve their performance. Finally, the effectiveness of our EmPo is validated on 20 datasets under 4 different ratios of label noise.
•Propose a framework for selecting qualified features.
•Investigate the shortcomings of some existing methods.
•The quality of features is related to evaluation conditions.
•Ensemble strategy is useful in selecting features.
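As a rough illustration of evaluating features under more than one condition, the sketch below marks features as non-dominated (Pareto-optimal) with respect to two simple criteria: relevance to the label (to maximize) and redundancy with the other features (to minimize). Both criteria and the helper name are illustrative stand-ins; they are not the paper's Restricted Pareto Optimality, its constraint-related search, or its ensemble stage.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def pareto_optimal_features(X, y):
    """Return indices of features that no other feature dominates, i.e. no
    other feature has higher relevance and lower redundancy at the same time
    (illustrative criteria only)."""
    relevance = mutual_info_classif(X, y, random_state=0)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    redundancy = corr.mean(axis=1)

    n = X.shape[1]
    non_dominated = []
    for i in range(n):
        dominated = any(
            relevance[j] >= relevance[i] and redundancy[j] <= redundancy[i]
            and (relevance[j] > relevance[i] or redundancy[j] < redundancy[i])
            for j in range(n) if j != i
        )
        if not dominated:
            non_dominated.append(i)
    return non_dominated

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    print("Pareto-optimal feature indices:", pareto_optimal_features(X, y))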
One of the most sophisticated cyber-attack techniques used by cybercriminals is creating and spreading malicious domain names or malicious URLs through email, messages, pop-ups, etc. Malicious URLs are web pages that target internet users in order to spread malware, viruses, worms, etc. once the user visits them. The main intention of the attack is to steal the victim's information or user credentials, or to install malware on the victim's system. It is therefore necessary to adopt a system that detects malicious URLs and prevents such attacks. Researchers have suggested numerous methods, but machine learning based detection performs better than other methods. This paper presents a lightweight method that uses only lexical features of the URL. The results show that the Random Forest classifier performs better than the other classifiers in terms of accuracy.
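A minimal sketch of this kind of pipeline follows, with an assumed (not the paper's) set of lexical URL features and a toy labeled sample; a real experiment would use a labeled URL corpus.

import re
from urllib.parse import urlparse

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def lexical_features(url):
    """A few common lexical URL features (illustrative, not the paper's list)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_digits": sum(c.isdigit() for c in url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_special": len(re.findall(r"[@?=&%_]", url)),
        "has_ip_host": int(bool(re.fullmatch(r"[\d.]+", host))),
    }

# Hypothetical toy data: 0 = benign, 1 = malicious.
urls = ["http://example.com/index.html", "http://192.168.1.10/login.php?id=1",
        "https://docs.python.org/3/", "http://free-prize-now.xyz/claim?u=%41"]
labels = [0, 1, 0, 1]

X = pd.DataFrame([lexical_features(u) for u in urls])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))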
Bluetooth (BT) mesh networks are used in several ad hoc network communication scenarios, including natural disasters, calamities, war zones, etc. However, there are situations where Bluetooth mesh networks are misused by anti-social elements during riots and similar events. To date, there is no research article that demonstrates detection of BT mesh networks using machine learning tools. Therefore, there is a need to build novel methods for detecting the very presence of Bluetooth mesh networks to aid law enforcement agencies. This paper contributes a new dataset to aid machine learning tools in detecting the presence of Bluetooth mesh networks. The dataset is extracted from a simulated Bluetooth mesh network with 52 BT network traffic attributes, using the Bridgefy application. Further, the dataset is optimized by identifying prime features using label encoding, one-hot encoding, and correlation analysis, and is evaluated with various Machine Learning (ML) tools. An accuracy of 99.6% is achieved after optimizing the feature set.
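The feature optimization step described above can be sketched as follows: categorical traffic attributes are one-hot encoded and one feature from every highly correlated pair is dropped. The column names, the 0.95 threshold, and the toy table are hypothetical and stand in for the actual 52-attribute Bridgefy-based dataset.

import numpy as np
import pandas as pd

def prune_correlated_features(df, target_col, threshold=0.95):
    """One-hot encode categorical attributes and drop one feature from every
    pair whose absolute correlation exceeds `threshold`."""
    y = df[target_col]
    X = pd.get_dummies(df.drop(columns=[target_col]), dtype=float)
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop), y, to_drop

if __name__ == "__main__":
    # Hypothetical miniature traffic table.
    df = pd.DataFrame({
        "pkt_len": [60, 60, 120, 33, 120, 60],
        "pkt_len_bytes": [60, 60, 120, 33, 120, 60],   # duplicate column -> dropped
        "channel": ["adv", "adv", "data", "data", "adv", "data"],
        "is_mesh": [1, 1, 0, 0, 1, 0],
    })
    X, y, dropped = prune_correlated_features(df, target_col="is_mesh")
    print("dropped features:", dropped)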
This paper aims to address two significant challenges of Deep Learning (DL) model generation: the high computational cost involved during the training phase and the interpretability of high-dimensional data. The computational cost is curbed using a centralized as well as distributed matrix factorization technique known as CUR decomposition. CUR decomposition has the advantage of reducing the dimensions of the dataset without data transformation and with minimal information loss. Although Singular Value Decomposition (SVD) is the matrix decomposition technique with the least reconstruction error, it transforms the data, and hence the original data cannot be interpreted from the decomposed matrices when applying DL techniques. Therefore, CUR decomposition is leveraged to reduce the number of features in high-dimensional data while preserving the essential information. Extensive experimental analysis is conducted to evaluate and compare the performance of the CUR and SVD techniques based on reconstruction error and time complexity in both centralized and distributed settings. The results show that CUR outperforms SVD in terms of computational efficiency, especially in the distributed setting, where the decomposition time for CUR was a fraction of that for SVD. At the same time, DL models generated from the reduced dataset exhibit accuracy comparable to that of the original high-dimensional dataset as well as the reduced dataset obtained through SVD. This work also enhances the accessibility of the CUR algorithm: an application is developed that enables users to execute the algorithm on any dataset. The user-friendly interface of the application facilitates uploading datasets, configuring the desired decomposition rank, and executing the algorithm with a single click.
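For reference, a textbook leverage-score CUR construction is sketched below. It keeps actual columns and rows of the data matrix, which is where the interpretability advantage over SVD comes from, but it is not the paper's specific distributed implementation, and the rank and sample sizes are illustrative.

import numpy as np

def cur_decomposition(A, k, c, r, seed=0):
    """Leverage-score CUR sketch: sample c columns and r rows of A with
    probabilities proportional to their rank-k leverage scores, then compute
    U = C^+ A R^+ so that C @ U @ R approximates A."""
    rng = np.random.default_rng(seed)
    U_svd, _, Vt = np.linalg.svd(A, full_matrices=False)
    col_p = (Vt[:k] ** 2).sum(axis=0)
    col_p /= col_p.sum()
    row_p = (U_svd[:, :k] ** 2).sum(axis=1)
    row_p /= row_p.sum()

    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)

    C, R = A[:, cols], A[rows, :]            # actual columns/rows of A are kept
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows

if __name__ == "__main__":
    A = np.random.rand(200, 50)
    C, U, R, cols, rows = cur_decomposition(A, k=10, c=20, r=40)
    err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
    print("relative reconstruction error:", round(err, 4))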
An arrhythmia is a disorder in which the heart beats out of its usual rhythm, that is, the heart beats irregularly. The rate of a person's heartbeat is determined by electrical signals flowing through the heart. These signals are produced by the Sino Atrial (SA) node in the heart, which acts as a natural pacemaker. If these electrical signals are irregular, they lead to heart problems. In recent years, the number of individuals diagnosed with this condition has increased. In this paper, we propose a hybrid model for feature extraction and classification. Many deep learning and machine learning algorithms were evaluated; among them, a two-layered BiLSTM gives the best accuracy for feature extraction, and Random Forest gives the best accuracy for classification. A feature reduction technique is also used to minimize training and testing time. The proposed hybrid model, which comprises BiLSTM and Random Forest, gives the best accuracy of 98.84% and shows sensitivity and specificity above 98% for all classes. The proposed hybrid model with the feature reduction technique produces an accuracy of 98.46% with a lower prediction time of 0.089 s. The hybrid BiLSTM + Random Forest + PCA model yields a higher average accuracy of 98.30% than the LSTM + Random Forest + PCA hybrid model under 10-fold cross-validation, which further confirms the consistency of the proposed models.
•Hybrid model for feature extraction & classification of arrhythmia using ECG signals.
•ECG is denoised using DWT & segmented, given as input to DL for feature extraction.
•These features are reduced using PCA and LDA and passed to ML for classification.
•BiLSTM + PCA + Random Forest gives better performance.
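The reduction-plus-classification stage can be sketched as below, with randomly generated stand-ins for the BiLSTM-extracted beat features and class labels; the PCA variance threshold, forest size, and feature dimension are assumptions, not the paper's settings.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for BiLSTM-extracted heartbeat features; in the paper
# these come from a two-layer BiLSTM applied to denoised, segmented ECG beats.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))        # 128-dim deep features per beat
labels = rng.integers(0, 5, size=1000)         # 5 arrhythmia classes (toy labels)

# Reduce the deep features with PCA, then classify with Random Forest.
model = make_pipeline(
    PCA(n_components=0.95),                    # keep 95% of the variance
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(model, features, labels, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")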
Rapid classification of tumors detected in medical images is of great importance for the early diagnosis of disease. In this paper, a new liver and brain tumor classification method is proposed that combines the power of convolutional neural networks (CNN) in feature extraction, the power of the discrete wavelet transform (DWT) in signal processing, and the power of long short-term memory (LSTM) in signal classification. The CNN-DWT-LSTM method is proposed to classify computed tomography (CT) images of livers with tumors and magnetic resonance (MR) images of brains with tumors. The proposed method classifies liver tumor images as benign or malignant and classifies brain tumor images as meningioma, glioma, or pituitary. In the hybrid CNN-DWT-LSTM method, the feature vector of the images is obtained from the pre-trained AlexNet CNN architecture. The feature vector is reduced but strengthened by applying the single-level one-dimensional discrete wavelet transform (1-D DWT), and it is classified by training an LSTM network. Within the scope of the study, images of 56 benign and 56 malignant liver tumors obtained from Fırat University Research Hospital were used, along with a publicly available brain tumor dataset. The experimental results show that the proposed method had higher performance than classifiers such as K-nearest neighbors (KNN) and support vector machine (SVM). Using the CNN-DWT-LSTM hybrid method, an accuracy rate of 99.1% was achieved in liver tumor classification and an accuracy rate of 98.6% in brain tumor classification. Two different datasets were used to demonstrate the performance of the proposed method. Performance measurements show that the proposed method has a satisfactory accuracy rate in classifying liver and brain tumors.
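The DWT-based reduction step can be sketched as follows: a single-level 1-D DWT halves the CNN feature vector by keeping only the approximation coefficients. The wavelet choice and the feature-vector size are assumptions for illustration.

import numpy as np
import pywt

def dwt_reduce(feature_vector, wavelet="haar"):
    """Halve a deep-feature vector with a single-level 1-D DWT by keeping only
    the approximation coefficients (the detail coefficients are discarded)."""
    approx, _detail = pywt.dwt(feature_vector, wavelet)
    return approx

if __name__ == "__main__":
    cnn_features = np.random.rand(4096)           # e.g., an AlexNet fc-layer-sized vector
    reduced = dwt_reduce(cnn_features)
    print(len(cnn_features), "->", len(reduced))  # 4096 -> 2048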