Data mining techniques are commonly used to construct models for identifying software modules that are most likely to contain faults. In doing so, an organization’s limited resources can be intelligently allocated with the goal of detecting and correcting the greatest number of faults. However, there are two characteristics of software quality datasets that can negatively impact the effectiveness of these models: class imbalance and class noise. Software quality datasets are, by their nature, imbalanced. That is, most of a software system’s faults can be found in a small percentage of software modules. Therefore, the number of fault-prone, fp, examples (program modules) in a software project dataset is much smaller than the number of not fault-prone, nfp, examples. Data sampling techniques attempt to alleviate the problem of class imbalance by altering a training dataset’s distribution. A program module contains class noise if it is incorrectly labeled. While several studies have been performed to evaluate data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) as well as 7 different data sampling techniques. We identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
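The simplest family of data sampling techniques the abstract refers to alters the class distribution by discarding majority-class (nfp) examples until the classes are balanced. The sketch below is a generic illustration of random undersampling; the function name and interface are illustrative, not the paper's own procedure.

```python
import random

def random_undersample(modules, labels, seed=0):
    """Balance a binary dataset by discarding majority-class examples.

    A minimal sketch of random undersampling (illustrative only).
    `labels` holds "fp" (fault-prone, minority) or "nfp"
    (not fault-prone, majority).
    """
    rng = random.Random(seed)
    fp = [m for m, y in zip(modules, labels) if y == "fp"]
    nfp = [m for m, y in zip(modules, labels) if y == "nfp"]
    # Keep every minority example; sample an equal number of majority ones.
    kept = rng.sample(nfp, len(fp))
    pairs = [(m, "fp") for m in fp] + [(m, "nfp") for m in kept]
    rng.shuffle(pairs)
    xs, ys = zip(*pairs)
    return list(xs), list(ys)
```

Random oversampling works in the opposite direction (duplicating fp examples); more elaborate techniques synthesize new minority examples instead of copying them.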
Binary classification deals with identifying whether elements belong to one of two possible categories. Various metrics exist to evaluate the performance of such classification systems. It is important to study and contrast these metrics to find the best one for assessing a particular system. Despite extensive research in this field, a systematic comparison of these evaluation metrics remains largely unaddressed. The performance of a classifier is usually evaluated through the confusion matrix, a table including the count of accurate and inaccurate predictions for each category. To judge whether one classifier is better than another, examining variations in the confusion matrix is necessary. However, no agreed-upon method exists for this analysis. This is crucial because different metrics may interpret and rate two confusion matrices differently. We introduce the Worthiness Benchmark (γ), a new concept for characterizing the principles by which performance metrics rank classifiers. In particular, the Worthiness Benchmark is useful to assess how a metric evaluates the superiority between two classifiers by analyzing differences in their confusion matrices. Through this new concept, we are able to deal with the main challenge of selecting the best metric to evaluate a classifier. We then perform a γ-analysis on several binary classification metrics to outline the specific benchmarks these metrics follow when comparing different classifiers.
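The claim that different metrics may rank the same pair of confusion matrices differently is easy to demonstrate concretely. The example below (a hypothetical pair of confusion matrices, not taken from the paper, and without reproducing the γ construction itself) shows accuracy and F1 disagreeing about which of two classifiers is better.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions over all instances."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn, tn):
    """Harmonic mean of precision and recall; defined as 0 when tp == 0."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Two hypothetical confusion matrices over 100 instances (10 positives):
# classifier A predicts everything negative; classifier B finds every
# positive at the cost of 20 false alarms.
A = dict(tp=0, fp=0, fn=10, tn=90)
B = dict(tp=10, fp=20, fn=0, tn=70)
```

Here accuracy(A) = 0.90 exceeds accuracy(B) = 0.80, yet F1 scores B (0.50) above A (0.00): the two metrics impose opposite rankings on the same pair of matrices, which is precisely the ambiguity a framework like the Worthiness Benchmark aims to characterize.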
In numerous binary classification tasks, the two groups of instances are not equally represented, which often implies that the training data lack sufficient information to model the minority class correctly. Furthermore, many traditional classification models make arbitrarily overconfident predictions outside the range of the training data. These issues severely impact the deployment and usefulness of these models in real life. In this paper, we propose the boundary regularizing out-of-distribution (BROOD) sampler, which adds artificial data points on the edge of the training data. By exploiting these artificial samples, we are able to regularize the decision surface of discriminative machine learning models and make more prudent predictions. In many applications, it is also crucial to correctly classify many positive instances within a limited pool of instances that can be investigated with the available resources. By smartly assigning predetermined nonuniform class probabilities outside the training data, we can emphasize certain data regions and improve classifier performance on various relevant classification metrics. The good performance of the proposed methodology is illustrated in a case study that consists of both benchmark balanced and imbalanced classification data sets.
A system for predicting apparent bidirectional permeability (Papp) across Caco-2 cells of diverse chemicals has been reported. The present study aimed to investigate the relationship between in silico-generated Papp (from the apical to the basal side, Papp A-to-B) for 301 substances with diverse structures and a binary classification of the reported roles of efflux P-glycoprotein or breast cancer resistant protein. The in silico log(Papp A-to-B/Papp B-to-A) values of 70 substances with reported active efflux and 231 substances with no reported active efflux were significantly different (p < 0.01). The probabilities of active efflux transport estimated by trivariate analysis with logMW, logDpH6.0, and logDpH7.4 for the 70 active-efflux-positive compounds were higher than those of the other 231 substances (p < 0.01); the area under the corresponding receiver operating characteristic (ROC) curve was 0.81. Further probability values estimated using a machine learning algorithm with 30 chemical descriptors as inputs yielded an area under the ROC curve of 0.79. Using a secondary set of 52 efflux-positive and 48 efflux-negative medicines, the final trivariate-generated probabilities resulted in no significant differences between these binary groups (p = 0.09); however, the final machine learning model demonstrated a good area under the ROC curve of 0.79. Consequently, a combination of the previously established system for generating the permeability coefficients across intestinal monolayers (a continuous variable) and the currently proposed system for predicting the roles of additional active efflux (a binary classification) could prove useful; high accuracy was achieved by applying machine learning using in silico-generated chemical descriptors.
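The abstract reports several areas under ROC curves (0.79–0.81). A useful way to interpret such a value: the AUC equals the probability that a randomly chosen positive instance (here, an efflux-positive compound) receives a higher predicted probability than a randomly chosen negative one, with ties counting half. The generic sketch below computes AUC directly from that definition; the scores are illustrative, not the study's data.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via its probabilistic definition:
    P(random positive outscores random negative), ties counted 0.5.
    Equivalent to the normalized Mann-Whitney U statistic.
    O(n*m) pairwise version, fine for small illustrative inputs.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A classifier that assigns identical scores to both classes scores 0.5 (chance level), while a perfect separation scores 1.0; the reported 0.79–0.81 sits well above chance.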
This paper investigates the feature subset selection problem for binary classification using a logistic regression model. We develop a modified discrete particle swarm optimization (PSO) algorithm for the feature subset selection problem. This approach embodies an adaptive feature selection procedure which dynamically accounts for the relevance and dependence of the features included in the feature subset. We compare the proposed methodology with the tabu search and scatter search algorithms using publicly available datasets. The results show that the proposed discrete PSO algorithm is competitive in terms of both classification accuracy and computational performance.
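For readers unfamiliar with discrete PSO, the sketch below shows the classic sigmoid-based binary PSO of Kennedy and Eberhart, in which each particle is a 0/1 feature mask and velocities are squashed into bit-flip probabilities. This is a generic baseline, not the paper's modified adaptive variant; the fitness function, constants, and interface are all assumptions for illustration.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=8, iters=30, seed=0):
    """Minimal binary PSO (sigmoid variant) maximizing `fitness`,
    which maps a 0/1 feature mask to a score. Illustrative sketch only."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]               # personal bests
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]   # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] += (2.0 * r1 * (pbest[i][d] - pos[i][d])
                              + 2.0 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, vel[i][d]))  # velocity clamp
                # Sigmoid turns velocity into the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit
```

In a real feature-selection setting, `fitness` would train the logistic regression on the selected columns and return cross-validated accuracy (possibly penalized by subset size).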
To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified, preferred measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics adopted in binary classification tasks. However, these statistical measures can dangerously show overoptimistic, inflated results, especially on imbalanced datasets. The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining its mathematical properties and then demonstrating the advantages of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
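The MCC is computed directly from the four confusion-matrix counts. The standard formula, and a small numerical illustration of the imbalance argument above (the example counts are hypothetical, not from the article), are sketched below.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient:
    (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)).
    Returns 0 when any marginal is empty (denominator zero)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)
```

On a 90/10 imbalanced dataset, a degenerate classifier that predicts everything negative (tp=0, fp=0, fn=10, tn=90) reaches 0.90 accuracy while its MCC is 0, exposing the overoptimism; a genuinely good balanced prediction such as (45, 5, 5, 45) scores high on both.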
The spread of Coronavirus Disease 2019 (COVID-19) in Indonesia is still relatively high and has not shown a significant decrease. One of the main reasons is the lack of supervision of the implementation of health protocols, such as wearing masks in daily activities. Recently, state-of-the-art algorithms were introduced to automate face mask detection; specifically, researchers have developed various architectures for mask detection based on computer vision methods. This paper aims to evaluate well-known architectures, namely ResNet50, VGG11, InceptionV3, EfficientNetB4, and YOLO (You Only Look Once), to recommend the best approach in this specific field. Using the MaskedFace-Net dataset, the experimental results showed that the EfficientNetB4 architecture achieved the best accuracy at 95.77%, compared to 93.40% for YOLOv4, 87.30% for InceptionV3, 86.35% for YOLOv3, 84.41% for ResNet50, 84.38% for VGG11, and 78.75% for YOLOv2. It should be noted that, particularly for YOLO, the model was trained using a collection of MaskedFace-Net images that had been pre-processed and labelled for the task. Thanks to transfer learning, the model was initially able to train faster with pre-trained weights from the COCO dataset, resulting in a robust set of features expected for face mask detection and classification.
Binary classification is among the most challenging problems in machine learning. One promising technique for solving this problem is genetic programming (GP). GP is an Evolutionary Algorithm (EA) used to solve problems that humans do not know how to solve directly. The objective of this research is to demonstrate the use of genetic programming on this type of problem, for which other types of techniques are typically used, e.g., regression and artificial neural networks. Genetic programming presents an advantage compared to those techniques: it does not need an a priori definition of its structure. The algorithm evolves automatically until it finds a model that best fits a set of training data. Feature engineering was considered to improve accuracy; in this research, feature transformation and feature creation were implemented. Thus, genetic programming can be considered an alternative option for the development of intelligent systems, mainly in the pattern recognition field.
Antimicrobial peptides have emerged as a potential alternative to combat the growing threat of antimicrobial resistance. Owing to the large number of possible combinations of the twenty naturally occurring amino acids, it is extremely resource intensive to experimentally identify whether a given peptide has the desired therapeutic properties. To expedite the screening of therapeutic peptides, we propose a classification framework that can simultaneously predict the antibacterial activity, hemotoxicity, and efficacy against the three most common pathogens, i.e., Staphylococcus aureus, Escherichia coli, and Pseudomonas aeruginosa, for any given peptide. The proposed framework uses the support vector machine algorithm with amino acid compositions, sequence analysis, and physicochemical properties as features to develop three binary classifiers. Our models resulted in accuracies of 97.3%, 86.2%, and 84.1% for antibacterial activity, combined efficacy against all three pathogens, and hemotoxicity, respectively. An explainable machine learning algorithm was applied to each model to elucidate meaningful insights. It was evident that physicochemical properties, along with the occurrence of certain amino acids, play the most important role in determining the antibacterial activity, efficacy, and hemolytic activity of peptides. The entire framework is made freely accessible in the form of a web tool, which will further aid in the rapid screening of antibacterial peptides with high therapeutic potential.
• Framework to screen antibacterial peptides with high efficacy and low hemotoxicity.
• Predicts efficacy against three pathogens – S. aureus, E. coli, and P. aeruginosa.
• Insights extracted from proposed models are in agreement with experimental findings.
• Physicochemical properties play a crucial role in determining therapeutic potential.
• Amino acid sequence affects efficacy more than antibacterial or hemolytic activity.
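Of the feature families the abstract lists, amino acid composition is the simplest: each peptide is mapped to the fraction of each of the 20 standard residues, giving a fixed-length 20-dimensional vector suitable as SVM input. The sketch below is a minimal illustration of that encoding, not the paper's full feature pipeline.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(sequence):
    """Fraction of each standard amino acid in a peptide sequence,
    producing a fixed-length 20-dimensional feature vector
    regardless of peptide length."""
    sequence = sequence.upper()
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]
```

Because every peptide maps to the same 20 dimensions, peptides of different lengths become directly comparable; physicochemical descriptors (charge, hydrophobicity, etc.) would be concatenated to this vector in a fuller feature set.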