Leveraging LightGBM for Categorical Big Data Hancock, John; Khoshgoftaar, Taghi M.
2021 IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService),
2021-Aug.
Conference Proceeding
LightGBM is a popular Gradient Boosted Decision Tree implementation for classification and regression tasks. Our contribution is to answer a research question regarding LightGBM. We would like to know which alternative yields better performance for classifying highly imbalanced Big Data with high-cardinality categorical features: relying entirely on LightGBM's Exclusive Feature Bundling as a way to encode categorical features, or using LightGBM's built-in encoding for categorical features. Our study of LightGBM revealed two alternatives for a Big Data anomaly detection classification task. We may one-hot encode the data into a sparse representation and then rely entirely on LightGBM's Exclusive Feature Bundling to complete the encoding of the categorical features. Exclusive Feature Bundling is LightGBM's optimization technique that exploits sparsity in features to reduce the dimensionality of a dataset. On the other hand, because our data has categorical features, it is also a candidate for LightGBM's built-in encoding technique for categorical features. Since a survey of related work gave no clear indication of which direction to take - using Exclusive Feature Bundling or using LightGBM's built-in encoding for categorical features - we experiment with both options to determine the better one for highly imbalanced Big Data. We show that LightGBM's built-in encoding is best in a statistically significant sense. Our work is important because it fills a gap in the LightGBM-related literature on how best to handle categorical features in imbalanced Big Data with high-cardinality categorical features.
Medicare Fraud Detection using CatBoost Hancock, John; Khoshgoftaar, Taghi M.
2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI),
2020-Aug.
Conference Proceeding
In this study we investigate the performance of CatBoost in the task of identifying Medicare fraud. The Medicare claims data we use as input for CatBoost contain a number of categorical features. Some of these features, such as the procedure code and provider zip code, have thousands of possible values. One contribution we make in this study is to show how we use CatBoost to eliminate some data pre-processing steps that authors of related works take. A second contribution is to show improvements in CatBoost's performance, in terms of Area Under the Receiver Operating Characteristic Curve (AUC), when we include another one of the categorical features (provider state) as input to CatBoost. We show that CatBoost attains better performance than XGBoost in the task of Medicare fraud detection with respect to the AUC metric. At a 99% confidence level (with p-value 0), our experiments show that XGBoost obtains a mean AUC value of 0.7615 while CatBoost obtains a mean AUC value of 0.7851, validating the significance of CatBoost's performance improvement over XGBoost. Moreover, when we include an additional categorical feature (healthcare provider state) in our data analysis, CatBoost yields a mean AUC value of 0.8902, which is also statistically significant at a 99% confidence level (with p-value 0). Our empirical evidence clearly indicates CatBoost is a better alternative to XGBoost for Medicare fraud detection, especially when dealing with categorical features.
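CatBoost's ability to consume raw high-cardinality categorical features (and so skip pre-processing steps) rests on its ordered target statistics. A simplified stdlib-only sketch of that idea follows; the example category values and labels are hypothetical stand-ins for a provider-state column and a fraud label, and CatBoost's production implementation uses multiple permutations and additional refinements.

```python
import random

def ordered_target_encode(categories, labels, prior=0.5, smoothing=1.0, seed=0):
    """Simplified sketch of ordered target statistics: each row is encoded
    using only the label statistics of rows that precede it in a random
    permutation, which avoids leaking a row's own label into its encoding."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior * smoothing) / (n + smoothing)
        sums[c] = s + labels[i]
        counts[c] = n + 1
    return encoded

# Hypothetical example: provider state as the feature, fraud flag as the label.
enc = ordered_target_encode(["NY", "FL", "NY", "NY", "FL"], [1, 0, 1, 1, 0])
print(enc)
```

Because each category level collapses to a single numeric statistic, a feature with thousands of levels (like a zip code) needs no one-hot expansion.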
Working with Machine Learning algorithms and Big Data, one may be tempted to skip the process of hyperparameter tuning, since algorithms generally take longer to train on larger datasets. Hyperparameter tuning is not something we get for free; one must allocate more computing time or resources to run more iterations of experiments with different hyperparameter settings to find their optimal values. For small datasets the extra resources needed may be negligible, but for larger datasets the resources necessary for tuning may be considerable. In this study, due to the size of the data we use, we find that experiments where we do hyperparameter tuning take many times longer to complete than experiments where we do no tuning. Here, our contribution is to provide evidence that the investment in computing resources for hyperparameter tuning pays off: we show that, for experiments involving highly imbalanced Big Data, we obtain better results when we incorporate hyperparameter tuning. With regard to the performance of CatBoost and LightGBM classifiers, we compare both default and tuned hyperparameter values. Furthermore, our experiments encompass different techniques for encoding high-cardinality categorical features. We find that in all cases, regardless of the classifier or the encoding technique for categorical features, classifiers with tuned hyperparameter values yield better results than those with default values. To the best of our knowledge, we are the first to conduct such a study on hyperparameter tuning to analyze the performance of LightGBM and CatBoost in classifying highly imbalanced Big Data.
The objective of this study was to develop a 10-meter-resolution land cover classification map using Sentinel-1 and Sentinel-2 satellite imagery along with the Dynamic World data set. A Gradient Boosted Decision Tree (GBDT) classification was employed with reference samples. To reduce the salt-and-pepper effect, a decision fusion approach was applied to the classification probabilities obtained from the GBDT and the CNN-based classification from Dynamic World. The model's accuracy increased when the CNN-based classification probabilities were used as inputs to the GBDT classification, while the salt-and-pepper effect was reduced when decision fusion was applied to the two sets of probabilities. This study suggests that the traditional land cover classification approach with reference sampling can still be effective when integrated with CNN-based classification probabilities for rapid land cover classification mapping.
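A minimal sketch of decision fusion over two sets of class probabilities, assuming a simple weighted average as the fusion rule (the paper's exact rule is not specified in this abstract):

```python
import numpy as np

def fuse_probabilities(p_gbdt, p_cnn, weight=0.5):
    # Weighted-average decision fusion: blend per-pixel class probabilities
    # from the GBDT and the CNN, then pick the class with the highest
    # fused probability.
    fused = weight * p_gbdt + (1.0 - weight) * p_cnn
    return fused.argmax(axis=-1)

# Two pixels, three hypothetical land cover classes.
p_gbdt = np.array([[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]])
p_cnn = np.array([[0.6, 0.3, 0.1], [0.4, 0.2, 0.4]])
labels = fuse_probabilities(p_gbdt, p_cnn)
print(labels)  # → [0 2]
```

Averaging before the argmax lets a confident CNN prediction smooth out isolated pixels where the GBDT disagrees with its neighborhood, which is how fusion reduces the salt-and-pepper effect.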
Correcting handwritten answer booklets manually can be a challenging task for professors, involving significant time and effort. To address this issue, the paper proposes an automated evaluation system that uses deep learning (DL) and natural language processing (NLP) techniques. The suggested approach begins by extracting raw text from image files using a proven GCP OCR text-extraction model, which is well known for its accuracy and efficiency. NLP methods such as BERT and GPT-3 are then used to extract keywords and summarize extensive answers. The suggested technique awards marks that are generally comparable to those given by manual evaluation. Furthermore, the article proposes a web tool that simplifies the evaluation procedure. The application outputs the raw text of the student's answers and the answer key, a synopsis of the student's response, and the marks earned based on the extracted keywords.
Federated Learning (FL) is a paradigm for jointly training machine learning algorithms in a decentralized manner, in which parties communicate with an aggregator to create and train a model without exposing the underlying raw data distributions of the local parties involved in the training process. Most research in FL has focused on Neural Network-based approaches; however, tree-based methods such as XGBoost have been underexplored in Federated Learning due to the challenges of overcoming the iterative and additive characteristics of the algorithm. Decision tree-based models, in particular XGBoost, can handle non-IID data, which is significant for algorithms used in Federated Learning frameworks, since the underlying data are decentralized and at risk of being non-IID by nature. In this paper, we investigate how Federated XGBoost is impacted by non-IID distributions by performing experiments on various sample size-based data skew scenarios. We conduct a set of extensive experiments across multiple datasets and data skew partitions. Our experimental results demonstrate that, despite the various partition ratios, the performance of the models remained consistent and was close to or equal to that of models trained in a centralized manner.
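The sample size-based skew scenarios the paper varies can be simulated by partitioning dataset indices into parties with unequal shares (quantity skew). A small sketch, with the 70/20/10 split chosen purely for illustration:

```python
import numpy as np

def skewed_partitions(n_samples, ratios, seed=0):
    """Split dataset indices into parties with unequal sample counts
    (quantity skew), one family of non-IID scenarios.
    `ratios` gives each party's share and should sum to 1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cuts = np.cumsum([round(r * n_samples) for r in ratios[:-1]])
    return np.split(idx, cuts)

parts = skewed_partitions(1000, [0.7, 0.2, 0.1])
print([len(p) for p in parts])  # → [700, 200, 100]
```

Each returned index array would be handed to one federated party; training federated XGBoost over such partitions is what lets the study compare against a centralized model trained on all 1000 samples at once.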
This article proposes an accurate Stacking Ensemble Classifier (SEC) for decentralized smart grid control stability prediction. The proposed SEC stacks two base classifiers, the eXtreme Gradient Boosting machine (XGBoost) and Categorical Boosting (CatBoost), with one meta-classifier, the Light Gradient Boosting Machine (LGBM). The proposed technique shows an excellent ability to accurately classify grid instabilities using a supervised learning approach. Extensive experiments have been conducted, demonstrating the superiority of the proposed SEC model over multiple benchmarks. In summary, this paper's main contributions are 1) proposing a new ensemble-learning-based model and 2) tailoring an efficient data-driven technique for grid stability detection and classification. Numerical results validate the proposed model's high effectiveness.
Readability is a crucial presentation attribute that web summarization algorithms consider while generating a query-biased web summary. Readability also forms an important component in real-time monitoring of commercial search-engine results, since the readability of web summaries impacts click-through behavior, as shown in recent studies, and thus impacts user satisfaction and advertising revenue.
The standard approach to computing readability is to first collect a corpus of random queries and their corresponding search result summaries; each summary is then judged by a human for its readability quality, and an average readability score is reported. This process is time-consuming and expensive. Moreover, the manual evaluation process cannot be used in the real-time summary generation process. In this paper we propose a machine learning approach to the problem. We use the corpus described above and extract summary features that we think may characterize readability. We then estimate a model (a gradient boosted decision tree) that predicts human judgments given the features. This model can then be used in real time to estimate the readability of new (unseen) web search summaries and can also be used in the summary generation process.
We present results on approximately 5000 editorial judgments collected over the course of a year and show examples where the model predicts the quality well and where it disagrees with human judgments. We compare the results of the model to previous models of readability, most notably Collins-Thompson-Callan, Fog and Flesch-Kincaid, and see that our model shows substantially better correlation with editorial judgments as measured by Pearson's correlation coefficient. The learning algorithm also provides us with the relative importance of the features used.
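The Fog and Flesch-Kincaid baselines the learned model is compared against are classic closed-form readability formulas; they can be stated directly (the counting of syllables and "complex" words is left abstract here, since implementations vary):

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # Flesch-Kincaid grade level:
    # 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    # Gunning Fog index:
    # 0.4*((words/sentences) + 100*(complex_words/words))
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))

# A 100-word, 5-sentence passage with 130 syllables and 10 complex words.
print(flesch_kincaid_grade(100, 5, 130))  # ≈ 7.55
print(gunning_fog(100, 5, 10))            # ≈ 12.0
```

Unlike these fixed linear formulas, the paper's gradient boosted decision tree learns the mapping from features to human judgments, which is why it can achieve higher Pearson correlation with the editorial scores.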
Breast cancer is one of the most feared diseases among women around the world. The high death rate from breast cancer can be reduced by early detection, which makes the disease far easier to treat. A collection of breast cancer datasets is used in the process of early detection, which is carried out to analyze the state of early-stage breast cancer patients. This research paper proposes machine learning methods, namely the Generalized Linear Model, Logistic Regression, and the Gradient Boosted Decision Tree, to enhance the classification performance on the Wisconsin Diagnostic Breast Cancer data. The diagnosis results in two classes of cancer decisions, malignant and benign, evaluated by the accuracy of the classification on test data. The results show that the Generalized Linear Model achieves an accuracy of 99.4%, which is higher than the accuracies reported in previous studies for classifying the Wisconsin Diagnostic Breast Cancer dataset.
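The malignant/benign classification setup on the Wisconsin Diagnostic Breast Cancer data can be reproduced in outline, since the dataset ships with scikit-learn. The sketch below uses logistic regression (one of the paper's three methods); the exact 99.4% result depends on the paper's evaluation protocol and is not claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wisconsin Diagnostic Breast Cancer: 569 samples, 30 features,
# two classes (malignant / benign).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scale the features, then fit logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

Accuracy on a held-out split is the same evaluation criterion the paper uses to rank its three methods.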