Local search is a specialization of web search that allows users to submit geographically constrained queries. One of the challenges for local search engines, however, is to uniquely understand and locate the geographical intent of the query. Geographical constraints (or location references) in a local search are often incomplete and thereby suffer from the referent ambiguity problem, where the same location name can refer to several different places. For instance, the term "Springfield" by itself can refer to 30 different cities in the USA. Previous approaches to location disambiguation have generally been hand-compiled heuristic models. In this paper, we examine a data-driven, machine learning approach to location disambiguation. Specifically, we separately train a Gradient Boosted Decision Tree (GBDT) model on thousands of desktop- and mobile-based local searches and compare its performance to our previous heuristic-based location disambiguation system (HLDS). The GBDT-based approach shows promising results, with statistically significant improvements over the HLDS approach: error rate reductions of about 9% and 22% for desktop-based and mobile-based local searches, respectively. Additionally, we examine the relative influence of various geographic and non-geographic features on the location disambiguation task. Interestingly, while the distance between the user and the intended location has been considered an important variable, its relative influence in the learned GBDT models is secondary to the popularity of the location.
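The paper's GBDT models are trained on proprietary search logs, so they cannot be reproduced here. As a minimal sketch of the underlying technique only, the following implements gradient boosting with one-split regression stumps under squared loss; all data and names are illustrative, not from the paper:

```python
import numpy as np

def fit_stump(X, r):
    """Best (feature, threshold, left value, right value) regression stump
    for the current residuals r, found by exhaustive search."""
    best = (np.inf, 0, 0.0, 0.0, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:      # candidate thresholds
            m = X[:, j] <= t
            lv, rv = r[m].mean(), r[~m].mean()
            sse = ((r[m] - lv) ** 2).sum() + ((r[~m] - rv) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]

def gbdt_fit(X, y, rounds=100, lr=0.1):
    """Gradient boosting with squared loss: each stump is fit to the
    residuals (negative gradients) of the current ensemble."""
    base, stumps = y.mean(), []
    pred = np.full(len(y), base)
    for _ in range(rounds):
        j, t, lv, rv = fit_stump(X, y - pred)
        pred += lr * np.where(X[:, j] <= t, lv, rv)
        stumps.append((j, t, lv, rv))
    return base, stumps, lr

def gbdt_predict(model, X):
    base, stumps, lr = model
    pred = np.full(len(X), base)
    for j, t, lv, rv in stumps:
        pred += lr * np.where(X[:, j] <= t, lv, rv)
    return pred
```

Real systems use depth-limited trees rather than stumps, but the residual-fitting loop is the same.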
Click-through rate (CTR) is an important metric for ad systems, job portals, and recommendation systems; it affects publishers' revenue and advertisers' bid amounts in "pay for performance" business models. We learn regression models using features of the job, the job's click history (when available), and features of "related" jobs. We show that our models predict CTR much better than predicting the average CTR for all job listings, even in the absence of click history for the job listing.
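The abstract's baseline is predicting the global average CTR for every listing. A minimal sketch of why features of "related" jobs help, using a per-group average as a stand-in for the paper's regression models (the group names and CTR values are invented for illustration):

```python
import numpy as np
from collections import defaultdict

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

def global_avg_baseline(ctrs):
    """Baseline: predict the same average CTR for every listing."""
    return [float(np.mean(ctrs))] * len(ctrs)

def related_group_model(groups, ctrs):
    """Predict each listing's CTR from the average of its 'related' group,
    a simple stand-in for a feature-based regression model."""
    by_group = defaultdict(list)
    for g, c in zip(groups, ctrs):
        by_group[g].append(c)
    means = {g: float(np.mean(cs)) for g, cs in by_group.items()}
    return [means[g] for g in groups]
```

When CTR varies systematically across related-job groups, the group model's error is far below the global baseline's.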
The rRT-PCR test, the current gold standard for the detection of coronavirus disease (COVID-19), has known shortcomings, such as long turnaround time, potential shortage of reagents, false-negative rates of around 15-20%, and expensive equipment. The hematochemical values of routine blood exams could represent a faster and less expensive alternative.
Three different training datasets of hematochemical values from 1,624 patients (52% COVID-19 positive), admitted to San Raphael Hospital (OSR) from February to May 2020, were used to develop machine learning (ML) models: the complete OSR dataset (72 features: complete blood count (CBC), biochemical, coagulation, hemogasanalysis and CO-oximetry values, age, sex, and specific symptoms at triage) and two sub-datasets (the COVID-specific and CBC datasets, with 32 and 21 features respectively). 58 cases (50% COVID-19 positive) from another hospital, and 54 negative patients collected at OSR in 2018, were used for internal-external and external validation.
We developed five ML models: for the complete OSR dataset, the area under the receiver operating characteristic curve (AUC) for the algorithms ranged from 0.83 to 0.90; for the COVID-specific dataset from 0.83 to 0.87; and for the CBC dataset from 0.74 to 0.86. The validations also achieved good results: AUC from 0.75 to 0.78, and specificity from 0.92 to 0.96, respectively.
ML can be applied to blood tests as both an adjunct and an alternative to rRT-PCR for the fast and cost-effective identification of COVID-19-positive patients. This is especially useful in developing countries, or in countries facing a surge in infections.
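The AUC values reported above have a simple probabilistic reading: the chance that a randomly chosen positive patient receives a higher score than a randomly chosen negative one. A self-contained sketch of that rank-based computation (illustrative data only):

```python
def roc_auc(labels, scores):
    """AUC equals the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is O(n²); production libraries compute the same quantity from sorted ranks.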
Corporate financial distress prediction research has been ongoing for more than half a century, during which many models have emerged; among them, ensemble learning algorithms are the most accurate. Most of the state-of-the-art methods of recent years are based on gradient boosted decision trees. However, most of them do not consider using feature importance for feature selection, and the few that do rely on a biased feature importance measure, which may not reflect the true importance of features. To solve this problem, a heuristic algorithm based on permutation importance (PIMP) is proposed in this paper to correct the biased feature importance measure. This method ranks and filters the features used by machine learning models, which not only improves accuracy but also makes the results more interpretable. Based on financial data from 4,167 listed companies in China between 2001 and 2019, the experiment shows that, compared with using the random forest (RF) wrapper method alone, the bias in feature importance is indeed corrected by combining it with the PIMP method. After redundant features are removed, the performance of most machine learning models improves. The PIMP method is a promising addition to existing financial distress prediction methods. Moreover, compared with traditional statistical learning models and other machine learning models, the proposed PIMP-XGBoost offers higher prediction accuracy and clearer interpretation, making it suitable for commercial use.
•The model combines a corrected feature selection measure with XGBoost.
•Permutation importance can correct the bias of feature importance.
•The model is validated on datasets of Chinese listed companies over five metrics.
•The model is shown to outperform several benchmark techniques.
•Feature importance and partial dependence plots enhance model interpretation.
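The core idea of permutation importance, which the PIMP method builds on, is to measure how much a score drops when one feature's values are shuffled. A minimal model-agnostic sketch (the toy model and data are illustrative, not the paper's):

```python
import numpy as np

def permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    """Mean drop in score when one feature's column is shuffled: a feature
    the model truly relies on causes a large drop, while a feature the
    model ignores scores near zero regardless of its cardinality."""
    rng = np.random.default_rng(seed)
    base = score(y, predict(X))
    out = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break the feature-target link
            drops.append(base - score(y, predict(Xp)))
        out.append(float(np.mean(drops)))
    return out
```

Unlike impurity-based importances from tree models, this measure is computed on predictions, which is why it avoids the cardinality bias the abstract refers to.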
Abstract
Background
Accurate diagnostic strategies to rapidly identify SARS-CoV-2 positive individuals for management of patient care and protection of health care personnel are urgently needed. The predominant diagnostic test is viral RNA detection by RT-PCR from nasopharyngeal swab specimens; however, the results are not promptly obtainable in all patient care locations. Routine laboratory testing, in contrast, is readily available with a turnaround time (TAT) usually within 1-2 hours.
Method
We developed a machine learning model incorporating patient demographic features (age, sex, race) with 27 routine laboratory tests to predict an individual’s SARS-CoV-2 infection status. Laboratory testing results obtained within 2 days before the release of SARS-CoV-2 RT-PCR result were used to train a gradient boosting decision tree (GBDT) model from 3,356 SARS-CoV-2 RT-PCR tested patients (1,402 positive and 1,954 negative) evaluated at a metropolitan hospital.
Results
The model achieved an area under the receiver operating characteristic curve (AUC) of 0.854 (95% CI: 0.829-0.878). Application of this model to an independent patient dataset from a separate hospital resulted in a comparable AUC (0.838), validating the generalization of its use. Moreover, our model predicted initial SARS-CoV-2 RT-PCR positivity in 66% of individuals whose RT-PCR result changed from negative to positive within 2 days.
Conclusion
This model employing routine laboratory test results offers opportunities for early and rapid identification of high-risk SARS-CoV-2 infected patients before their RT-PCR results are available. It may play an important role in assisting the identification of SARS-CoV-2 infected patients in areas where RT-PCR testing is not accessible due to financial or supply constraints.
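The results above report a 95% CI around the AUC. The abstract does not state how that interval was computed; one common choice is the percentile bootstrap, sketched here on synthetic data:

```python
import numpy as np

def auc(y, s):
    """Rank-based AUC (ties count half)."""
    pos, neg = s[y == 1], s[y == 0]
    return float(np.mean([(p > neg).mean() + 0.5 * (p == neg).mean()
                          for p in pos]))

def bootstrap_auc_ci(y, s, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample patients with replacement and take
    the alpha/2 and 1-alpha/2 quantiles of the resampled AUCs."""
    rng = np.random.default_rng(seed)
    y, s = np.asarray(y), np.asarray(s)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # resample needs both classes
            continue
        aucs.append(auc(y[idx], s[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Analytic alternatives such as DeLong's method give similar intervals without resampling.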
Effective analysis and prediction of carbon prices can not only promote the development and maturity of the carbon trading market but also contribute to the rational allocation of carbon resources. In order to improve the prediction accuracy of carbon prices and provide more reference information for researchers and market practitioners, this study proposes a novel carbon price forecasting model that innovatively combines a comprehensive feature screening technology (CFS) with probability estimation. First, the carbon price data is decomposed by improved complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), and the wavelet transform (WT) algorithm is used to denoise the resulting intrinsic mode functions (IMFs) of different complexity. Then, a feature screening technology that combines the advantages of principal component analysis (PCA), random forest (RF), and gradient boosted decision tree (GBDT) methods is designed to extract the influencing factors of the predictive variables. Finally, a bidirectional gated recurrent unit (BIGRU) is used as an improved predictor to establish a point prediction model, and on this basis, Gaussian process regression (GPR) is used to estimate the probability interval of carbon price changes. In the empirical analysis of three markets, the hybrid model proposed in this study outperforms the comparison models. Taking the Shanghai market as an example, the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) of the model are 1.554, 1.081 and 0.028, respectively. The hybrid model also performs best in the other two markets.
•A novel carbon price prediction model is proposed.
•An improved feature extraction method increases the accuracy of the model.
•Multi-factor analysis can provide more information for the model.
•Interval prediction results can provide richer decision information.
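The abstract does not specify how the CFS step fuses the PCA, RF, and GBDT screenings. One simple fusion rule, shown here purely as an assumed illustration, is average-rank aggregation over the per-method feature rankings (the feature names below are invented):

```python
def screen_features(rankings, keep):
    """Fuse ranked feature lists (best first) from several screening
    methods by mean rank and keep the `keep` best features."""
    names = rankings[0]
    mean_rank = {f: 0.0 for f in names}
    for ranked in rankings:
        for pos, f in enumerate(ranked):
            mean_rank[f] += pos / len(rankings)
    return sorted(names, key=lambda f: mean_rank[f])[:keep]
```

A feature ranked highly by all three methods survives the screen even if no single method puts it first.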
As the financial industry has grown rapidly, credit risk has become one of the biggest threats facing commercial banks, making the prediction of clients' credit risk a central problem. Recent studies mostly focus on enhancing classifier performance for credit card default prediction rather than on building an interpretable model. In classification problems, an imbalanced dataset is also a crucial issue, because most cases lie in one class and only a few examples are in the other categories; traditional statistical approaches are not suitable for dealing with imbalanced data. In this study, a model is developed for credit default prediction using various credit-related datasets. Because there is often a significant difference between the minimum and maximum values of different features, Min-Max normalization is used to scale the features into a common range. Data-level resampling techniques, including various undersampling and oversampling methods, are employed to overcome the class imbalance problem. Different machine learning models are employed to obtain efficient results. We test the hypotheses of whether models developed with different machine learning techniques are significantly the same or different, and whether resampling techniques significantly improve the performance of the proposed models; one-way Analysis of Variance (ANOVA) is used to test the significance of the results. A train-test split is used to validate the results. On the imbalanced datasets, the models achieve an accuracy of 66.9% on the Taiwan clients credit dataset, 70.7% on the South German clients credit dataset, and 65% on the Belgium clients credit dataset.
Conversely, our proposed methods significantly improve the accuracy: 89% on the Taiwan clients credit dataset, 84.6% on the South German clients credit dataset, and 87.1% on the Belgium clients credit dataset. The results show that classifier performance is better on the balanced datasets than on the imbalanced ones. It is also observed that oversampling techniques perform better than undersampling techniques. Overall, the Gradient Boosted Decision Tree method performs better than the other traditional machine learning classifiers, and it gives the best results when combined with the K-means SMOTE oversampling method. Using one-way ANOVA, the null hypothesis was rejected with a p-value <0.001, confirming that the proposed model's performance improvement is statistically significant. The interpretable model is also deployed on the web for the convenience of different stakeholders. This model will help commercial banks, financial organizations, loan institutes, and other decision-makers to predict loan defaulters earlier.
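The study's best setup uses K-means SMOTE (as in the imbalanced-learn library). As a rough sketch of the family of techniques, plain SMOTE-style oversampling interpolates between minority points and their minority-class neighbors; the clustering step of K-means SMOTE is omitted here:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create synthetic minority-class samples by interpolating between a
    random minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]      # exclude the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                    # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because every synthetic point lies on a segment between two real minority samples, the method densifies the minority region rather than inventing outliers.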
Driver fatigue is an increasingly common contributing factor in traffic accidents, so an effective method to automatically detect driver fatigue is urgently needed. In this study, in order to capture the main characteristics of the EEG signals, four types of entropies (based on the EEG signal of a single channel) were calculated as the feature sets: sample entropy, fuzzy entropy, approximate entropy, and spectral entropy. All feature sets were used as the input of a gradient boosting decision tree (GBDT), a fast and highly accurate boosting ensemble method. The output of GBDT determined whether a driver was in a fatigue state based on their EEG signals. Three state-of-the-art classifiers, k-nearest neighbor, support vector machine, and neural network, were also employed for comparison. To assess our method, several experiments including parameter setting and classification performance comparison were performed on 22 subjects. The results indicated that it is possible to use only one EEG channel to detect a driver's fatigue state. The average highest recognition rate in this work was up to 94.0%, which could meet the needs of daily applications. Our GBDT-based method may assist in the detection of driver fatigue.
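Of the four entropies listed, spectral entropy is the simplest to state: the Shannon entropy of the normalized power spectrum. A self-contained sketch (the exact estimator settings in the paper may differ):

```python
import numpy as np

def spectral_entropy(signal):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]:
    near 0 for a pure tone, near 1 for broadband noise."""
    psd = np.abs(np.fft.rfft(signal)) ** 2
    psd = psd[1:]                      # discard the DC bin
    p = psd / psd.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(len(psd)))
```

A drowsy EEG typically concentrates power in low-frequency bands, shifting this value, which is what makes it a usable fatigue feature.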
Parkinson’s disease (PD) is a neurological condition characterized by the disruption of both motor and non-motor functions. Given the absence of a definitive diagnostic method, it is crucial to uncover its root causes so that individuals displaying symptoms of Parkinson’s disease can promptly receive treatment and comprehensive care. To address this, our study aims to develop an AI-powered system capable of detecting Parkinson’s disease and evaluating the primary factors influencing its development. We collected 12 distinct datasets from the well-known PPMI database, covering various medical assessments such as motor abilities, olfaction, cognition, sleep patterns, and depressive symptoms. We then refined this raw data using advanced search techniques to tailor it to our model’s requirements, and introduced a novel labeling approach, the majority voting algorithm. Following data preparation, we conducted single- and multi-modality analyses, focusing on single-treatment approaches and on integrating multiple treatments for a comprehensive therapeutic strategy. For both analyses, we employed five distinct machine learning algorithms. Notably, the linear Support Vector Machine emerged as the top performer, reaching an accuracy of 100% in both the single- and multi-modality analyses. Furthermore, we employed four tree-based models for feature selection, with the Gradient Boosted Decision Tree excelling at identifying the most significant features. Finally, we trained an Artificial Neural Network on these key features, achieving a highest accuracy of 91.41%.
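The abstract names a majority voting algorithm for labeling but does not detail it. A plausible minimal reading, offered only as an assumed sketch, is to assign each subject the label agreed on by most modality-specific assessments (the label strings are illustrative):

```python
from collections import Counter

def majority_vote(modality_labels):
    """Label a subject by the majority across modality-specific
    assessments; ties or missing labels are left undecided as None."""
    counts = Counter(l for l in modality_labels if l is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                    # tie: defer to manual review
    return ranked[0][0]
```

Returning None on ties keeps ambiguous subjects out of the training labels rather than guessing.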
Artificial intelligence (AI) is entering medical imaging, mainly enhancing image reconstruction. Nevertheless, improvements throughout the entire processing chain, from signal detection to computation, potentially offer significant benefits. This work presents a novel and versatile approach to detector optimization using machine learning (ML) and residual physics. We apply the concept to positron emission tomography (PET), intending to improve the coincidence time resolution (CTR). PET visualizes metabolic processes in the body by detecting photons with scintillation detectors. Improved CTR performance offers the advantage of reducing radioactive dose exposure for patients. Modern PET detectors with sophisticated concepts and read-out topologies represent complex physical and electronic systems requiring dedicated calibration techniques. Traditional methods primarily depend on analytical formulations that successfully describe the main detector characteristics. However, when accounting for higher-order effects, additional complexities arise in matching theoretical models to experimental reality. Our work addresses this challenge by combining traditional calibration with AI and residual physics, a highly promising approach. We present a residual-physics-based strategy using gradient tree boosting and physics-guided data generation. The explainable AI framework SHapley Additive exPlanations (SHAP) was used to match known physical effects with learned patterns, and the models were also tested against basic physical laws. We were able to improve the CTR significantly (by more than 20%) for clinically relevant detectors of 19 mm height, reaching CTRs of 185 ps (450-550 keV).
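The residual-physics idea is that the learned model corrects only what the analytic calibration misses. A minimal sketch of that structure, with a low-order polynomial standing in for the paper's gradient-boosted trees and an invented first-order "physics" baseline:

```python
import numpy as np

def fit_residual_model(x, target, analytic, deg=3):
    """Fit a learned correction on top of an analytic calibration: the
    model only has to capture what the physics formula misses (a
    polynomial stands in here for gradient tree boosting)."""
    coeffs = np.polyfit(x, target - analytic(x), deg)
    return lambda xs: analytic(xs) + np.polyval(coeffs, xs)
```

Because the baseline already explains the dominant behavior, the learned part stays small and easier to sanity-check against physical laws, as the paper does with SHAP.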