Khammam district in Telangana, India has gained notoriety for the increasing number of farmer suicides attributed to mounting crop failures. Climate change, causing sporadic and uneven rains in the largely agricultural state, has increased the strain on the already dwindling water table. Hence, there is a need for an in-depth analysis of the current state of these resources for their sustainable utilization. This study employs 21 factors to predict the groundwater potential of the region. An inventory of 126 wells was used to construct the dataset with the influencing factors. The statistical method of Frequency Ratio (FR) and a machine learning (ML) approach of Gradient Boosted Decision Trees with Greedy feature selection (GA-GBDT) were applied. The GA-GBDT model (accuracy: 81%) outperformed the FR model (accuracy: 63%), indicating that ML can match and even exceed traditional statistical approaches in similar studies. The models were used to generate groundwater potential maps for the region. The FR model predicted 78 sq. km as having very high potential to yield groundwater, while GA-GBDT estimated it at 152 sq. km. The results could play a vital role in irrigation management and city planning.
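The frequency ratio statistic underlying the FR model compares where wells actually occur against how much area each factor class covers: a ratio above 1 marks a class favorable for groundwater. A minimal sketch with hypothetical slope classes and well counts (the class names, counts, and areas below are illustrative, not the study's data):

```python
def frequency_ratio(wells_in_class, total_wells, area_in_class, total_area):
    """FR = (share of wells falling in a factor class) / (share of area in that class)."""
    well_share = wells_in_class / total_wells
    area_share = area_in_class / total_area
    return well_share / area_share

# Hypothetical slope classes: (wells observed, area in sq. km)
classes = {"0-5 deg": (60, 400), "5-15 deg": (40, 500), "15-35 deg": (26, 600)}
total_wells = sum(w for w, _ in classes.values())
total_area = sum(a for _, a in classes.values())
fr = {c: frequency_ratio(w, total_wells, a, total_area)
      for c, (w, a) in classes.items()}
```

Summing these class ratios per map cell across all factors yields the groundwater potential index that the FR map classifies.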
Cloud vendors such as Amazon (AWS) have started to offer FPGAs in addition to GPUs and CPUs in their on-demand computing services. In this work we explore design-space trade-offs of implementing a state-of-the-art machine learning library for gradient-boosted decision trees (GBDT) on the Amazon cloud and compare the scalability, performance, cost and accuracy with the best known CPU and GPU implementations from the literature. Our evaluation indicates that, depending on the dataset, an FPGA-based implementation of the bottleneck computation kernels yields a speed-up anywhere from 3X to 10X over a GPU and 5X to 33X over a CPU. We show that a smaller bin size results in better performance on an FPGA, but even with a bin size of 16 and a fixed-point implementation the degradation in accuracy on an FPGA is relatively small, around 1.3%–3.3% compared to a floating-point implementation with 256 bins on a CPU or GPU.
•3X to 10X improvement can be obtained on performance-critical kernels compared to an optimized GPU implementation.
•Smaller bin sizes yield higher performance with reasonably small degradation in accuracy.
•Cost of FPGA computing per hour must come down significantly for FPGAs to be a viable alternative to GPUs on Amazon Cloud.
•HW microservices could be a good opportunity for FPGAs in the cloud.
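The "bin size" in these highlights refers to the histogram discretization used by modern GBDT training: each continuous feature is quantized into a small number of bins, and fewer bins mean smaller histograms and faster kernels at some cost in accuracy. This is not the paper's FPGA kernel, just an illustrative sketch of that quantile-binning step:

```python
import bisect

def to_bins(values, n_bins):
    """Quantize continuous feature values into at most n_bins bins using
    quantile boundaries -- the discretization step behind histogram-based
    GBDT training (e.g. 16 vs. 256 bins in the comparison above)."""
    sorted_vals = sorted(values)
    # One boundary per quantile; duplicates simply merge sparse bins.
    edges = [sorted_vals[(i * len(sorted_vals)) // n_bins] for i in range(1, n_bins)]
    return [bisect.bisect_right(edges, v) for v in values]
```

With 16 bins a feature value fits in 4 bits, which is what makes narrow fixed-point FPGA datapaths attractive.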
We present the work that allowed us to win the Next-Place Prediction task of the Nokia Mobile Data Challenge. Using data collected from the smartphones of 80 users, we explore the characteristics of their mobility traces. We then develop three families of predictors, including tailored models and generic algorithms, to predict, based on instantaneous information only, the next place a user will visit. These predictors are enhanced with aging techniques that allow them to adapt quickly to the users' changes of habit. Finally, we devise various strategies to blend predictors together and take advantage of their diversity, leading to relative improvements of up to 4%.
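One simple blending strategy of the kind described is a weighted average of the per-place probabilities produced by each predictor. The sketch below assumes that interface; the predictor names and probabilities are hypothetical, and a weighted average is only one of several blending schemes such work considers:

```python
def blend_predictions(predictions, weights):
    """Combine per-predictor next-place probabilities with a weighted average
    and return the most likely next place."""
    combined = {}
    for pred, weight in zip(predictions, weights):
        for place, prob in pred.items():
            combined[place] = combined.get(place, 0.0) + weight * prob
    return max(combined, key=combined.get)

# Hypothetical outputs of a tailored model and a generic algorithm
tailored = {"home": 0.6, "gym": 0.4}
generic = {"home": 0.3, "gym": 0.7}
next_place = blend_predictions([tailored, generic], [0.5, 0.5])
```

Blending pays off precisely when the predictors disagree, as here, so the combination can outperform either one alone.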
OBJECTIVE: To explore an artificial intelligence approach based on gradient-boosted decision trees for prediction of all-cause mortality at an intensive care unit, comparing its performance to a recent logistic regression system in the literature, and to a logistic regression model built on the same platform.
METHODS: A gradient-boosted decision trees model and a logistic regression model were trained and tested with the Medical Information Mart for Intensive Care database. The 1-hour-resolution physiological measurements of adult patients, collected during 5 hours in the intensive care unit, consisted of eight routine clinical parameters. The study addressed how the models learn to categorize patients to predict intensive care unit mortality or survival within 12 hours. Performance was evaluated with accuracy statistics and the area under the Receiver Operating Characteristic curve.
RESULTS: The gradient-boosted trees yielded an area under the Receiver Operating Characteristic curve of 0.89, compared to 0.806 for the logistic regression. Accuracy was 0.814 for the gradient-boosted trees, compared to 0.782 for the logistic regression. The diagnostic odds ratio was 17.823 for the gradient-boosted trees, compared to 9.254 for the logistic regression. Cohen's kappa, the F-measure, the Matthews correlation coefficient, and markedness were all higher for the gradient-boosted trees.
CONCLUSION: The discriminatory power of the gradient-boosted trees was excellent. The gradient-boosted trees outperformed the logistic regression in intensive care unit mortality prediction. The high diagnostic odds ratio and markedness values for the gradient-boosted trees are important in the context of the studied unbalanced dataset.
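The less common metrics in those results, diagnostic odds ratio and markedness, follow directly from the 2x2 confusion matrix. A minimal sketch with illustrative counts (not the study's):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Metrics of the kind reported above, computed from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    ppv = tp / (tp + fp)            # positive predictive value (precision)
    npv = tn / (tn + fn)            # negative predictive value
    markedness = ppv + npv - 1      # how "marked" predictions are vs. chance
    dor = (tp * tn) / (fp * fn)     # diagnostic odds ratio
    return {"accuracy": accuracy, "markedness": markedness, "dor": dor}
```

Because the diagnostic odds ratio is a ratio of odds rather than of rates, it stays interpretable on unbalanced data, which is why the abstract highlights it for this dataset.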
Hyperparameter tuning is the collection of techniques for discovering optimal values for the settings we supply to machine learning algorithms. Put another way, hyperparameters are not optimized by the algorithm itself. When researching Big Data, we face the dilemma of whether it is worthwhile to do hyperparameter tuning with the maximum possible amount of data, because hyperparameter tuning may consume far more resources than conducting a single experiment with default hyperparameter values: each combination of algorithm settings results in an additional experiment. Here, we show that hyperparameter tuning with all available data is beneficial within the scope of our experiments. We conduct experiments with three Big Data Medicare insurance claims datasets, as exercises in Medicare fraud detection. We show that for each dataset, we obtain better performance from LightGBM and CatBoost classifiers with tuned hyperparameters. Since some features of the data are high-cardinality categorical features, we also try different encoding techniques in our experiments. We find that across the different encoding techniques, hyperparameter tuning provides an improvement in the performance of both LightGBM and CatBoost.
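The cost argument, one experiment per settings combination, can be sketched as exhaustive grid enumeration. The grid values mirror common LightGBM-style settings, but the `evaluate` function below is a placeholder standing in for an actual train-and-score run, not the paper's procedure:

```python
from itertools import product

# Hypothetical tuning grid; each combination implies one full experiment.
grid = {"learning_rate": [0.01, 0.1],
        "num_leaves": [31, 63],
        "n_estimators": [100, 500]}

def evaluate(params):
    # Placeholder objective standing in for held-out AUC after training.
    return -abs(params["learning_rate"] - 0.1) + params["num_leaves"] / 1000.0

candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
best = max(candidates, key=evaluate)
```

Even this tiny 2x2x2 grid multiplies the work eightfold over a single default-settings run, which is exactly the Big Data dilemma the abstract describes.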
Purpose
This paper aims to present a robust prediction method for estimating the quality of electronic products assembled with pin-in-paste soldering technology. A specific board quality factor was also defined, which describes the expected yield of the board assembly.
Design/methodology/approach
Experiments were performed to obtain the required input data for developing a prediction method based on decision tree learning techniques. A Type 4 lead-free solder paste (particle size 20–38 µm) was deposited by stencil printing with different printing speeds (from 20 mm/s to 70 mm/s) into the through-holes (0.8 mm, 1 mm, 1.1 mm, 1.4 mm) of an FR4 board. Hole-filling was investigated with X-ray analyses. Three test cases were evaluated.
Findings
The optimal parameters of the algorithm were determined as: subsample of 0.5, learning rate of 0.001, maximum tree depth of 6 and 10,000 boosting iterations. After optimisation, the mean absolute error, root mean square error and mean absolute percentage error averaged 0.024, 0.03 and 3.5, respectively, for the prediction of the hole-filling value based on printing speed and hole diameter. Our method is able to predict the hole-filling in pin-in-paste technology for different through-hole diameters.
Originality/value
No research is available in the current literature on machine learning techniques for pin-in-paste technology. We therefore developed a method using decision tree learning techniques to support the design of the stencil printing process for through-hole components and pin-in-paste technology. The first-pass yield of the assembly can be enhanced, and the reflow soldering failures of pin-in-paste technology can be significantly reduced.
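The three error metrics reported in the Findings (MAE, RMSE, MAPE) are the standard definitions; a minimal sketch for completeness:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no zero targets)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Since hole-filling is a fraction between 0 and 1, an MAE of 0.024 with a MAPE of 3.5 is consistent with predictions typically within a few percent of the measured fill.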
A network of moored hydrophones is an effective way of monitoring the seismicity of oceanic ridges, since it allows detection and localization of underwater events by recording the generated T waves. The high cost of ship time necessitates long periods (normally a year) of autonomous functioning of the hydrophones, which results in very large data sets. The preliminary but indispensable part of the data analysis consists of identifying all T wave signals. This process is extremely time consuming if it is done by a human operator who visually examines the entire database. We propose a new method for automatic signal discrimination based on the Gradient Boosted Decision Trees technique that uses the distribution of signal spectral power among different frequency bands as the discriminating characteristic. We have applied this method to automatically identify the types of acoustic signals in data collected by two moored hydrophones in the North Atlantic. We show that the method is capable of efficiently resolving the signals of seismic origin with a small percentage of wrong identifications and missed events: 1.2% and 0.5% for T waves and 14.5% and 2.8% for teleseismic P waves, respectively. In addition, good identification rates for signals of other types (iceberg and ship generated) are obtained. Our results indicate that the method can be successfully applied to automate the analysis of other (not necessarily acoustic) databases, provided that enough information is available to describe the statistical properties of the signals to be identified.
Key Points
A new automatic signal discrimination method is proposed
The method uses Gradient Boosted Decision Trees
The method can be useful in the analysis of any large data set
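The discriminating characteristic, the distribution of spectral power among frequency bands, can be sketched in a few lines. The band edges and the test tone below are illustrative, not taken from the study, and a naive DFT stands in for whatever spectral estimator the authors used:

```python
import cmath, math

def band_power_fractions(signal, sample_rate, band_edges):
    """Distribute spectral power among frequency bands and normalize to
    fractions -- a feature vector of the kind fed to the classifier.
    Naive O(n^2) DFT, adequate for a short illustrative window."""
    n = len(signal)
    fractions = [0.0] * (len(band_edges) - 1)
    for k in range(1, n // 2):                     # positive frequencies, skip DC
        freq = k * sample_rate / n
        coeff = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2
        for b in range(len(band_edges) - 1):
            if band_edges[b] <= freq < band_edges[b + 1]:
                fractions[b] += power
    total = sum(fractions) or 1.0
    return [p / total for p in fractions]

# A pure 10 Hz tone sampled at 64 Hz concentrates its power in the 8-16 Hz band.
tone = [math.sin(2 * math.pi * 10 * t / 64) for t in range(64)]
fracs = band_power_fractions(tone, 64, [0, 8, 16, 32])
```

Because the fractions sum to one, the feature is insensitive to overall signal amplitude, which matters when sources sit at very different distances from the hydrophone.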
This paper provides detailed information about team Leustagos' approach to the wind power forecasting track of GEFCom 2012. The task was to predict the hourly power generation at seven wind farms, 48 hours ahead. The problem was addressed by extracting time- and weather-related features, which were used to build gradient-boosted decision trees and linear regression models. This approach achieved first place in both the public and private leaderboards.
Due to the size of the data involved, performance is an important consideration in the task of detecting fraudulent Medicare insurance claims. We evaluate CatBoost and XGBoost on the task of Medicare fraud detection, and report performance in terms of running time and Area Under the Receiver Operating Characteristic Curve (AUC). We show that adding a categorical feature improves performance for both XGBoost and CatBoost in terms of AUC, and that CatBoost's performance is higher in a statistically significant sense. Moreover, we conduct experiments to find the optimal number of decision trees to use for XGBoost and CatBoost in the task of Medicare fraud detection. This is an important contribution because the number of trees in the ensemble governs the overall resource consumption of a Gradient Boosted Decision Tree implementation. We find that with a purely numerical dataset, CatBoost and XGBoost yield nearly equivalent performance in terms of AUC, and XGBoost has a shorter training time. With respect to Medicare fraud detection, to the best of our knowledge, this is the first study to evaluate the performance of CatBoost and XGBoost in terms of running time and AUC on highly imbalanced Big Data. Our contribution of evaluating running time performance on a large imbalanced dataset benefits researchers looking for more efficient utilization of valuable resources.
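AUC is the metric of choice here because it is insensitive to the class imbalance that characterizes fraud data. It equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which the pairwise sketch below computes directly (O(P*N), fine for illustration but not for Big Data; production code would use a sort-based rank formulation):

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the probability that a random positive
    example is scored above a random negative one (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

A classifier that ranks every fraudulent claim above every legitimate one scores 1.0 regardless of how rare fraud is, whereas accuracy on such imbalanced data is dominated by the majority class.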