Machine learning is increasingly being applied to neuroimaging data. However, most machine learning algorithms have not been designed to accommodate neuroimaging data, which typically has many more data points than subjects, in addition to multicollinearity and a low signal-to-noise ratio. Consequently, the relative efficacy of different machine learning regression algorithms for different types of neuroimaging data is not known. Here, we sought to quantify the performance of a variety of machine learning algorithms for use with neuroimaging data with various sample sizes, feature set sizes, and predictor effect sizes. The contribution of additional machine learning techniques (embedded feature selection and bootstrap aggregation, or bagging) to model performance was also quantified. Five machine learning regression methods (Gaussian Process Regression, Multiple Kernel Learning, Kernel Ridge Regression, the Elastic Net and Random Forest) were examined with both real and simulated MRI data, and compared with standard multiple regression. The different machine learning regression algorithms produced varying results, which depended on sample size, feature set size, and predictor effect size. When the effect size was large, the Elastic Net, Kernel Ridge Regression and Gaussian Process Regression performed well at most sample sizes and feature set sizes. However, when the effect size was small, only the Elastic Net made accurate predictions, and only for sample sizes greater than 400. Random Forest also performed moderately well for small effect sizes, and did so across all sample sizes. Machine learning techniques also improved prediction accuracy for multiple regression. These data provide empirical evidence for the differential performance of various machine learning algorithms on neuroimaging data, which depends on sample size, number of features, and effect size.
• The choice of machine learning algorithm influenced prediction accuracy.
• Sample size was important: prediction accuracy generally increased once N ≥ 400.
• The Elastic Net performed well at a range of effect sizes, relative to other methods.
• Random Forest performed well at small effect sizes.
• Gaussian Process Regression performed well at large effect sizes.
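Bootstrap aggregation (bagging), one of the additional techniques named in the abstract above, can be sketched in a few lines. This is a minimal illustration only, not the study's pipeline: it assumes a one-feature least-squares model as the base learner, and the data, the helper names `fit_line` and `bagged_predict`, and the parameters `n_models` and `seed` are all hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x is identical
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bagged_predict(xs, ys, x_new, n_models=50, seed=0):
    """Average the predictions of models fitted to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Draw n indices with replacement to form one bootstrap resample.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(a + b * x_new)
    return sum(predictions) / len(predictions)


# Hypothetical data: y = 2x + 1 plus Gaussian noise.
noise = random.Random(1)
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]
print(bagged_predict(xs, ys, x_new=10.0))
```

Averaging over resamples mainly reduces the variance of unstable base learners, so the benefit for a plain linear fit like this one is small; the sketch only shows the mechanics.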
Defect Number Prediction (DNP) models can offer more benefits than classification-based defect prediction. Recently, many researchers have proposed employing regression algorithms for DNP, and have found that the algorithms achieve low Average Absolute Error (AAE) and high Pred(0.3) values. However, since defect datasets generally contain many non-defective modules, even if a DNP model predicts the number of defects in all modules as zero, the AAE value of the model will be low and the Pred(0.3) value will be high. Therefore, the good performance of the regression algorithms in terms of AAE and Pred(0.3) may be questioned due to the imbalanced distribution of the number of defects.
To revisit the impact of regression algorithms for predicting the precise number of defects.
We examine the practical effects of 12 widely used regression algorithms, two data resampling algorithms (SmoteR and ROS), three ensemble learning algorithms (gradient boosting regression, AdaBoost.R2, and Bagging), one feature selection method (information gain), and one parameter optimization method (grid search) for predicting the precise number of defects on 18 PROMISE datasets. We propose to evaluate the AAE and Pred(0.3) values separately for modules with different numbers of defects.
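The concern about imbalanced defect counts can be illustrated with a small sketch. It assumes the relative error is computed as |predicted - actual| / (actual + 1), a convention that avoids division by zero and is not stated in the abstract, and it uses made-up defect counts (80 non-defective and 20 defective modules). A model that predicts zero defects everywhere scores well overall, yet fails on every defective module once the two groups are scored separately.

```python
def aae(actual, predicted):
    """Average Absolute Error over a set of modules."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def pred_l(actual, predicted, level=0.3):
    """Fraction of modules whose relative error is at most `level`.

    Assumption: relative error = |p - a| / (a + 1), so that modules with
    zero defects do not cause a division by zero.
    """
    hits = sum(
        1 for a, p in zip(actual, predicted) if abs(p - a) / (a + 1) <= level
    )
    return hits / len(actual)


# Hypothetical data: 80 clean modules and 20 defective modules.
actual = [0] * 80 + [1] * 10 + [2] * 6 + [5] * 4
all_zero = [0] * len(actual)  # trivial model: predicts no defects anywhere

overall_aae = aae(actual, all_zero)
overall_pred = pred_l(actual, all_zero)

# Score the defective modules on their own.
defective = [(a, p) for a, p in zip(actual, all_zero) if a > 0]
def_actual = [a for a, _ in defective]
def_predicted = [p for _, p in defective]
defective_aae = aae(def_actual, def_predicted)
defective_pred = pred_l(def_actual, def_predicted)

print(f"all modules:       AAE={overall_aae:.2f}  Pred(0.3)={overall_pred:.2f}")
print(f"defective modules: AAE={defective_aae:.2f}  Pred(0.3)={defective_pred:.2f}")
```

On this toy data the overall AAE is 0.42 and the overall Pred(0.3) is 0.80, while the defective modules alone give an AAE of 2.10 and a Pred(0.3) of 0.00.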
The AAE values for defective modules are very high and the Pred(0.3) values are very low, i.e., the regression algorithms are very inaccurate for predicting the precise number of defects in defective modules.
The problem of predicting the precise number of defects via regression algorithms is far from being solved. We recommend that software testers use regression algorithms to rank modules for testing resource allocation, rather than to predict the precise number of defects for evaluating software reliability and maintenance effort. In addition, most existing DNP studies that use the overall AAE and Pred(0.3) values across all modules as evaluation metrics for their proposed DNP algorithms should be revisited.
Abstract
Workability is one of the key properties of concrete and is governed by the water-cement ratio. Superplasticizers (SPs) are added to improve the workability of concrete without changing the water-cement ratio. Cement paste helps in analyzing the properties of fresh concrete in which the dispersion of cement particles is taken into account. The cement-dispersing properties of an SP are governed by its dosage and family. Various dosages and families of SP are considered for estimating the workability of cement paste, whose rheological properties are investigated through the mini-slump spread diameter. The main aims of this analysis are to measure the workability achieved with different superplasticizers by conducting a mini-slump test, and then to model the flow rate of the superplasticized Portland Pozzolana Cement (PPC) paste using random forest (RF), decision tree (DT) and multiple regression algorithms. The training and testing data for the models comprised 287 unique mixture compositions at a water-to-cement ratio of 0.37. These mixtures were tested experimentally in a laboratory using four types of locally available PPC and SPs that can be broadly categorised into four families. The amounts of seven SP brands, the water content, and the cement weight were the input parameters for the model, and flow rate was the output parameter. The predicted and experimentally measured values of flow rate were compared and the deviation was recorded.
In this paper, for the first time, the neutron cross section data of the 100Mo (n, 2n) 99Mo reaction are evaluated using experimental data available in the IAEA-EXFOR database library and nuclear model-based data generated with the Talys 1.9 code, by applying a novel method that combines the Kalman filtering technique with Machine Learning (ML) regression algorithms. The evaluation was performed after a detailed study of all the EXFOR papers corresponding to the 100Mo (n, 2n) 99Mo reaction, with nuclear model-based cross sections generated by executing T6 random input files using the Talys 1.9 code. The resulting evaluated curve is then compared with the existing evaluated curves of the 100Mo (n, 2n) 99Mo reaction from nuclear data libraries such as ENDF/B-VIII.0, JEFF-3.3, JENDL-4.0, CENDL-3.1 and TENDL 2017, and is found to be in good agreement with them. Chi-square and generalized Chi-square tests were employed to assess the proposed evaluation techniques, which were found to perform well in estimating the evaluated mean values and evaluated uncertainties of the cross section.
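As a rough illustration of the Kalman filtering idea mentioned above, the sketch below updates a model-based prior for a cross section at a single energy with successive experimental points. It is not the authors' evaluation procedure, which the abstract does not detail. The scalar, uncorrelated form is an assumption made for brevity, and all numbers and the helper name `kalman_update` are hypothetical.

```python
def kalman_update(prior_mean, prior_var, measurement, measurement_var):
    """One scalar Kalman update: blend a prior with one measurement."""
    gain = prior_var / (prior_var + measurement_var)
    post_mean = prior_mean + gain * (measurement - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


# Hypothetical model-based prior for a cross section (mb) at one energy.
mean, var = 1400.0, 100.0 ** 2

# Hypothetical experimental points as (value, variance) pairs.
measurements = [(1500.0, 100.0 ** 2), (1450.0, 50.0 ** 2)]

for value, value_var in measurements:
    mean, var = kalman_update(mean, var, value, value_var)

print(f"evaluated mean = {mean:.1f} mb, uncertainty = {var ** 0.5:.1f} mb")
```

Each update pulls the estimate toward the measurement in proportion to the gain, and the variance shrinks after every point. A realistic evaluation would work with vectors over energy and full covariance matrices.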
Air pollution is a risk factor for many diseases that can lead to death. Therefore, it is important to develop forecasting mechanisms that authorities can use to take measures in advance when high concentrations of certain pollutants are expected in the near future. Machine Learning models, in particular Deep Learning models, have been widely used to forecast air quality. In this paper we present a comprehensive review of the main contributions in the field during the period 2011–2021. We searched the main scientific publication databases and, after a careful selection, considered a total of 155 papers. The papers are classified in terms of geographical distribution, predicted values, predictor variables, evaluation metrics and Machine Learning model.
• We tested 66 calibration methods for modeling and predicting soil NPK.
• Proper calibration methods could improve model performance.
• We suggested a framework to select the best calibration methods.
• P and K could be predicted from hyperspectral VNIR data, but N could not.
Soil nutrients, including available nitrogen (N), phosphorus (P), and potassium (K), are critical properties for monitoring soil fertility and function. Spectroscopy analysis has proven to be a rapid and effective means of predicting soil properties in general, and NPK in particular. However, different calibration methods, including preprocessing transformations (PPTs) and regression algorithms (RAs), considerably affect the performance of prediction models. In this study, the raw spectrum and 21 PPTs, each combined with three RAs (66 calibration methods in total), were investigated for modeling and predicting soil NPK using hyperspectral VNIR data (400–1000 nm). The ratio of performance to deviation (RPD) of the validation set was selected to evaluate prediction accuracy, and the ratio between the interpretable sum squared deviation and the real sum squared deviation (SSR/SST) of the validation set was used to evaluate the explanatory power of the models. It was found that there is a tradeoff between RPD and SSR/SST values; under this tradeoff, multiplicative scatter correction combined with the back-propagation neural network was preferred for predicting P (RPD=2.23, SSR/SST=0.81). Savitzky-Golay filtering plus logarithmic transformation, combined with partial least squares regression, was preferred for predicting K (RPD=1.47, SSR/SST=0.95). However, with extremely low RPD and SSR/SST values, the prediction of N was unreliable in this study. The evaluation approach presented in this paper suggests a framework for choosing a calibration method for spectroscopy analysis for predicting soil NPK and perhaps some other properties.
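The two evaluation metrics named above can be sketched as follows. The abstract does not spell out the formulas, so this assumes the common definitions: RPD as the sample standard deviation of the observed validation values divided by the root mean squared error of prediction, and SSR/SST as the sum of squared deviations of the predictions from the observed mean divided by the sum of squared deviations of the observations from that mean. The data and function names are hypothetical.

```python
import math
import statistics


def rpd(observed, predicted):
    """Ratio of performance to deviation: SD(observed) / RMSE."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return statistics.stdev(observed) / rmse


def ssr_over_sst(observed, predicted):
    """Explained sum of squares divided by total sum of squares."""
    mean_obs = statistics.fmean(observed)
    ssr = sum((p - mean_obs) ** 2 for p in predicted)
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return ssr / sst


# Hypothetical validation-set values for one soil property.
observed = [10.0, 12.0, 15.0, 18.0, 20.0]
predicted = [11.0, 12.0, 14.0, 17.0, 21.0]

print(f"RPD     = {rpd(observed, predicted):.2f}")
print(f"SSR/SST = {ssr_over_sst(observed, predicted):.2f}")
```

Because the two ratios measure different things (error relative to spread versus the share of variation the predictions reproduce), a calibration method can rank well on one and poorly on the other, which is the tradeoff the abstract describes.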
Current and upcoming airborne and spaceborne imaging spectrometers lead to vast hyperspectral data streams. This scenario calls for automated and optimized spectral dimensionality reduction techniques to enable fast and efficient hyperspectral data processing, such as inferring vegetation properties. In preparation for next-generation biophysical variable retrieval methods applicable to hyperspectral data, we present the evaluation of 11 dimensionality reduction (DR) methods in combination with advanced machine learning regression algorithms (MLRAs) for statistical variable retrieval. Two unique hyperspectral datasets were analyzed for the predictive power of DR+MLRA methods to retrieve leaf area index (LAI): (1) a simulated PROSAIL reflectance dataset (2101 bands), and (2) a field dataset from airborne HyMap data (125 bands). For the majority of MLRAs, first applying a DR method leads to superior retrieval accuracies and substantial gains in processing speed compared with feeding all bands into the regression algorithm. This was especially noticeable for the PROSAIL dataset: in the most extreme case, using classical linear regression (LR), the cross-validation results R²CV (RMSECV) improved from 0.06 (12.23) without a DR method to 0.93 (0.53) when combined with a best-performing DR method (i.e., CCA or OPLS). However, these DR methods no longer excelled when applied to noisy or real sensor data such as HyMap. In that case, the combination of kernel CCA (KCCA) with LR, or of classical PCA and PLS with an MLRA, showed more robust performance (R²CV of 0.93). Gaussian processes regression (GPR) uncertainty estimates revealed that LAI maps obtained from models trained in combination with a DR method can have lower uncertainties than those obtained using all HyMap bands. The results demonstrate that, in general, biophysical variable retrieval from hyperspectral data can benefit substantially from dimensionality reduction in both accuracy and computational efficiency.
Due to its extensive steps and trials, drug discovery is a long and expensive process. In the last decade, and under added pressure from the COVID-19 pandemic, the screening process has been assisted by advances in computational technology, including the application of Machine Learning. The classification task in Machine Learning has become one of the major approaches for drug discovery. Unfortunately, this practice uses discretized labels, which might lead to the loss of quantitative properties that could be meaningful. Therefore, in this paper, we aim to compare various Machine Learning regression algorithms in predicting inhibitory bioactivity, specifically the IC50 value, with the SARS-CoV-2 Replicase Polyprotein 1ab as the target. Using 1,138 non-duplicated data points downloaded from the ChEMBL database and engineered into four dataset variants, 42 regression algorithms were utilized for the prediction. We found that there are computational challenges to the use of regression algorithms in predicting bioactivity, as only a handful of algorithms, and only on a specific dataset variant, returned valid performance parameters upon testing. The three that yielded the highest counts of valid performance parameters are the Histogram Gradient Boosting Regressor (HGBR), Light Gradient Boosting Machine Regressor (LGBR), and Random Forest Regression (RFR). Further statistical analyses show that there is no significant difference between these three algorithms, except for the time taken for training and testing the model, where the LGBR excels. Therefore, these three algorithms should be primarily considered for studies of the same nature.