This study focuses on predicting nitrogen oxide (NOx) concentration at the inlet of selective catalytic reduction (SCR) reactors under variable load conditions of thermal power units. Variables strongly correlated with NOx concentration at the SCR inlet were selected as auxiliary variables, and the delay time between the NOx concentration and each auxiliary variable was determined by the mutual information method. Taking the delay time, dynamic time and prediction error into account, a real-time dynamic prediction model of NOx concentration based on the least squares support vector machine was proposed. The model was tested using actual historical data recorded under variable load operating conditions of two thermal power units. Test results show that the proposed dynamic prediction model has high real-time prediction accuracy and satisfactory generalization ability.
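A minimal sketch of the mutual-information-based delay estimation step described above, in Python with scikit-learn; the signal names, maximum candidate lag, and synthetic data are illustrative assumptions, not the paper's actual variables or configuration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def estimate_delay(aux, nox, max_lag=60):
    """Return the lag (in samples) at which an auxiliary signal shares the
    most mutual information with the SCR-inlet NOx concentration."""
    scores = []
    for lag in range(max_lag + 1):
        x = aux[:len(aux) - lag] if lag else aux   # aux shifted back by `lag`
        y = nox[lag:]
        mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
        scores.append(mi)
    return int(np.argmax(scores)), scores

# illustrative use on synthetic signals with a 12-sample delay injected
rng = np.random.default_rng(0)
aux = rng.normal(size=2000)
nox = np.roll(aux, 12) + 0.1 * rng.normal(size=2000)
lag, _ = estimate_delay(aux, nox)
print("estimated delay:", lag)   # expected to recover 12
```

Once a lag is chosen for each auxiliary variable, the inputs can be shifted accordingly so that the least squares support vector machine (or any other regressor) is trained on time-aligned samples.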
A guide to machine learning for biologists. Greener, Joe G; Kandathil, Shaun M; Moffat, Lewis ...
Nature Reviews Molecular Cell Biology, 01/2022, Volume 23, Issue 1
Journal Article
Peer reviewed
Open access
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
When developing a clinical prediction model, penalization techniques are recommended to address overfitting, as they shrink predictor effect estimates toward the null and reduce mean-square prediction error in new individuals. However, shrinkage and penalty terms (‘tuning parameters’) are estimated with uncertainty from the development data set. We examined the magnitude of this uncertainty and the subsequent impact on prediction model performance.
This study comprises applied examples and a simulation study of the following methods: uniform shrinkage (estimated via a closed-form solution or bootstrapping), ridge regression, the lasso, and elastic net.
In a particular model development data set, penalization methods can be unreliable because tuning parameters are estimated with large uncertainty. This is of most concern when development data sets have a small effective sample size and the model's Cox-Snell R2 is low. The problem can lead to considerable miscalibration of model predictions in new individuals.
Penalization methods are not a ‘carte blanche’; they do not guarantee that a reliable prediction model is developed. They are most unreliable when they are needed most (i.e., when overfitting may be large). We recommend applying them with large effective sample sizes, as identified from recent sample size calculations that aim to minimize the potential for model overfitting and to precisely estimate key parameters.
•When developing a clinical prediction model, penalization and shrinkage techniques are recommended to address overfitting.
•Some methodology articles suggest penalization methods are a ‘carte blanche’ that resolves any issues with overfitting.
•We show that penalization methods can be unreliable, as their shrinkage and tuning parameters are often estimated with large uncertainty.
•Although penalization methods will, on average, improve on standard estimation methods, in a particular data set they are often unreliable.
•The most problematic data sets are those with small effective sample sizes and where the developed model has a Cox-Snell R2 far from 1, which is common for prediction models of binary and time-to-event outcomes.
•Penalization methods are best used when a sufficiently large development data set is available, as identified from sample size calculations that minimize the potential for model overfitting and precisely estimate key parameters.
•When the sample size is adequately large, any of the studied penalization or shrinkage methods can be used, as they should perform similarly and better than unpenalized regression, unless the sample size is extremely large and the apparent R2 is large.
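As a concrete illustration of the penalization methods compared in this study, the hedged sketch below fits ridge, lasso, and elastic net logistic regression with cross-validated tuning parameters in scikit-learn; the synthetic data set, predictor count, and fold number are illustrative assumptions rather than the study's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# small illustrative development data set: binary outcome, 12 candidate predictors
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           random_state=1)

# each penalty's tuning parameter is chosen by 5-fold cross-validation,
# so it is itself an estimate carrying sampling uncertainty
models = {
    "ridge": LogisticRegressionCV(penalty="l2", Cs=20, cv=5, max_iter=5000),
    "lasso": LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=5,
                                  max_iter=5000),
    "elastic net": LogisticRegressionCV(penalty="elasticnet", solver="saga",
                                        l1_ratios=[0.1, 0.5, 0.9], Cs=20,
                                        cv=5, max_iter=5000),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: selected C = {model.C_[0]:.3g}")
```

Refitting these models on bootstrap resamples of the same development data would show how variable the selected tuning parameters (and hence the implied shrinkage) can be, which is exactly the instability the study highlights at small effective sample sizes.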
To address the problem of energy shortage, demand side management has attracted increasing attention. To understand the regulation potential of demand side resources, this paper designs a hierarchical, partitioned dynamic regulation architecture for demand side adjustable resources that supports interconnection between multi-level subjects and efficient information interaction. A regulation potential prediction model for elastic load is established within this architecture, and the credible adjustable potential of residential users is calculated under two regulation scenarios: participation in emergency load reduction and reduction of the peak-valley difference. The results demonstrate the effectiveness of the architecture.
In this study, we pay special attention to the effects of social capital on crowdfunding performance. Previous research has investigated the influencing factors of crowdfunding performance, but the mechanism by which social capital influences crowdfunding performance remains unclear. We first explore the effects of social capital on crowdfunding performance through an empirical study based on a large-scale real-world dataset of 103,582 campaigns collected from Indiegogo. The empirical results suggest that social capital has a remarkable influence on crowdfunding performance. Several interesting findings emerge; for example, longer descriptions of campaigns improve crowdfunding performance, whereas longer descriptions of campaign owners worsen it. In addition, we test the moderating effects of campaign category. We then develop a prediction method for crowdfunding performance from the perspective of social capital. The experiments demonstrate the effectiveness of our social capital-based prediction model.
•Campaign owners' social capital has remarkable effects on crowdfunding performance.
•A prediction model for fundraising performance based on social capital is proposed.
•Campaign category moderates the effects of social capital on crowdfunding performance.
•A comprehensive representation of campaign owners' social capital is proposed.
Segment routing (SR) is a new network technology derived from MPLS and based on SDN. Combining SR with software-defined perimeter (SDP), a new network security technology, is expected to address traditional problems faced by the SR data plane, such as data monitoring and denial of service, as well as new threats such as loop attacks and label detection. Focusing on the security management of access devices in the SR data plane, this paper first proposes an SR security model, SbSR (SDP-based SR), built on an SDP trust enhancement architecture; it then designs a two-level SDP AH trust verification mechanism and four trust management mechanisms covering initial trust value, trust evaluation, trust renewal, and trust inheritance. In the trust evaluation mechanism at the core of the model, a system-cloud grey model GM(1,1) weighted Markov prediction model is introduced to obtain real-time trust from the historical behavior of device nodes, and four indexes, namely benign message ratio, loyal forwarding ratio, forwarding ratio stationarity coefficient, and packet rate stationarity coefficient, are introduced to distinguish malicious devices from normal devices. Finally, simulation tests of five security functions and of security costs show that the proposed architecture can counter port scanning, traffic monitoring, topology detection, loop attacks, and DoS attacks on the SR network data plane, with an average access delay cost of 2.84 s for each new network agent, realizing multi-faceted protection of the SR data plane.
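For readers unfamiliar with the grey model component of the trust evaluation mechanism, the sketch below implements a plain GM(1,1) one-step-ahead forecast in Python; it deliberately omits the cloud-model and weighted-Markov parts of the paper's full method, and the trust-score history is invented for illustration.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    """Plain GM(1,1) grey model: fit on the series x0 and forecast `steps` ahead."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                           # accumulated generating series
    z1 = 0.5 * (x1[1:] + x1[:-1])                # background (mean) values
    B = np.column_stack([-z1, np.ones(n - 1)])
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]  # develop coefficient and grey input
    def x1_hat(k):                               # fitted accumulated value at index k
        return (x0[0] - b / a) * np.exp(-a * k) + b / a
    ks = np.arange(n, n + steps)
    return np.array([x1_hat(k) - x1_hat(k - 1) for k in ks])

# invented trust-score history for one device node
history = [0.82, 0.80, 0.79, 0.77, 0.76, 0.74]
print("predicted next trust value:", gm11_forecast(history, steps=1))
```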
Clinical prediction models aim to predict outcomes in individuals, to inform diagnosis or prognosis in healthcare. Hundreds of prediction models are published in the medical literature each year, yet many are developed using a dataset that is too small for the total number of participants or outcome events. This leads to inaccurate predictions and consequently incorrect healthcare decisions for some individuals. In this article, the authors provide guidance on how to calculate the sample size required to develop a clinical prediction model.
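One hedged illustration of such a calculation: the sketch below implements the minimum sample size needed to target a global shrinkage factor of at least 0.9, one of the criteria proposed by Riley et al. for binary outcomes; the anticipated Cox-Snell R2 and number of candidate predictor parameters are placeholder inputs that a researcher would replace with values anticipated from previous studies.

```python
import math

def min_sample_size_shrinkage(p, r2_cs, shrinkage=0.9):
    """Minimum n so the expected global shrinkage factor is >= `shrinkage`,
    given p candidate predictor parameters and an anticipated Cox-Snell R2."""
    n = p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))
    return math.ceil(n)

# placeholder values: 24 candidate parameters, anticipated Cox-Snell R2 of 0.288
print(min_sample_size_shrinkage(p=24, r2_cs=0.288))   # roughly 620 participants
```

The full guidance combines this criterion with others (small optimism in apparent model fit and precise estimation of the overall outcome proportion), and the largest of the resulting sample sizes is taken forward.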
•After a clinical prediction model is developed, it is usually necessary to undertake an external validation study that examines the model's performance in new data from the same or a different population. External validation studies should have an appropriate sample size, in order to estimate model performance measures precisely for calibration, discrimination and clinical utility.
•Rules-of-thumb suggest at least 100 events and 100 nonevents. Such blanket guidance is imprecise, and not specific to the model or validation setting.
•Our work shows that precision of performance estimates is affected by the model's linear predictor (LP) distribution, in addition to the number of events and total sample size. Furthermore, sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration.
•Our new proposal uses a simulation-based sample size calculation, which accounts for the LP distribution and (mis)calibration in the validation sample, and calculates the sample size (and events) required conditional on these factors.
•The approach requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc.), the model's anticipated LP distribution in the validation population, and whether or not the model is well calibrated. Guidance on how to specify these values is given, and R and Stata code is provided.
Sample size “rules-of-thumb” for external validation of clinical prediction models suggest at least 100 events and 100 non-events. Such blanket guidance is imprecise, and not specific to the model or validation setting. We investigate factors affecting precision of model performance estimates upon external validation, and propose a more tailored sample size approach.
Simulation of logistic regression prediction models to investigate factors associated with precision of performance estimates. Then, explanation and illustration of a simulation-based approach to calculate the minimum sample size required to precisely estimate a model's calibration, discrimination and clinical utility.
Precision is affected by the model's linear predictor (LP) distribution, in addition to number of events and total sample size. Sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration. The simulation-based calculation accounts for the LP distribution and (mis)calibration in the validation sample. Application identifies 2430 required participants (531 events) for external validation of a deep vein thrombosis diagnostic model.
Where researchers can anticipate the distribution of the model's LP (eg, based on development sample, or a pilot study), a simulation-based approach for calculating sample size for external validation offers more flexibility and reliability than rules-of-thumb.
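A hedged sketch of the simulation-based idea: for a candidate validation sample size, outcomes are repeatedly simulated from an assumed linear predictor distribution and the precision of the calibration slope is summarized; the normal LP distribution, the assumption of perfect calibration, and the candidate sample sizes are illustrative choices, not the authors' published code.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def calibration_slope_ci_width(n, lp_mean=-2.0, lp_sd=1.5, n_sim=500):
    """Median 95% CI width for the calibration slope when a well-calibrated
    model is validated on n individuals with LP ~ Normal(lp_mean, lp_sd)."""
    widths = []
    for _ in range(n_sim):
        lp = rng.normal(lp_mean, lp_sd, size=n)
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lp)))   # outcomes under perfect calibration
        fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
        widths.append(2 * 1.96 * fit.bse[1])             # width of the slope's 95% CI
    return float(np.median(widths))

# pick the smallest candidate n whose slope CI is acceptably narrow
for n in (200, 500, 1000, 2000):
    print(n, round(calibration_slope_ci_width(n), 3))
```

The same loop can be extended to record precision for the calibration intercept, the c-statistic, and net benefit, and to simulate from a miscalibrated model when miscalibration is anticipated in the validation population.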
Explaining complex or seemingly simple machine learning models is an important practical problem. We want to explain individual predictions from such models by learning simple, interpretable explanations. The Shapley value is a game-theoretic concept that can be used for this purpose. The Shapley value framework has a series of desirable theoretical properties, and can in principle handle any predictive model. Kernel SHAP is a computationally efficient approximation to Shapley values in higher dimensions. Like several other existing methods, this approach assumes that the features are independent. Since Shapley values currently suffer from the inclusion of unrealistic data instances when features are correlated, the explanations may be very misleading. This is the case even if a simple linear model is used for predictions. In this paper, we extend the Kernel SHAP method to handle dependent features. We provide several examples of linear and non-linear models with various degrees of feature dependence, where our method gives more accurate approximations to the true Shapley values.
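To make the setting concrete, the hedged sketch below runs baseline Kernel SHAP from the Python shap package on a linear model with two strongly correlated features; the data and model are invented, and the sketch shows the independence-assuming baseline that the paper improves upon rather than the authors' dependence-aware extension.

```python
import numpy as np
import shap
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# two strongly correlated features: the setting where plain Kernel SHAP struggles
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])
y = 2 * x1 + x2 + 0.1 * rng.normal(size=500)

model = LinearRegression().fit(X, y)

# baseline Kernel SHAP samples feature coalitions as if features were independent,
# so it evaluates the model on unrealistic (x1, x2) combinations
explainer = shap.KernelExplainer(model.predict, shap.sample(X, 100))
shap_values = explainer.shap_values(X[:5])
print(shap_values)
```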
Patients with COVID-19 in the intensive care unit (ICU) have a high mortality rate, and methods to assess patients' prognosis early and administer precise treatment are of great significance.
The aim of this study was to use machine learning to construct a model for the analysis of risk factors and prediction of mortality among ICU patients with COVID-19.
In this study, 123 patients with COVID-19 in the ICU of Vulcan Hill Hospital were retrospectively selected from the database, and the data were randomly divided into a training data set (n=98) and a test data set (n=25) with a 4:1 ratio. Significance tests, correlation analysis, and factor analysis were used to individually screen 100 potential risk factors. Conventional logistic regression methods and four machine learning algorithms were used to construct the risk prediction model for the prognosis of patients with COVID-19 in the ICU. The performance of these machine learning models was measured by the area under the receiver operating characteristic curve (AUC). Interpretation and evaluation of the risk prediction model were performed using calibration curves, SHapley Additive exPlanations (SHAP), Local Interpretable Model-Agnostic Explanations (LIME), etc., to ensure its stability and reliability. The outcome was based on ICU deaths recorded in the database.
Layer-by-layer screening of the 100 potential risk factors revealed 8 important risk factors that were included in the risk prediction model: lymphocyte percentage, prothrombin time, lactate dehydrogenase, total bilirubin, eosinophil percentage, creatinine, neutrophil percentage, and albumin level. An eXtreme Gradient Boosting (XGBoost) model built on these 8 risk factors showed the best discriminative ability in 5-fold cross-validation on the training set (AUC=0.86) and in the validation cohort (AUC=0.92). The calibration curve showed that the risk predicted by the model was in good agreement with the actual risk. In addition, using the SHAP and LIME algorithms, feature-level and sample-level interpretation of the XGBoost black-box model was implemented. Additionally, the model was translated into a web-based risk calculator that is freely available for public use.
The 8-factor XGBoost model predicts the risk of death among ICU patients with COVID-19 well; it shows initial evidence of stability and can be used effectively to predict COVID-19 prognosis in ICU patients.
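A hedged sketch of the modelling-plus-interpretation workflow described above, using xgboost and shap on synthetic data; the features, sample split, and hyperparameters are placeholders, not the study's actual 8 risk factors or settings.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the 8 screened risk factors and the ICU-death outcome
X, y = make_classification(n_samples=123, n_features=8, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1,
                          eval_metric="logloss")
model.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# SHAP feature-level interpretation of the fitted tree ensemble
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0))
```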
Full text
Available at:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, UILJ, UKNU, UL, UM, UPUK