Each year, thousands of clinical prediction models are developed to make predictions (e.g. estimated risk) to inform individual diagnosis and prognosis in healthcare. However, most are not reliable for use in clinical practice.
We discuss how the creation of a prediction model (e.g. using regression or machine learning methods) depends on the sample and size of the data used to develop it: were a different sample of the same size drawn from the same overarching population, the developed model could be very different, even when the same model development methods are used. In other words, for each model created there exists a multiverse of other potential models for that sample size and, crucially, an individual's predicted value (e.g. estimated risk) may vary greatly across this multiverse. The more an individual's prediction varies across the multiverse, the greater the instability. We show how small development datasets lead to a wider multiverse of different models, often with vastly unstable individual predictions, and explain how this instability can be exposed by bootstrapping the development dataset and presenting instability plots. We recommend healthcare researchers use large model development datasets to reduce instability concerns. This is especially important to ensure reliability across subgroups and improve model fairness in practice.
Instability is concerning because an individual's predicted value is used to guide their counselling, resource prioritisation, and clinical decision making. If different samples lead to different models with very different predictions for the same individual, then this should cast doubt on using any particular model for that individual. Therefore, visualising, quantifying and reporting the instability in individual-level predictions is essential when proposing a new model.
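As a rough illustration of the bootstrap idea described above, the following R sketch (simulated data and base R glm only; not the authors' own code) refits a logistic model in bootstrap samples and plots each individual's prediction from the original model against their predictions from the bootstrap models, a simple form of prediction instability plot.

# Sketch: bootstrap-based prediction instability plot (simulated data, names hypothetical)
set.seed(42)
n <- 200                                  # deliberately small development sample
x <- matrix(rnorm(n * 4), ncol = 4)       # four hypothetical predictors
lp <- 0.8 * x[, 1] - 0.5 * x[, 2]         # true linear predictor
y <- rbinom(n, 1, plogis(lp))
dat <- data.frame(y, x)

fit <- glm(y ~ ., data = dat, family = binomial)
p_orig <- predict(fit, type = "response") # predictions from the original model

B <- 200
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Estimated risk (original model)",
     ylab = "Estimated risk (bootstrap models)")
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)        # a new sample from the 'multiverse'
  fit_b <- glm(y ~ ., data = dat[idx, ], family = binomial)
  points(p_orig, predict(fit_b, newdata = dat, type = "response"),
         pch = ".", col = rgb(0, 0, 0, 0.2))
}
abline(0, 1, col = "red")                 # stable predictions lie on this line

The wider the vertical scatter around the diagonal, the more unstable an individual's prediction; rerunning with a larger n shows the scatter tightening.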
Background
Identification of biomarkers that predict severe Crohn’s disease is an urgent unmet research need, but existing research is piecemeal and haphazard.
Objective
To identify biomarkers that are potentially able to predict the development of subsequent severe Crohn’s disease.
Design
This was a prognostic systematic review with meta-analysis reserved for those potential predictors with sufficient existing research (defined as five or more primary studies).
Data sources
PubMed and EMBASE searched from inception to 1 January 2016, updated to 1 January 2018.
Review methods
Eligible studies compared biomarkers in patients who did or did not subsequently develop severe Crohn’s disease. We excluded biomarkers with insufficient research evidence. A clinician and two statisticians independently extracted data relating to predictors, severe disease definitions, event numbers and outcomes, including odds/hazard ratios. We assessed risk of bias. We looked for associations with subsequent severe disease rather than precise estimates of their strength. A random-effects meta-analysis was performed separately for odds ratios and hazard ratios.
Results
In total, 29,950 abstracts yielded just 71 individual studies, reporting 56 non-overlapping cohorts. Five clinical biomarkers (Montreal behaviour, age, disease duration, disease location and smoking), two serological biomarkers (anti-Saccharomyces cerevisiae antibodies and anti-flagellin antibodies) and one genetic biomarker (nucleotide-binding oligomerisation domain-containing protein 2) displayed statistically significant prognostic potential. Overall, the strongest association with subsequent severe disease was identified for Montreal B2 and B3 categories (odds ratio 4.09 and 6.25, respectively).
Limitations
Definitions of severe disease varied widely, and some studies confounded diagnosis and prognosis. Risk of bias was rated as ‘high’ in 92% of studies overall. Some biomarkers that are used regularly in daily practice, for example C-reactive protein, were studied too infrequently for meta-analysis.
Conclusions
Research for individual biomarkers to predict severe Crohn’s disease is scant, heterogeneous and at a high risk of bias. Despite a large volume of potentially relevant research, we encountered relatively few biomarkers with data sufficient for meta-analysis, identifying only eight biomarkers with potential predictive capability.
Future work
We will use existing data sets to develop and then validate a predictive model based on the potential predictors identified by this systematic review. Contingent on the outcome of that research, a prospective external validation may prove clinically desirable.
Study registration
This study is registered as PROSPERO CRD42016029363.
Funding
This project was funded by the National Institute for Health Research (NIHR) Health Technology Assessment programme and will be published in full in Health Technology Assessment; Vol. 25, No. 45. See the NIHR Journals Library website for further project information.
In prediction model research, external validation is needed to examine an existing model's performance using data independent of that used for model development. Current external validation studies often suffer from small sample sizes and consequently imprecise estimates of predictive performance. To address this, we propose how to determine the minimum sample size needed for a new external validation study of a prediction model for a binary outcome. Our calculations aim to precisely estimate calibration (Observed/Expected and calibration slope), discrimination (C‐statistic), and clinical utility (net benefit). For each measure, we propose closed‐form and iterative solutions for calculating the minimum sample size required. These require specifying: (i) target SEs (confidence interval widths) for each estimate of interest, (ii) the anticipated outcome event proportion in the validation population, (iii) the prediction model's anticipated (mis)calibration and variance of linear predictor values in the validation population, and (iv) potential risk thresholds for clinical decision‐making. The calculations can also be used to inform whether the sample size of an existing (already collected) dataset is adequate for external validation. We illustrate our proposal for external validation of a prediction model for mechanical heart valve failure with an expected outcome event proportion of 0.018. Calculations suggest at least 9835 participants (177 events) are required to precisely estimate the calibration and discrimination measures, with this number driven by the calibration slope criterion, which we anticipate will often be the case. Also, 6443 participants (116 events) are required to precisely estimate net benefit at a risk threshold of 8%. Software code is provided.
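For flavour, a minimal base R sketch of one of the closed-form pieces: the sample size needed to estimate the O/E (observed/expected) ratio precisely, under the standard assumption that the model is well calibrated (O/E = 1), so that SE(ln(O/E)) is approximately sqrt((1 − φ)/(nφ)). The target confidence interval width below is an illustrative assumption, not a value from the paper.

# Sketch: minimum n to estimate O/E with a chosen precision (assumes O/E = 1)
phi <- 0.018                    # anticipated outcome event proportion
ciwidth <- 1                    # illustrative target width for the 95% CI of O/E
# With O/E = 1 the 95% CI is exp(+/- 1.96 * SE(ln(O/E))), so its width is
# 2*sinh(1.96*SE); solve for the SE that achieves the target width:
se_target <- asinh(ciwidth / 2) / 1.96
n <- ceiling((1 - phi) / (phi * se_target^2))
n                               # minimum participants for the O/E criterion
n * phi                         # implied number of events

As the abstract notes, the calibration slope criterion typically demands a larger n than O/E, so in practice the largest of the per-measure sample sizes is taken.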
When developing a clinical prediction model, penalization techniques are recommended to address overfitting, as they shrink predictor effect estimates toward the null and reduce mean squared prediction error in new individuals. However, shrinkage and penalty terms (‘tuning parameters’) are estimated with uncertainty from the development data set. We examined the magnitude of this uncertainty and its subsequent impact on prediction model performance.
This study comprises applied examples and a simulation study of the following methods: uniform shrinkage (estimated via a closed-form solution or bootstrapping), ridge regression, the lasso, and elastic net.
In a particular model development data set, penalization methods can be unreliable because tuning parameters are estimated with large uncertainty. This is of most concern when development data sets have a small effective sample size and the model's Cox-Snell R2 is low. The problem can lead to considerable miscalibration of model predictions in new individuals.
Penalization methods are not a ‘carte blanche’; they do not guarantee a reliable prediction model is developed. Indeed, they are most unreliable when needed most (i.e., when overfitting may be large). We recommend applying them only when large effective sample sizes are available, as identified from recent sample size calculations that aim to minimize the potential for model overfitting and precisely estimate key parameters.
• When developing a clinical prediction model, penalization and shrinkage techniques are recommended to address overfitting.
• Some methodology articles suggest penalization methods are a ‘carte blanche’ and resolve any issues to do with overfitting.
• We show that penalization methods can be unreliable, as their unknown shrinkage and tuning parameters are often estimated with large uncertainty.
• Although penalization methods will, on average, improve on standard estimation methods, in a particular data set they are often unreliable.
• The most problematic data sets are those with small effective sample sizes and where the developed model has a Cox-Snell R2 far from 1, which is common for prediction models of binary and time-to-event outcomes.
• Penalization methods are best used when a sufficiently large development data set is available, as identified from sample size calculations that aim to minimize the potential for model overfitting and precisely estimate key parameters.
• When the sample size is adequately large, any of the studied penalization or shrinkage methods can be used, as they should perform similarly to each other and better than unpenalized regression, unless the sample size is extremely large and the apparent R2 is large.
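To see the tuning parameter uncertainty concretely, a small R sketch (assuming the glmnet package and simulated data; an illustration of the phenomenon, not the study's own simulation design) repeats cross-validated lasso tuning on the same small data set and shows how widely the selected penalty varies.

# Sketch: variability of the lasso tuning parameter across repeated cross-validation
library(glmnet)
set.seed(1)
n <- 150; p <- 10                        # small effective sample size on purpose
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(0.7 * x[, 1] - 0.4 * x[, 2]))

lambdas <- replicate(100, {
  cv <- cv.glmnet(x, y, family = "binomial")  # new random fold split each time
  cv$lambda.min
})
quantile(lambdas, c(0.025, 0.5, 0.975))  # a wide spread signals unstable tuning
# The implied shrinkage (and hence the model's calibration in new individuals)
# depends on the chosen lambda, which is itself estimated with uncertainty.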
Clinical prediction models provide individualized outcome predictions to inform patient counseling and clinical decision making. External validation is the process of examining a prediction model's performance in data independent of that used for model development. Current external validation studies often suffer from small sample sizes, and subsequently imprecise estimates of a model's predictive performance. To address this, we propose how to determine the minimum sample size needed for external validation of a clinical prediction model with a continuous outcome. Four criteria are proposed that target precise estimates of (i) R2 (the proportion of variance explained), (ii) calibration‐in‐the‐large (agreement between predicted and observed outcome values on average), (iii) calibration slope (agreement between predicted and observed values across the range of predicted values), and (iv) the variance of observed outcome values. Closed‐form sample size solutions are derived for each criterion, which require the user to specify anticipated values of the model's performance (in particular R2) and the outcome variance in the external validation dataset. A sensible starting point is to base values on those for the model development study, as obtained from the publication or study authors. The largest sample size required to meet all four criteria is the recommended minimum sample size needed in the external validation dataset. The calculations can also be applied to estimate expected precision when an existing dataset with a fixed sample size is available, to help gauge if it is adequate. We illustrate the proposed methods on a case‐study predicting fat‐free mass in children.
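As one hedged illustration of these criteria, the base R sketch below computes a minimum n for precise estimation of R2 using the familiar large-sample approximation var(R2) ≈ 4R2(1 − R2)2/n; the paper's exact closed forms differ slightly, and the anticipated R2 and target confidence interval width here are illustrative assumptions.

# Sketch: minimum n for precise estimation of R2 (large-sample approximation)
R2 <- 0.5                        # anticipated R2, e.g. from the development study
ciwidth <- 0.1                   # illustrative target width for the 95% CI of R2
se_target <- ciwidth / (2 * 1.96)
n <- ceiling(4 * R2 * (1 - R2)^2 / se_target^2)
n                                # minimum participants for the R2 criterion

The same pattern (anticipated value in, target precision in, n out) applies to the other three criteria, and the largest resulting n is the one recommended.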
• After a clinical prediction model is developed, it is usually necessary to undertake an external validation study that examines the model's performance in new data from the same or a different population. External validation studies should have an appropriate sample size, in order to estimate model performance measures precisely for calibration, discrimination and clinical utility.
• Rules-of-thumb suggest at least 100 events and 100 non-events. Such blanket guidance is imprecise, and not specific to the model or validation setting.
• Our work shows that precision of performance estimates is affected by the model's linear predictor (LP) distribution, in addition to the number of events and total sample size. Furthermore, sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration.
• Our new proposal uses a simulation-based sample size calculation, which accounts for the LP distribution and (mis)calibration in the validation sample, and calculates the sample size (and events) required conditional on these factors.
• The approach requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc), the model's anticipated LP distribution in the validation population, and whether or not the model is well calibrated. Guidance on how to specify these values is given, and R and Stata code is provided.
Sample size “rules-of-thumb” for external validation of clinical prediction models suggest at least 100 events and 100 non-events. Such blanket guidance is imprecise, and not specific to the model or validation setting. We investigate factors affecting precision of model performance estimates upon external validation, and propose a more tailored sample size approach.
Simulation of logistic regression prediction models to investigate factors associated with precision of performance estimates. Then, explanation and illustration of a simulation-based approach to calculate the minimum sample size required to precisely estimate a model's calibration, discrimination and clinical utility.
Precision is affected by the model's linear predictor (LP) distribution, in addition to number of events and total sample size. Sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration. The simulation-based calculation accounts for the LP distribution and (mis)calibration in the validation sample. Application identifies 2430 required participants (531 events) for external validation of a deep vein thrombosis diagnostic model.
Where researchers can anticipate the distribution of the model's LP (eg, based on development sample, or a pilot study), a simulation-based approach for calculating sample size for external validation offers more flexibility and reliability than rules-of-thumb.
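A minimal base R sketch of the simulation-based idea (assuming a normally distributed LP and a well-calibrated model; all input values are illustrative, and the authors' fuller R and Stata code should be preferred in practice): simulate validation data sets of a candidate size, refit the calibration model, and check the expected confidence interval width for the calibration slope.

# Sketch: expected precision of the calibration slope at a candidate sample size
set.seed(7)
mu <- -2.3; sigma <- 1.0          # anticipated LP distribution in the validation data
slope_ci_width <- function(n, nsim = 500) {
  widths <- replicate(nsim, {
    lp <- rnorm(n, mu, sigma)     # simulated linear predictor values
    y  <- rbinom(n, 1, plogis(lp))  # outcomes under perfect calibration
    fit <- glm(y ~ lp, family = binomial)  # calibration model: intercept + slope
    2 * 1.96 * summary(fit)$coefficients["lp", "Std. Error"]
  })
  mean(widths)                    # expected 95% CI width for the slope
}
slope_ci_width(1000)              # too wide? increase n and repeat
slope_ci_width(4000)              # iterate until the target width (e.g. 0.4) is met

Anticipated miscalibration can be built in by generating y from a distorted LP (e.g. 0.3 + 0.8*lp) while still fitting the calibration model to the original lp values.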
Objective
To examine the association between antihypertensive treatment and specific adverse events.
Design
Systematic review and meta-analysis.
Eligibility criteria
Randomised controlled trials of adults receiving antihypertensives compared with placebo or no treatment, more antihypertensive drugs compared with fewer antihypertensive drugs, or higher blood pressure targets compared with lower targets. To avoid small early phase trials, studies were required to have at least 650 patient years of follow-up.
Information sources
Searches were conducted in Embase, Medline, CENTRAL, and the Science Citation Index databases from inception until 14 April 2020.
Main outcome measures
The primary outcome was falls during trial follow-up. Secondary outcomes were acute kidney injury, fractures, gout, hyperkalaemia, hypokalaemia, hypotension, and syncope. Additional outcomes related to death and major cardiovascular events were extracted. Risk of bias was assessed using the Cochrane risk of bias tool, and random effects meta-analysis was used to pool rate ratios, odds ratios, and hazard ratios across studies, allowing for between study heterogeneity (τ2).
Results
Of 15 023 articles screened for inclusion, 58 randomised controlled trials were identified, including 280 638 participants followed up for a median of 3 (interquartile range 2-4) years. Most of the trials (n=40, 69%) had a low risk of bias. Among seven trials reporting data for falls, no evidence was found of an association with antihypertensive treatment (summary risk ratio 1.05, 95% confidence interval 0.89 to 1.24, τ2=0.009). Antihypertensives were associated with an increased risk of acute kidney injury (1.18, 95% confidence interval 1.01 to 1.39, τ2=0.037, n=15), hyperkalaemia (1.89, 1.56 to 2.30, τ2=0.122, n=26), hypotension (1.97, 1.67 to 2.32, τ2=0.132, n=35), and syncope (1.28, 1.03 to 1.59, τ2=0.050, n=16). The heterogeneity between studies assessing acute kidney injury and hyperkalaemia events was reduced when focusing on drugs that affect the renin angiotensin-aldosterone system. Results were robust to sensitivity analyses focusing on adverse events leading to withdrawal from each trial. Antihypertensive treatment was associated with a reduced risk of all cause mortality, cardiovascular death, and stroke, but not of myocardial infarction.
Conclusions
This meta-analysis found no evidence to suggest that antihypertensive treatment is associated with falls but found evidence of an association with mild (hyperkalaemia, hypotension) and severe adverse events (acute kidney injury, syncope). These data could be used to inform shared decision making between doctors and patients about initiation and continuation of antihypertensive treatment, especially in patients at high risk of harm because of previous adverse events or poor renal function.
Registration
PROSPERO CRD42018116860.
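For readers wanting to reproduce this style of pooling, a generic base R sketch of the DerSimonian-Laird random-effects estimator, which yields a pooled ratio and the between study heterogeneity τ2; the study-level values below are made up for illustration and are not data from this review.

# Sketch: DerSimonian-Laird random-effects pooling of log risk ratios
yi <- log(c(1.20, 0.95, 1.45, 1.10))   # study-level log risk ratios (made up)
vi <- c(0.04, 0.09, 0.06, 0.03)        # their within-study variances (made up)
wi <- 1 / vi                           # fixed-effect (inverse variance) weights
Q  <- sum(wi * (yi - sum(wi * yi) / sum(wi))^2)          # Cochran's Q
tau2 <- max(0, (Q - (length(yi) - 1)) /
              (sum(wi) - sum(wi^2) / sum(wi)))           # between study variance
wre <- 1 / (vi + tau2)                 # random-effects weights
est <- sum(wre * yi) / sum(wre)        # pooled log risk ratio
se  <- sqrt(1 / sum(wre))
exp(c(est, est - 1.96 * se, est + 1.96 * se))  # pooled ratio with 95% CI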
Previous articles in Statistics in Medicine describe how to calculate the sample size required for external validation of prediction models with continuous and binary outcomes. The minimum sample size criteria aim to ensure precise estimation of key measures of a model's predictive performance, including measures of calibration, discrimination, and net benefit. Here, we extend the sample size guidance to prediction models with a time‐to‐event (survival) outcome, to cover external validation in datasets containing censoring. A simulation‐based framework is proposed, which calculates the sample size required to target a particular confidence interval width for the calibration slope measuring the agreement between predicted risks (from the model) and observed risks (derived using pseudo‐observations to account for censoring) on the log cumulative hazard scale. Precise estimation of calibration curves, discrimination, and net‐benefit can also be checked in this framework. The process requires assumptions about the validation population in terms of the (i) distribution of the model's linear predictor and (ii) event and censoring distributions. Existing information can inform this; in particular, the linear predictor distribution can be approximated using the C‐index or Royston's D statistic from the model development article, together with the overall event risk. We demonstrate how the approach can be used to calculate the sample size required to validate a prediction model for recurrent venous thromboembolism. Ideally the sample size should ensure precise calibration across the entire range of predicted risks, but must at least ensure adequate precision in regions important for clinical decision‐making. Stata and R code are provided.
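As a hedged sketch of the pseudo-observation ingredient (assuming the survival package and simulated data; the authors' own Stata and R code should be preferred), the jackknife construction below produces pseudo-values for the event risk by a time horizon, which can then be regressed on transformed predicted risks to assess calibration under censoring.

# Sketch: jackknife pseudo-observations for the risk of an event by time t0
library(survival)
set.seed(3)
n <- 300
time   <- rexp(n, rate = 0.1 * exp(rnorm(n, 0, 0.5)))  # simulated event times
cens   <- rexp(n, rate = 0.05)                          # simulated censoring times
status <- as.integer(time <= cens)
obs    <- pmin(time, cens)
t0 <- 5                                                 # illustrative prediction horizon

km_risk <- function(obs, status, t0) {                  # 1 - S(t0) from Kaplan-Meier
  fit <- survfit(Surv(obs, status) ~ 1)
  1 - summary(fit, times = t0, extend = TRUE)$surv
}
F_all <- km_risk(obs, status, t0)
pseudo <- sapply(seq_len(n), function(i)                # leave-one-out jackknife
  n * F_all - (n - 1) * km_risk(obs[-i], status[-i], t0))
# 'pseudo' can now be regressed on transformed predicted risks to estimate a
# calibration slope on the chosen scale while accounting for censoring.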
External validation studies are an important but often neglected part of prediction model research. In this article, the second in a series on model evaluation, Riley and colleagues explain what an external validation study entails and describe the key steps involved, from establishing a high quality dataset to evaluating a model’s predictive performance and clinical usefulness.