Abstract
The rise of digital data and computing power has contributed to significant advancements in artificial intelligence (AI), leading to the use of classification and prediction models in health care to enhance clinical decision-making for diagnosis, treatment, and prognosis. However, such advances are limited by the lack of reporting standards for the data used to develop those models, the model architecture, and the model evaluation and validation processes. Here, we present MINIMAR (MINimum Information for Medical AI Reporting), a proposal describing the minimum information necessary to understand intended predictions, target populations, hidden biases, and the generalizability of these emerging technologies. We call for a standard to accurately and responsibly report on AI in health care. This will facilitate the design and implementation of these models and promote the development and use of associated clinical decision support tools, as well as manage concerns regarding accuracy and bias.
Objective: A key aspect of the precision medicine effort is the development of informatics tools that can analyze and interpret "big data" sets in an automated and adaptive fashion while providing accurate and actionable clinical information. The aims of this study were to develop machine learning algorithms for the identification of disease and the prognostication of mortality risk and to determine whether such models perform better than classical statistical analyses.
Methods: Focusing on peripheral artery disease (PAD), patient data were derived from a prospective, observational study of 1755 patients who presented for elective coronary angiography. We employed multiple supervised machine learning algorithms and used diverse clinical, demographic, imaging, and genomic information in a hypothesis-free manner to build models that could identify patients with PAD and predict future mortality. Comparison was made to standard stepwise logistic regression models.
Results: Our machine-learned models outperformed stepwise logistic regression models both for the identification of patients with PAD (area under the curve, 0.87 vs 0.76, respectively; P = .03) and for the prediction of future mortality (area under the curve, 0.76 vs 0.65, respectively; P = .10). Both machine-learned models were markedly better calibrated than the stepwise logistic regression models, thus providing more accurate disease and mortality risk estimates.
Conclusions: Machine learning approaches can produce more accurate disease classification and prediction models. These tools may prove clinically useful for the automated identification of patients with highly morbid diseases for which aggressive risk factor management can improve outcomes.
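The models above are compared by area under the ROC curve. As a minimal illustration of the metric itself (not the study's pipeline; all data below are invented), AUROC can be computed directly as the probability that a randomly chosen case is scored above a randomly chosen non-case:

```python
import numpy as np

def auroc(y_true, y_score):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    y_true = np.asarray(y_true)
    pos = np.asarray(y_score)[y_true == 1]
    neg = np.asarray(y_score)[y_true == 0]
    # Fraction of (case, non-case) pairs ranked correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 0, 1, 1, 1])
perfect = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])  # separates classes fully
chance = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])   # uninformative scores
print(auroc(y, perfect))  # 1.0
print(auroc(y, chance))   # 0.5
```

A higher AUROC (0.87 vs 0.76 in the study) means the model ranks diseased patients above non-diseased ones more often; it says nothing about calibration, which the study assessed separately.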
Making Machine Learning Models Clinically Useful
Shah, Nigam H; Milstein, Arnold; Bagley, Steven C
JAMA: The Journal of the American Medical Association, October 2019, Volume 322, Issue 14
Journal Article; Peer reviewed
This Viewpoint reviews conventional ways of assessing the performance of machine learning models that diagnose or predict outcomes, but emphasizes that if machine learning is to improve patient care, the models must be evaluated for their utility in improving clinical decisions, taking into account the range of decisions clinicians can take, the cost and efficacy of those options, and the likelihood that patients will follow the recommended decisions.
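One common way to operationalize this utility framing is decision curve analysis, in which a model's net benefit at a clinically chosen risk threshold weighs true positives against false positives by the threshold odds. The sketch below is a generic illustration with invented data, not the Viewpoint's own method:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at a given risk threshold: true positives
    per patient, minus false positives weighted by the threshold odds."""
    y_true = np.asarray(y_true)
    treat = np.asarray(y_prob) >= threshold  # patients the model flags
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Toy cohort: two true cases, two non-cases, evaluated at a 50% threshold.
nb = net_benefit([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1], 0.5)
print(nb)  # 0.25
```

The threshold encodes how clinicians trade off treating a non-case against missing a case, which is exactly the kind of decision context the Viewpoint argues must enter model evaluation.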
In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g., the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming, and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.
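A weakly supervised pipeline of this kind combines many noisy labeling sources and resolves their votes into training labels. The sketch below is a toy illustration of that idea with hypothetical labeling functions (`lf_symptom_lexicon`, `lf_lab_value`) and a simple majority vote; Trove itself uses richer ontology-derived sources and a learned label model rather than this simple resolution:

```python
from collections import Counter

ABSTAIN = -1  # a labeling function may decline to vote

# Hypothetical labeling functions: each votes POSITIVE (1), NEGATIVE (0),
# or abstains. A real system would derive these from medical ontologies
# and expert-written rules rather than tiny hard-coded lexicons.
def lf_symptom_lexicon(span):
    return 1 if span.lower() in {"fever", "cough", "dyspnea"} else ABSTAIN

def lf_lab_value(span):
    # Spans containing digits are treated as lab values, not symptom entities.
    return 0 if any(ch.isdigit() for ch in span) else ABSTAIN

def weak_label(span, lfs):
    """Resolve labeling-function votes by majority, ignoring abstentions."""
    votes = [v for lf in lfs if (v := lf(span)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

lfs = [lf_symptom_lexicon, lf_lab_value]
print(weak_label("Fever", lfs))   # 1
print(weak_label("120/80", lfs))  # 0
```

Because the supervision lives in shareable code and ontologies rather than in labeled notes, it avoids the privacy constraints the abstract describes.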
Predictive analytics in health care has generated increasing enthusiasm recently, as reflected in a rapidly growing body of predictive models reported in the literature and in real-time embedded models using electronic health record data. However, estimating the benefit of applying any single model to a specific clinical problem remains challenging today. Developing a shared framework for estimating model value is therefore critical to facilitate the effective, safe, and sustainable use of predictive tools into the future. We highlight key concepts within the prediction-action dyad that together are expected to impact model benefit. These include factors relevant to model prediction (including the number needed to screen) as well as those relevant to the subsequent action (number needed to treat). In the simplest terms, a number needed to benefit contextualizes the numbers needed to screen and treat, offering an opportunity to estimate the value of a clinical predictive model in action.
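One simple way to see how these quantities compose (an illustrative back-of-the-envelope calculation, not necessarily the authors' exact formulation): if a model must flag NNS patients to surface one true case, and NNT treated cases are needed for one patient to benefit, then roughly NNS × NNT patients must be screened per patient who benefits:

```python
def number_needed_to_benefit(nns, nnt):
    """Rough composition of screening and treatment yields: patients
    screened per patient who ultimately benefits (illustrative only)."""
    return nns * nnt

# Hypothetical values: flag 5 patients per true case (NNS = 5),
# treat 10 true cases per patient who benefits (NNT = 10).
print(number_needed_to_benefit(5, 10))  # 50
```

This makes concrete why a well-calibrated model (low NNS) paired with an ineffective intervention (high NNT) can still deliver little value per alert.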
To test the association of androgen deprivation therapy (ADT) in the treatment of prostate cancer with subsequent Alzheimer's disease risk.
We used a previously validated and implemented text-processing pipeline to analyze electronic medical record data in a retrospective cohort of patients at Stanford University and Mt. Sinai hospitals. Specifically, we extracted International Classification of Diseases-9th revision diagnosis and Current Procedural Terminology codes, medication lists, and positive-present mentions of drug and disease concepts from all clinical notes. We then tested the effect of ADT on risk of Alzheimer's disease using 1:5 propensity score-matched and traditional multivariable-adjusted Cox proportional hazards models. The duration of ADT use was also tested for association with Alzheimer's disease risk.
There were 16,888 individuals with prostate cancer meeting all inclusion and exclusion criteria, with 2,397 (14.2%) receiving ADT during a median follow-up period of 2.7 years (interquartile range, 1.0-5.4 years). Propensity score-matched analysis (hazard ratio, 1.88; 95% CI, 1.10 to 3.20; P = .021) and traditional multivariable-adjusted Cox regression analysis (hazard ratio, 1.66; 95% CI, 1.05 to 2.64; P = .031) both supported a statistically significant association between ADT use and Alzheimer's disease risk. We also observed a statistically significant increased risk of Alzheimer's disease with increasing duration of ADT (P = .016).
Our results support an association between the use of ADT in the treatment of prostate cancer and an increased risk of Alzheimer's disease in a general population cohort. This study demonstrates the utility of novel methods to analyze electronic medical record data to generate practice-based evidence.
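The 1:5 propensity score matching step used in this study can be sketched generically as greedy nearest-neighbour matching on precomputed propensity scores. This toy version (invented scores, no Cox model, no caliper) only illustrates the mechanics, not the study's implementation:

```python
import numpy as np

def match_1_to_5(ps_treated, ps_control, k=5):
    """Greedy 1:k nearest-neighbour matching on the propensity score.
    For each treated unit, returns indices of the k controls with the
    closest scores, matching without replacement."""
    available = list(range(len(ps_control)))
    matches = []
    for p in ps_treated:
        available.sort(key=lambda j: abs(ps_control[j] - p))
        chosen, available = available[:k], available[k:]
        matches.append(chosen)
    return matches

# Hypothetical propensity scores: two treated patients and twelve controls.
treated = np.array([0.30, 0.70])
control = np.array([0.10, 0.28, 0.31, 0.33, 0.29, 0.32,
                    0.68, 0.72, 0.69, 0.71, 0.70, 0.05])
m = match_1_to_5(treated, control)
print(m)  # controls clustered near 0.30, then controls near 0.70
```

Matching each ADT-treated patient to five controls with similar treatment propensity aims to balance measured confounders before the survival analysis; the subsequent Cox model is then fit on the matched cohort.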
Temporal dataset shift associated with changes in healthcare over time is a barrier to deploying machine learning-based clinical decision support systems. Algorithms that learn robust models by estimating invariant properties across time periods for domain generalization (DG) and unsupervised domain adaptation (UDA) might be suitable to proactively mitigate dataset shift. The objective was to characterize the impact of temporal dataset shift on clinical prediction models and benchmark DG and UDA algorithms on improving model robustness. In this cohort study, intensive care unit patients from the MIMIC-IV database were categorized by year groups (2008-2010, 2011-2013, 2014-2016 and 2017-2019). Tasks were predicting mortality, long length of stay, sepsis and invasive ventilation. Feedforward neural networks were used as prediction models. The baseline experiment trained models using empirical risk minimization (ERM) on 2008-2010 (ERM08-10) and evaluated them on subsequent year groups. The DG experiment trained models using algorithms that estimated invariant properties across 2008-2016 and evaluated them on 2017-2019. The UDA experiment leveraged unlabelled samples from 2017-2019 for unsupervised distribution matching. DG and UDA models were compared to ERM08-16 models trained using 2008-2016. Main performance measures were area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve and absolute calibration error. Threshold-based metrics including false-positives and false-negatives were used to assess the clinical impact of temporal dataset shift and its mitigation strategies. In the baseline experiments, dataset shift was most evident for sepsis prediction (maximum AUROC drop, 0.090; 95% confidence interval (CI), 0.080-0.101).
Considering a scenario of 100 consecutively admitted patients showed that ERM08-10 applied to 2017-2019 was associated with one additional false negative among 11 patients with sepsis, when compared to the model applied to 2008-2010. When compared with ERM08-16, the DG and UDA experiments failed to produce more robust models (range of AUROC difference, -0.003 to 0.050). In conclusion, DG and UDA failed to produce more robust models compared to ERM in the setting of temporal dataset shift. Alternate approaches are required to preserve model performance over time in clinical medicine.
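The threshold-based false-negative comparison can be illustrated with a toy harness: hold the model's decision threshold fixed and count missed cases in an earlier versus a drifted later cohort. All scores below are invented and only demonstrate the form of the analysis, not the study's results:

```python
import numpy as np

def false_negatives(y_true, y_prob, threshold):
    """Count true cases the model fails to flag at a fixed threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_prob) >= threshold
    return int(np.sum((y_true == 1) & ~flagged))

# Hypothetical cohorts of 11 septic patients each, scored by the same model.
y = np.ones(11, dtype=int)
scores_2008 = np.array([0.9, 0.8, 0.85, 0.7, 0.75, 0.9, 0.6, 0.8, 0.7, 0.65, 0.55])
# Simulated later cohort whose score distribution has drifted downward.
scores_2017 = np.array([0.85, 0.75, 0.8, 0.65, 0.7, 0.85, 0.55, 0.75, 0.65, 0.5, 0.6])

fn_old = false_negatives(y, scores_2008, 0.6)  # 1 missed case
fn_new = false_negatives(y, scores_2017, 0.6)  # 2 missed cases
```

Fixing the threshold isolates the clinical cost of dataset shift: the same model, unchanged, now misses more septic patients simply because the incoming score distribution has moved.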
Using data for 20,912 patients from 2 large academic health systems, we analyzed the frequency of severe acute respiratory syndrome coronavirus 2 reverse-transcription polymerase chain reaction test discordance among individuals initially testing negative by nasopharyngeal swab who were retested on clinical grounds within 7 days. The frequency of subsequent positivity within this window was 3.5% and was similar across institutions.