The use of artificial intelligence in medicine is currently an issue of great interest, especially with regard to the diagnostic or predictive analysis of medical images. Adoption of an artificial intelligence tool in clinical practice requires careful confirmation of its clinical utility. Herein, the authors explain key methodology points involved in a clinical evaluation of artificial intelligence technology for use in medicine, especially high-dimensional or overparameterized diagnostic or predictive models in which artificial deep neural networks are used, mainly from the standpoints of clinical epidemiology and biostatistics. First, statistical methods for assessing the discrimination and calibration performance of a diagnostic or predictive model are summarized. Next, the effects of disease manifestation spectrum and disease prevalence on the performance results are explained. The authors then discuss the difference between evaluating performance with internal and external datasets, the importance of using an adequate external dataset obtained from a well-defined clinical cohort to avoid overestimating clinical performance as a result of spectrum bias and of overfitting in high-dimensional or overparameterized classification models, and the essentials for achieving a more robust clinical evaluation. Finally, the authors review the role of clinical trials and observational outcome studies for ultimate clinical verification of diagnostic or predictive artificial intelligence tools through patient outcomes, beyond performance metrics, and how to design such studies.
RSNA, 2018.
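The two performance aspects distinguished in the abstract above, discrimination and calibration, reduce to simple computations. A minimal pure-Python sketch with hypothetical data; the function names and binning scheme are illustrative, not the authors' implementation:

```python
def auc(scores, labels):
    """Discrimination: probability that a randomly chosen diseased case gets a
    higher score than a randomly chosen non-diseased case (the Mann-Whitney
    formulation of the area under the ROC curve); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_bins(scores, labels, n_bins=2):
    """Calibration: per-bin mean predicted risk vs. observed event rate;
    a well-calibrated model has the two close in every bin."""
    out = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        items = [(s, y) for s, y in zip(scores, labels)
                 if lo <= s < hi or (b == n_bins - 1 and s == hi)]
        if items:
            mean_pred = sum(s for s, _ in items) / len(items)
            obs_rate = sum(y for _, y in items) / len(items)
            out.append((mean_pred, obs_rate))
    return out

if __name__ == "__main__":
    # Hypothetical predicted probabilities and true disease labels.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 0, 0]
    print(auc(scores, labels))
    print(calibration_bins(scores, labels))
```

Note that a model can discriminate well (high AUC) while being poorly calibrated, which is why the abstract treats the two as separate evaluation steps.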
Objectives
To assess the quality of current radiomics research on cardiac CT using the radiomics quality score (RQS) and Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) systems.
Methods
Systematic searches of PubMed and EMBASE were performed to identify all potentially relevant original research articles about cardiac CT radiomics. Fifteen original research articles were selected. Two cardiac radiologists assessed the quality of the methodology adopted in those studies according to the RQS and TRIPOD guidelines. Basic adherence rates for the following six key domains were evaluated: image protocol and reproducibility, feature reduction and validation, biologic/clinical utility, performance index, high level of evidence, and open science.
Results
Among the 15 included articles, six (40%) were about coronary artery disease and six (40%) were about myocardial infarction. The mean RQS was 9.9 ± 7.3 (27.4% of the ideal score of 36), and the basic adherence rate was 44.6%. Fourteen (93.3%) studies performed feature selection and nine (60%) performed validation, but only two (13.3%) performed external validation. Two studies (13.3%) were prospective, and only one study (6.7%) conducted a calibration analysis and stated the potential clinical utility. None of the studies conducted a phantom study or a cost-effectiveness analysis. The overall adherence rate for TRIPOD was 63%.
Conclusion
The quality of radiomics studies in cardiac CT is currently insufficient. A higher level of evidence is required, and analysis of clinical utility and calibration of model performance need to be improved.
Key Points
• The quality of science of radiomics studies in cardiac CT is currently insufficient.
• No study conducted a phantom study or a cost-effectiveness analysis, and a high level of evidence remains a key limitation of radiomics studies.
• Analysis of clinical utility and calibration of model performance need to be improved, and a higher level of evidence is required.
We evaluated the diagnostic performance and generalizability of traditional machine learning and deep learning models for distinguishing glioblastoma from single brain metastasis using radiomics. The training and external validation cohorts comprised 166 (109 glioblastomas and 57 metastases) and 82 (50 glioblastomas and 32 metastases) patients, respectively. A total of 265 radiomic features were extracted from semiautomatically segmented regions on contrast-enhancing and peritumoral T2 hyperintense masks and used as input data. For each of a deep neural network (DNN) and seven traditional machine learning classifiers combined with one of five feature selection methods, hyperparameters were optimized through tenfold cross-validation in the training cohort. The diagnostic performance of the optimized models and two neuroradiologists was tested in the validation cohort for distinguishing glioblastoma from metastasis. In the external validation, the DNN showed the highest diagnostic performance, with an area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy of 0.956 (95% confidence interval [CI], 0.918-0.990), 90.6% (95% CI, 80.5-100), 88.0% (95% CI, 79.0-97.0), and 89.0% (95% CI, 82.3-95.8), respectively, compared with the best-performing traditional machine learning model (adaptive boosting combined with tree-based feature selection; AUC, 0.890 [95% CI, 0.823-0.947]) and the human readers (AUC, 0.774 [95% CI, 0.685-0.852] and 0.904 [95% CI, 0.852-0.951]). These results demonstrate that deep learning using radiomic features can be useful for distinguishing glioblastoma from metastasis, with good generalizability.
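The model-selection scheme described above (tune hyperparameters by tenfold cross-validation in the training cohort, then test the frozen model once externally) can be sketched with a toy one-feature classifier. All data and names here are hypothetical, and the "model" is deliberately trivial:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle case indices once, then deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_accuracy(X, y, threshold, folds):
    """Cross-validated accuracy of a toy one-feature rule (predict positive
    when the feature >= threshold). The rule has no parameters to fit, so each
    fold simply scores its held-out cases; with a real model, the remaining
    folds would be used for training first."""
    correct = total = 0
    for val_idx in folds:
        for i in val_idx:
            correct += int((X[i] >= threshold) == y[i])
            total += 1
    return correct / total

if __name__ == "__main__":
    # Hypothetical training cohort: one radiomic feature per case, binary label.
    X = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.15, 0.9, 0.25, 0.85]
    y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
    folds = kfold_indices(len(X), k=5)
    # Grid search: keep the hyperparameter with the best cross-validated score;
    # the frozen model would then be evaluated once on the external cohort.
    best = max([0.3, 0.5, 0.7], key=lambda t: cv_accuracy(X, y, t, folds))
    print("selected threshold:", best)
```

The key point mirrored from the abstract is that the external cohort plays no role in tuning; it is touched only once, after the hyperparameters are fixed.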
The prevalence of abnormal cardiovascular magnetic resonance (CMR) findings in recovered coronavirus disease 2019 (COVID-19) patients is unclear. This study aimed to investigate the prevalence of abnormal CMR findings in recovered COVID-19 patients.
A systematic literature search was performed to identify studies that report the prevalence of abnormal CMR findings in recovered COVID-19 patients. The number of patients with abnormal CMR findings and diagnosis of myocarditis on CMR (based on the Lake Louise criteria) and each abnormal CMR parameter were extracted. Subgroup analyses were performed according to patient characteristics (athletes vs. non-athletes and normal vs. undetermined cardiac enzyme levels). The pooled prevalence and 95% confidence interval (CI) of each CMR finding were calculated. Study heterogeneity was assessed, and meta-regression analysis was performed to investigate factors associated with heterogeneity.
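The pooling step described above can be illustrated with a minimal fixed-effect, inverse-variance sketch over hypothetical study counts. A real prevalence meta-analysis would typically work on a logit or Freeman-Tukey transformed scale with a random-effects model; this simplified version only shows the weighting idea:

```python
import math

def pooled_prevalence(events, totals):
    """Inverse-variance pooling of per-study raw proportions, with a normal
    approximation 95% CI. Simplified fixed-effect sketch, not the study's
    actual (random-effects) method."""
    weights, estimates = [], []
    for e, n in zip(events, totals):
        p = e / n
        var = p * (1 - p) / n          # binomial variance of the proportion
        weights.append(1 / var)        # precise studies get more weight
        estimates.append(p)
    w_sum = sum(weights)
    pooled = sum(w * p for w, p in zip(weights, estimates)) / w_sum
    se = math.sqrt(1 / w_sum)
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

if __name__ == "__main__":
    # Hypothetical: four studies reporting patients with abnormal CMR findings.
    events = [20, 35, 12, 50]
    totals = [40, 80, 30, 100]
    p, (lo, hi) = pooled_prevalence(events, totals)
    print(f"pooled prevalence {p:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```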
In total, 890 patients from 16 studies were included in the analysis. The pooled prevalence of one or more abnormal CMR findings in recovered COVID-19 patients was 46.4% (95% CI 43.2%-49.7%). The pooled prevalence of myocarditis and late gadolinium enhancement (LGE) was 14.0% (95% CI 11.6%-16.8%) and 20.5% (95% CI 17.7%-23.6%), respectively. Further, heterogeneity was observed (I² > 50%, p < 0.1). In the subgroup analysis, the pooled prevalence of abnormal CMR findings and myocarditis was higher in non-athletes than in athletes (62.5% vs. 17.1% and 23.9% vs. 2.5%, respectively). Similarly, the pooled prevalence of abnormal CMR findings and LGE was higher in the undetermined than in the normal cardiac enzyme level subgroup (59.4% vs. 35.9% and 45.5% vs. 8.3%, respectively). Being an athlete was a significant independent factor related to heterogeneity in multivariate meta-regression analysis (p < 0.05).
Nearly half of recovered COVID-19 patients exhibited one or more abnormal CMR findings. Athletes and patients with normal cardiac enzyme levels showed a lower prevalence of abnormal CMR findings than non-athletes and patients with undetermined cardiac enzyme levels.
Trial registration
The study protocol was registered in the PROSPERO database (registration number: CRD42020225234).
Mammography is the current standard for breast cancer screening. This study aimed to develop an artificial intelligence (AI) algorithm for diagnosis of breast cancer in mammography, and to explore whether it could benefit radiologists by improving accuracy of diagnosis.
In this retrospective study, an AI algorithm was developed and validated with 170 230 mammography examinations collected from five institutions in South Korea, the USA, and the UK, including 36 468 cancer positive confirmed by biopsy, 59 544 benign confirmed by biopsy (8827 mammograms) or follow-up imaging (50 717 mammograms), and 74 218 normal. For the multicentre, observer-blinded, reader study, 320 mammograms (160 cancer positive, 64 benign, 96 normal) were independently obtained from two institutions. 14 radiologists participated as readers and assessed each mammogram in terms of likelihood of malignancy (LOM), location of malignancy, and necessity to recall the patient, first without and then with assistance of the AI algorithm. The performance of AI and radiologists was evaluated in terms of LOM-based area under the receiver operating characteristic curve (AUROC) and recall-based sensitivity and specificity.
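The recall-based sensitivity and specificity used in the reader study reduce to simple counts over the binary recall decisions. A minimal sketch with hypothetical reads (the data and function name are illustrative):

```python
def recall_metrics(recalled, cancer):
    """Sensitivity = recalled cancers / all cancers;
    specificity = not-recalled non-cancers / all non-cancers."""
    tp = sum(r and c for r, c in zip(recalled, cancer))          # recalled, cancer
    fn = sum((not r) and c for r, c in zip(recalled, cancer))    # missed cancer
    tn = sum((not r) and (not c) for r, c in zip(recalled, cancer))
    fp = sum(r and (not c) for r, c in zip(recalled, cancer))    # false recall
    return tp / (tp + fn), tn / (tn + fp)

if __name__ == "__main__":
    # Hypothetical: six mammograms, one reader's recall decisions vs. truth.
    recalled = [True, True, False, True, False, False]
    cancer = [True, True, True, False, False, False]
    sens, spec = recall_metrics(recalled, cancer)
    print(f"sensitivity {sens:.0%}, specificity {spec:.0%}")
```

The LOM-based AUROC, by contrast, is computed from the continuous likelihood-of-malignancy ratings rather than from the binary recall decision, which is why the study reports both.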
The AI standalone performance was AUROC 0·959 (95% CI 0·952–0·966) overall, and 0·970 (0·963–0·978) in the South Korea dataset, 0·953 (0·938–0·968) in the USA dataset, and 0·938 (0·918–0·958) in the UK dataset. In the reader study, the performance level of AI was 0·940 (0·915–0·965), significantly higher than that of the radiologists without AI assistance (0·810, 95% CI 0·770–0·850; p<0·0001). With the assistance of AI, radiologists' performance was improved to 0·881 (0·850–0·911; p<0·0001). AI was more sensitive than radiologists in detecting cancers with mass (53 [90%] vs 46 [78%] of 59 cancers detected; p=0·044) or distortion or asymmetry (18 [90%] vs ten [50%] of 20 cancers detected; p=0·023). AI was also better than radiologists in detection of T1 cancers (73 [91%] vs 59 [74%] of 80; p=0·0039) and node-negative cancers (104 [87%] vs 88 [74%] of 119; p=0·0025).
The AI algorithm developed with large-scale mammography data showed better diagnostic performance in breast cancer detection compared with radiologists. The significant improvement in radiologists' performance when aided by AI supports application of AI to mammograms as a diagnostic support tool.
Lunit.