Abstract
As machine learning research in the field of cardiovascular imaging continues to grow, obtaining reliable model performance estimates is critical to develop reliable baselines and compare different algorithms. While the machine learning community has generally accepted methods such as k-fold stratified cross-validation (CV) to be more rigorous than single-split validation, the standard research practice in medical fields is the use of single-split validation techniques. This is especially concerning given the relatively small sample sizes of datasets used for cardiovascular imaging. We aim to examine how train-test split variation impacts the stability of machine learning (ML) model performance estimates in several validation techniques on two real-world cardiovascular imaging datasets: stratified split-sample validation (70/30 and 50/50 train-test splits), tenfold stratified CV, 10× repeated tenfold stratified CV, bootstrapping (500× repeated), and leave-one-out (LOO) validation. We demonstrate that split-validation methods lead to the highest range in AUC and statistically significant differences in ROC curves, unlike the other approaches. When building predictive models on relatively small datasets, as is often the case in medical imaging, split-sample validation techniques can produce unstable performance estimates, with AUC values varying over a range of more than 0.15; any of the alternative validation methods is therefore recommended.
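The instability described above can be illustrated with a small sketch (not the authors' code; a synthetic dataset and logistic regression stand in for the imaging models): repeating a 70/30 stratified split with different random seeds spreads the AUC estimate much more widely than repeating tenfold stratified CV.

```python
# Sketch: split-sample validation vs. repeated stratified 10-fold CV.
# Synthetic data and a simple classifier stand in for the imaging models.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 70/30 stratified split-sample validation, repeated with different seeds.
split_aucs = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Repeated tenfold stratified CV: each repeat averages ten folds,
# so the resulting AUC estimate varies far less across repeats.
cv_aucs = []
for seed in range(20):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=cv, scoring="roc_auc")
    cv_aucs.append(scores.mean())

split_range = max(split_aucs) - min(split_aucs)
cv_range = max(cv_aucs) - min(cv_aucs)
print(f"AUC range, split-sample validation: {split_range:.3f}")
print(f"AUC range, repeated 10-fold CV:     {cv_range:.3f}")
```

On small samples the split-sample range is typically several times larger than the repeated-CV range, which is the effect the abstract quantifies.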
Combined analysis of SPECT myocardial perfusion imaging (MPI) performed with a solid-state camera on patients in 2 positions (semiupright, supine) is routinely used to mitigate attenuation artifacts. We evaluated the prediction of obstructive disease from combined analysis of semiupright and supine stress MPI by deep learning (DL) as compared with standard combined total perfusion deficit (TPD).
A total of 1,160 patients without known coronary artery disease (64% male) were studied. Patients underwent stress 99mTc-sestamibi MPI with new-generation solid-state SPECT scanners in 4 different centers. All patients had on-site clinical reads and invasive coronary angiography correlations within 6 mo of MPI. Obstructive disease was defined as at least 70% narrowing of the 3 major coronary arteries and at least 50% for the left main coronary artery. Images were quantified at Cedars-Sinai. The left ventricular myocardium was segmented using standard clinical nuclear cardiology software. The contour placement was verified by an experienced technologist. Combined stress TPD was computed using sex- and camera-specific normal limits. DL was trained using polar distributions of normalized radiotracer counts, hypoperfusion defects, and hypoperfusion severities and was evaluated for prediction of obstructive disease in a novel leave-one-center-out cross-validation procedure equivalent to external validation. During the validation procedure, 4 DL models were trained using data from 3 centers and then evaluated on the 1 center left aside. Predictions for each center were merged to have an overall estimation of the multicenter performance.
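The leave-one-center-out procedure described above maps directly onto scikit-learn's `LeaveOneGroupOut`; a minimal sketch follows, with synthetic data and hypothetical center labels standing in for the four sites.

```python
# Sketch of leave-one-center-out cross-validation: train on 3 centers,
# evaluate on the held-out center, then pool held-out predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
centers = np.repeat([0, 1, 2, 3], 100)  # hypothetical: 4 centers, 100 patients each

all_true, all_pred = [], []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=centers):
    # One model per fold, never seeing the held-out center during training.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    all_true.append(y[test_idx])
    all_pred.append(model.predict_proba(X[test_idx])[:, 1])

# Merge the 4 held-out prediction sets for an overall multicenter estimate.
auc = roc_auc_score(np.concatenate(all_true), np.concatenate(all_pred))
print(f"Pooled leave-one-center-out AUC: {auc:.3f}")
```

Because every prediction comes from a model that never saw that patient's center, the pooled AUC behaves like an external-validation estimate.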
718 (62%) patients and 1,272 of 3,480 (37%) arteries had obstructive disease. The area under the receiver operating characteristic curve for prediction of disease on a per-patient and per-vessel basis by DL was higher than for combined TPD (per-patient, 0.81 vs. 0.78; per-vessel, 0.77 vs. 0.73; P < 0.001). With the DL cutoff set to exhibit the same specificity as the standard cutoff for combined TPD, per-patient sensitivity improved from 61.8% (TPD) to 65.6% (DL) (P < 0.05), and per-vessel sensitivity improved from 54.6% (TPD) to 59.1% (DL) (P < 0.01). With the threshold matched to the specificity of a normal clinical read (56.3%), DL had a sensitivity of 84.8%, versus 82.6% for an on-site clinical read (P = 0.3).
DL improves automatic interpretation of MPI as compared with current quantitative methods.
Optimal risk stratification with machine learning (ML) from myocardial perfusion imaging (MPI) includes both clinical and imaging data. While most imaging variables can be derived automatically, clinical variables require manual collection, which is time-consuming and prone to error. We determined the fewest manually input and imaging variables required to maintain the prognostic accuracy for major adverse cardiac events (MACE) in patients undergoing a single-photon emission computed tomography (SPECT) MPI.
This study included 20,414 patients from the multicentre REFINE SPECT registry and 2,984 from the University of Calgary for training and external testing of the ML models, respectively. ML models were trained using all variables (ML-All) and all image-derived variables (including age and sex, ML-Image). Next, ML models were sequentially trained by incrementally adding manually input and imaging variables to baseline ML models based on their importance ranking. The fewest variables were determined as the ML models (ML-Reduced, ML-Minimum, and ML-Image-Reduced) that achieved comparable prognostic performance to ML-All and ML-Image. Prognostic accuracy of the ML models was compared with visual diagnosis, stress total perfusion deficit (TPD), and traditional multivariable models using area under the receiver-operating characteristic curve (AUC). ML-Minimum (AUC 0.798) obtained comparable prognostic accuracy to ML-All (AUC 0.799, P = 0.19) by including 12 of 40 manually input variables and 11 of 58 imaging variables. ML-Reduced achieved comparable accuracy (AUC 0.796) with a reduced set of manually input variables and all imaging variables. In external validation, the ML models also obtained comparable or higher prognostic accuracy than traditional multivariable models.
Reduced ML models, including a minimum set of manually collected or imaging variables, achieved slightly lower accuracy compared to a full ML model but outperformed standard interpretation methods and risk models. ML models with fewer collected variables may be more practical for clinical implementation.
Abstract
Aims
To optimize per-vessel prediction of early coronary revascularization (ECR) within 90 days after fast single-photon emission computed tomography (SPECT) myocardial perfusion imaging (MPI) using machine learning (ML) and introduce a method for a patient-specific explanation of ML results in a clinical setting.
Methods and results
A total of 1980 patients with suspected coronary artery disease (CAD) who underwent stress/rest 99mTc-sestamibi/tetrofosmin MPI with new-generation SPECT scanners were included. All patients had invasive coronary angiography within 6 months after SPECT MPI. ML utilized 18 clinical, 9 stress test, and 28 imaging variables to predict per-vessel and per-patient ECR with 10-fold cross-validation. The area under the receiver operating characteristic curve (AUC) of ML was compared with standard quantitative analysis total perfusion deficit (TPD) and expert interpretation. ECR was performed in 958 patients (48%). Per-vessel, the AUC of ECR prediction by ML (AUC 0.79, 95% confidence interval (CI) 0.77, 0.80) was higher than by regional stress TPD (AUC 0.71, 95% CI 0.70, 0.73), combined-view stress TPD (AUC 0.71, 95% CI 0.69, 0.72), or ischaemic TPD (AUC 0.72, 95% CI 0.71, 0.74), all P < 0.001. Per-patient, the AUC of ECR prediction by ML (AUC 0.81, 95% CI 0.79, 0.83) was higher than that of stress TPD, combined-view TPD, and ischaemic TPD, all P < 0.001. ML also outperformed nuclear cardiologists’ expert interpretation of MPI for the prediction of early revascularization. A method to explain the ML prediction for an individual patient was also developed.
Conclusion
In patients with suspected CAD, the prediction of ECR by ML outperformed automatic MPI quantitation by TPDs (per-vessel and per-patient) or nuclear cardiologists’ expert interpretation (per-patient).
Explainable artificial intelligence (AI) can be integrated within standard clinical software to facilitate the acceptance of the diagnostic findings during clinical interpretation.
This study sought to develop and evaluate a novel, general purpose, explainable deep learning model (coronary artery disease–deep learning [CAD-DL]) for the detection of obstructive CAD following single-photon emission computed tomography (SPECT) myocardial perfusion imaging (MPI).
A total of 3,578 patients with suspected CAD undergoing SPECT MPI and invasive coronary angiography within a 6-month interval from 9 centers were studied. CAD-DL computes the probability of obstructive CAD from stress myocardial perfusion, wall motion, and wall thickening maps, as well as left ventricular volumes, age, and sex. Myocardial regions contributing to the CAD-DL prediction are highlighted to explain the findings to the physician. A clinical prototype was integrated using a standard clinical workstation. Diagnostic performance by CAD-DL was compared to automated quantitative total perfusion deficit (TPD) and reader diagnosis.
In total, 2,247 patients (63%) had obstructive CAD. In 10-fold repeated testing, the area under the receiver-operating characteristic curve (AUC) (95% CI) was higher according to CAD-DL (AUC: 0.83; 95% CI: 0.82-0.85) than stress TPD (AUC: 0.78; 95% CI: 0.77-0.80) or reader diagnosis (AUC: 0.71; 95% CI: 0.69-0.72; P < 0.0001 for both). In external testing, the AUC in 555 patients was higher according to CAD-DL (AUC: 0.80; 95% CI: 0.76-0.84) than stress TPD (AUC: 0.73; 95% CI: 0.69-0.77) or reader diagnosis (AUC: 0.65; 95% CI: 0.61-0.69; P < 0.001 for all). The present model can be integrated within standard clinical software and generates results rapidly (<12 seconds on a standard clinical workstation) and therefore could readily be incorporated into a typical clinical workflow.
The deep-learning model significantly surpasses the diagnostic accuracy of standard quantitative analysis and clinical visual reading for MPI. Explainable artificial intelligence can be integrated within standard clinical software to facilitate acceptance of artificial intelligence diagnosis of CAD following MPI.
Chest computed tomography is one of the most common diagnostic tests, with 15 million scans performed annually in the United States. Coronary calcium can be visualized on these scans, but other measures of cardiac risk such as atrial and ventricular volumes have classically required administration of contrast. Here we show that a fully automated pipeline, incorporating two artificial intelligence models, automatically quantifies coronary calcium, left atrial volume, left ventricular mass, and other cardiac chamber volumes in 29,687 patients from three cohorts. The model processes chamber volumes and coronary artery calcium with an end-to-end time of ~18 s, while failing to segment only 0.1% of cases. Coronary calcium, left atrial volume, and left ventricular mass index are independently associated with all-cause and cardiovascular mortality and significantly improve risk classification compared to identification of abnormalities by a radiologist. This automated approach can be integrated into clinical workflows to improve identification of abnormalities and risk stratification, allowing physicians to improve clinical decision-making.
Purpose
We sought to evaluate inter-scan and inter-reader agreement of coronary calcium (CAC) scores obtained from dedicated, ECG-gated CAC scans (standard CAC scan) and ultra-low-dose, ungated computed tomography attenuation correction (CTAC) scans obtained routinely during cardiac PET/CT imaging.
Methods
From 2,928 consecutive patients who underwent same-day 82Rb cardiac PET/CT and a gated CAC scan in the same hybrid PET/CT scanning session, we randomly selected 200 cases with no history of revascularization. Standard CAC scans and ungated CTAC scans were scored by two readers using quantitative clinical software. We assessed the agreement between readers and between the two scan protocols in 5 CAC categories (0, 1–10, 11–100, 101–400, and > 400) using Cohen’s Kappa and concordance.
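As an illustration of this agreement analysis, a minimal sketch (with hypothetical CAC scores, not study data) that maps scores into the five categories above and computes Cohen's Kappa and raw concordance:

```python
# Sketch: inter-reader agreement on CAC categories via Cohen's Kappa.
# The scores below are hypothetical, not study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def cac_category(score):
    """Map a CAC score to category index: 0, 1-10, 11-100, 101-400, >400."""
    bins = [0, 10, 100, 400]  # upper edges of the first four categories
    return int(np.searchsorted(bins, score))

reader1 = [0, 5, 250, 80, 1200, 15]  # hypothetical CAC scores, reader 1
reader2 = [0, 12, 180, 95, 900, 9]   # hypothetical CAC scores, reader 2

cats1 = [cac_category(s) for s in reader1]
cats2 = [cac_category(s) for s in reader2]

kappa = cohen_kappa_score(cats1, cats2)        # chance-corrected agreement
concordance = np.mean(np.array(cats1) == np.array(cats2))  # raw agreement
print(f"kappa = {kappa:.2f}, concordance = {concordance:.2f}")
```

In the study a weighted Kappa variant and 95% CIs were reported; the sketch shows only the unweighted category-level computation.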
Results
Median age of patients was 70 years (interquartile range: 63–77), and 46% were male. The inter-scan concordance index and Cohen’s Kappa (95% confidence interval [CI]) for readers 1 and 2 were 0.69 and 0.75 (0.69, 0.81) and 0.72 and 0.80 (0.75, 0.85), respectively. The inter-reader concordance index and Cohen’s Kappa were higher for standard CAC scans (0.90 and 0.92 [0.89, 0.96]) than for CTAC scans (0.83 and 0.85 [0.79, 0.90]; p = 0.02 for the difference in Kappa). Most discordant readings between the two protocols occurred for scans with a low extent of calcification (CAC score < 100).
Conclusion
CAC can be quantitatively assessed on PET CTAC maps with good agreement with standard CAC scans, albeit with limited sensitivity for small lesions. CAC scoring of CTAC scans can be performed routinely without modification of the PET protocol or added radiation dose.
Low-dose ungated CT attenuation correction (CTAC) scans are commonly obtained with SPECT/CT myocardial perfusion imaging. Despite the characteristically low image quality of CTAC, deep learning (DL) can potentially quantify coronary artery calcium (CAC) from these scans in an automatic manner. We evaluated CAC quantification derived with a DL model, including correlation with expert annotations and associations with major adverse cardiovascular events (MACE).
We trained a convolutional long short-term memory DL model to automatically quantify CAC on CTAC scans using 6,608 studies (2 centers) and evaluated the model in an external cohort of patients without known coronary artery disease (n = 2,271) obtained in a separate center. We assessed agreement between DL and expert annotated CAC scores. We also assessed associations between MACE (death, revascularization, myocardial infarction, or unstable angina) and CAC categories (0, 1-100, 101-400, or >400) for scores manually derived by experienced readers and scores obtained fully automatically by DL using multivariable Cox models (adjusted for age, sex, past medical history, perfusion, and ejection fraction) and net reclassification index.
In the external testing population, DL CAC was 0 in 908 patients (40.0%), 1-100 in 596 (26.2%), 101-400 in 354 (15.6%), and >400 in 413 (18.2%). Agreement in CAC category between DL and expert annotation was excellent (linear weighted κ, 0.80), but DL CAC was obtained automatically in less than 2 s, compared with about 2.5 min for expert CAC. DL CAC category was an independent risk factor for MACE, with hazard ratios relative to a CAC of zero: CAC of 1-100 (2.20; 95% CI, 1.54-3.14; P < 0.001), CAC of 101-400 (4.58; 95% CI, 3.23-6.48; P < 0.001), and CAC of more than 400 (5.92; 95% CI, 4.27-8.22; P < 0.001). Overall, the net reclassification index was 0.494 for DL CAC, which was similar to expert annotated CAC (0.503).
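The categorical net reclassification index reported above can be sketched as follows (toy risk categories and outcomes, not study data): upward reclassification counts in favor of the new score for patients with events, downward reclassification for those without.

```python
# Sketch of the categorical net reclassification index (NRI).
# Data below are toy values, not study data.
import numpy as np

def nri(old_cat, new_cat, event):
    """Categorical NRI: NRI_events + NRI_nonevents."""
    old_cat, new_cat, event = map(np.asarray, (old_cat, new_cat, event))
    up, down = new_cat > old_cat, new_cat < old_cat
    ev, ne = event == 1, event == 0
    # Among events, moving up is correct reclassification; among
    # non-events, moving down is correct.
    nri_events = (up & ev).sum() / ev.sum() - (down & ev).sum() / ev.sum()
    nri_nonevents = (down & ne).sum() / ne.sum() - (up & ne).sum() / ne.sum()
    return nri_events + nri_nonevents

# Toy example: risk categories 0-3; event = 1 means MACE occurred.
old = [0, 1, 1, 2, 3, 0, 2, 1]
new = [1, 2, 1, 1, 3, 0, 3, 0]
ev  = [1, 1, 0, 0, 1, 0, 1, 0]
print(f"NRI = {nri(old, new, ev):.3f}")
```

A positive NRI indicates the new score reclassifies patients in the correct direction on balance; identical old and new categories give an NRI of zero.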
DL CAC from SPECT/CT attenuation maps agrees well with expert CAC annotations and provides a similar risk stratification but can be obtained automatically. DL CAC scores improved classification of a significant proportion of patients as compared with SPECT myocardial perfusion alone.
To improve diagnostic accuracy, myocardial perfusion imaging (MPI) SPECT studies can use CT-based attenuation correction (AC). However, CT-based AC is not available for most SPECT systems in clinical use, increases radiation exposure, and is impacted by misregistration. We developed and externally validated a deep-learning model to generate simulated AC images directly from non-AC (NC) SPECT, without the need for CT.
SPECT myocardial perfusion imaging was performed using 99mTc-sestamibi or 99mTc-tetrofosmin on contemporary scanners with solid-state detectors. We developed a conditional generative adversarial neural network that applies a deep learning model (DeepAC) to generate simulated AC SPECT images. The model was trained with short-axis NC and AC images performed at 1 site (n = 4,886) and was tested on patients from 2 separate external sites (n = 604). We assessed the diagnostic accuracy of the stress total perfusion deficit (TPD) obtained from NC, AC, and DeepAC images for obstructive coronary artery disease (CAD) with area under the receiver-operating-characteristic curve. We also quantified the direct count change among AC, NC, and DeepAC images on a per-voxel basis.
DeepAC could be obtained in less than 1 s from NC images; area under the receiver-operating-characteristic curve for obstructive CAD was higher for DeepAC TPD (0.79; 95% CI, 0.72-0.85) than for NC TPD (0.70; 95% CI, 0.63-0.78; P < 0.001) and similar to AC TPD (0.81; 95% CI, 0.75-0.87; P = 0.196). The normalcy rate in the low-likelihood-of-coronary-disease population was higher for DeepAC TPD (70.4%) and AC TPD (75.0%) than for NC TPD (54.6%; P < 0.001 for both). The positive count change (increase in counts) was significantly higher for AC versus NC (median, 9.4; interquartile range, 6.0-14.2; P < 0.001) than for AC versus DeepAC (median, 2.4; interquartile range, 1.3-4.2).
In an independent external dataset, DeepAC provided improved diagnostic accuracy for obstructive CAD, as compared with NC images, and this accuracy was similar to that of actual AC. DeepAC simplifies the task of artifact identification for physicians, avoids misregistration artifacts, and can be performed rapidly without the need for CT hardware and additional acquisitions.
This study compared the ability of automated myocardial perfusion imaging analysis to predict major adverse cardiac events (MACE) to that of visual analysis.
Quantitative analysis has not been compared with clinical visual analysis in prognostic studies.
A total of 19,495 patients from the multicenter REFINE SPECT (REgistry of Fast Myocardial Perfusion Imaging with NExt generation SPECT) study (64 ± 12 years of age, 56% males) undergoing stress Tc-99m-labeled single-photon emission computed tomography (SPECT) myocardial perfusion imaging were followed for 4.5 ± 1.7 years for MACE. Perfusion abnormalities were assessed visually and categorized as normal, probably normal, equivocal, or abnormal. Stress total perfusion deficit (TPD), quantified automatically, was categorized as TPD = 0%, >0% to <1%, ≥1% to <3%, ≥3% to <5%, ≥5% to ≤10%, or TPD >10%. MACE consisted of death, nonfatal myocardial infarction, unstable angina, or late revascularization (>90 days). Kaplan-Meier and Cox proportional hazards analyses were performed to test the performance of visual and quantitative assessments in predicting MACE.
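The Kaplan-Meier analysis mentioned above can be sketched with a minimal pure-NumPy estimator (toy follow-up times, not registry data): at each event time, survival is multiplied by the fraction of at-risk patients who remain event-free.

```python
# Minimal Kaplan-Meier estimator sketch (pure NumPy).
# Follow-up data below are toy values, not registry data.
import numpy as np

def kaplan_meier(times, events):
    """Return (unique event times, survival probability after each)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)  # 1 = event (MACE), 0 = censored
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = (times >= t).sum()                 # still under observation
        d = ((times == t) & (events == 1)).sum()     # events at time t
        s *= 1.0 - d / at_risk                       # product-limit update
        surv.append(s)
    return event_times, np.array(surv)

t, s = kaplan_meier([1, 2, 2, 3, 5, 5, 6], [1, 1, 0, 1, 0, 1, 0])
print(t, s)
```

Censored patients (event = 0) leave the risk set after their last follow-up but contribute no event, which is what distinguishes this estimator from a naive event-rate calculation; the hazard ratios in the study come from the separate Cox models.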
During follow-up, 2,760 (14.2%) MACE occurred. MACE rates increased with worsening visual assessments: 2.0% for normal, 3.2% for probably normal, 4.2% for equivocal, and 7.4% for abnormal (all p < 0.001). MACE rates increased with increasing stress TPD, from 1.3% for the TPD category of 0% to 7.8% for the TPD category of >10% (p < 0.0001). The adjusted hazard ratio (HR) for MACE was increased even for equivocal assessments (HR: 1.56; 95% confidence interval [CI]: 1.37 to 1.78) and for the TPD category of ≥3% to <5% (HR: 1.74; 95% CI: 1.41 to 2.14; all p < 0.001). The rate of MACE in patients visually assessed as normal still increased from 1.3% (TPD = 0%) to 3.4% (TPD ≥5%) (p < 0.0001).
Quantitative analysis allows precise granular risk stratification in comparison to visual reading, even for cases with normal clinical reading.