Objectives
To assess the diagnostic performance of a deep learning-based algorithm for automated detection of acute and chronic rib fractures on whole-body trauma CT.
Materials and methods
We retrospectively identified all whole-body trauma CT scans referred from the emergency department of our hospital from January to December 2018 (n = 511). Scans were categorized as positive (n = 159) or negative (n = 352) for rib fractures according to the clinically approved written CT reports, which served as the index test. The bone kernel series (1.5-mm slice thickness) served as input for a detection prototype algorithm trained to detect both acute and chronic rib fractures based on a deep convolutional neural network. It had previously been trained on an independent sample from eight other institutions (n = 11,455).
Results
All CTs except one were successfully processed (510/511). The algorithm achieved a sensitivity of 87.4% and a specificity of 91.5% on a per-examination level (per CT scan: rib fracture(s) present, yes/no). There were 0.16 false-positive findings per examination (81/510). On a per-finding level, there were 587 true-positive findings (sensitivity: 65.7%) and 307 false-negatives. Furthermore, 97 true rib fractures were detected that were not mentioned in the written CT reports. A major factor associated with correct detection was displacement.
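The per-finding figures are internally consistent and can be checked with a few lines of arithmetic. A minimal sketch using the counts stated in the abstract (the variable names are ours, not the study's):

```python
# Check the abstract's per-finding metrics from its raw counts.
tp_findings = 587   # true-positive rib fracture findings
fn_findings = 307   # fractures in the reports that the algorithm missed
fp_findings = 81    # false-positive findings across all scans
n_processed = 510   # successfully processed examinations

per_finding_sens = tp_findings / (tp_findings + fn_findings)
fp_per_exam = fp_findings / n_processed

print(f"per-finding sensitivity: {per_finding_sens:.1%}")   # 65.7%
print(f"false-positives per exam: {fp_per_exam:.2f}")       # 0.16
```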
Conclusion
We found good performance of a deep learning-based prototype algorithm for detecting rib fractures on trauma CT on a per-examination level, at a low rate of false-positives per case. A potential area for clinical application is its use as a screening tool to avoid false-negative radiology reports.
Artificial intelligence (AI) is a powerful tool for image analysis that is increasingly being evaluated by radiology professionals. However, because these methods were developed for the analysis of nonmedical image data, and because the data structure in radiology departments is not "AI ready", implementing AI in radiology is not straightforward. The purpose of this review is to guide the reader through the pipeline of an AI project for automated image analysis in radiology and thereby to encourage its implementation in radiology departments. At the same time, this review aims to enable readers to critically appraise articles on AI-based software in radiology.
Objectives
To evaluate the performance of a deep convolutional neural network (DCNN) in detecting and classifying distal radius fractures, metal, and cast on radiographs using labels based on radiology reports. The secondary aim was to evaluate the effect of the training set size on the algorithm’s performance.
Methods
A total of 15,775 frontal and lateral radiographs, the corresponding radiology reports, and a ResNet18 DCNN were used. Fracture detection and classification models were developed per view and merged. Incrementally sized subsets served to evaluate the effect of the training set size. Two musculoskeletal radiologists set the standard of reference on the radiographs (test set A). A subset (B) was rated by three radiology residents. For a per-study comparison with the radiology residents, the results of the best models were merged. Statistical analysis used receiver operating characteristic (ROC) curves with the area under the curve (AUC), Youden’s J statistic (J), and Spearman’s correlation coefficient (ρ).
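Spearman’s ρ, used below to relate training set size to AUC, is the Pearson correlation computed on ranks. A minimal pure-Python sketch; the subset sizes and AUC values are invented for illustration and are not the study’s data:

```python
# Spearman's rank correlation: rank both variables (averaging ties),
# then compute the Pearson correlation of the ranks.

def rank(values):
    """1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

train_sizes = [1000, 2000, 4000, 8000, 15775]   # illustrative subset sizes
aucs = [0.81, 0.86, 0.90, 0.95, 0.98]           # illustrative AUC values
print(round(spearman(train_sizes, aucs), 3))    # 1.0 for a monotone relation
```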
Results
The models’ AUC/J on (A) for metal and cast were 0.99/0.98 and 1.0/1.0. The models’ and residents’ AUC/J on (B) were similar on fracture (0.98/0.91; 0.98/0.92) and multiple fragments (0.85/0.58; 0.91/0.70). Training set size and AUC correlated on metal (ρ = 0.740), cast (ρ = 0.722), fracture (frontal ρ = 0.947, lateral ρ = 0.946), multiple fragments (frontal ρ = 0.856), and fragment displacement (frontal ρ = 0.595).
Conclusions
The models, trained on a DCNN with report-based labels, are suitable as a secondary reading aid for detecting distal radius fractures on radiographs; the models for fracture classification are not ready for clinical use. Bigger training sets lead to better models in all categories except joint affection.
Key Points
• Detection of metal and cast on radiographs is excellent using AI and labels extracted from radiology reports.
• Automatic detection of distal radius fractures on radiographs is feasible, and the performance approximates that of radiology residents.
• Automatic classification of the type of distal radius fracture varies in accuracy and is inferior for joint involvement and fragment displacement.
Objectives
To investigate the most common errors in residents’ preliminary reports, to determine whether structured reporting impacts error types and frequencies, and to identify possible implications for resident education and patient safety.
Material and methods
Changes in report content were tracked on a word level by a report comparison tool and extracted for 78,625 radiology reports dictated from September 2017 to December 2018 in our department. After data aggregation according to word stems and stratification by subspecialty (e.g., neuroradiology) and imaging modality, the frequencies of additions/deletions were analyzed separately for the findings and impression report sections and compared between subgroups.
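Word-level change tracking of this kind can be sketched with Python’s difflib as a stand-in for the comparison tool described above (the study’s actual tool is not specified here); the example reports are invented:

```python
# Count word-level additions and deletions between a resident's
# preliminary report and the staff-approved final version.
import difflib

def word_changes(preliminary: str, final: str):
    """Return (deleted_words, added_words) between the two report versions."""
    sm = difflib.SequenceMatcher(a=preliminary.split(), b=final.split())
    deleted, added = [], []
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op in ("delete", "replace"):
            deleted += sm.a[a1:a2]
        if op in ("insert", "replace"):
            added += sm.b[b1:b2]
    return deleted, added

prelim = "no fracture of the right distal radius"
final = "no fracture of the left distal radius"
print(word_changes(prelim, final))  # (['right'], ['left'])
```

A laterality swap like this one surfaces as exactly one deletion and one addition, which is how such corrections can be counted at scale.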
Results
Overall modifications per report averaged 4.1 words, with markedly more changes for cross-sectional imaging (CT: 6.4; MRI: 6.7) than for non-cross-sectional imaging (radiographs: 0.2; ultrasound: 2.8). The four most frequently changed words (right, left, one, and none) were almost identical among all subgroups (range: 0.072–0.117 per report; once every 9–14 reports). Albeit representing only 0.02% of analyzed words, they accounted for up to 9.7% of all observed changes. Subspecialties solely using structured reporting had substantially lower change ratios in the findings report section (mean: 0.2 per report) compared with prose-style reporting subspecialties (mean: 2.0). The relative frequencies of the most changed words remained unchanged.
Conclusion
Residents’ most common reporting errors in all subspecialties and modalities are laterality discriminator confusions (left/right) and unnoticed descriptor misregistration by speech recognition (one/none). Structured reporting reduces overall error rates, but does not affect occurrence of the most common errors. Increased error awareness and measures improving report correctness and ensuring patient safety are required.
Key Points
• The two most common reporting errors in residents’ preliminary reports are laterality discriminator confusions (left/right) and unnoticed descriptor misregistration by speech recognition (one/none).
• Structured reporting reduces the overall error frequency in the findings report section by a factor of 10 (structured reporting: mean 0.2 per report; prose-style reporting: 2.0) but does not affect the occurrence of the two major errors.
• Staff radiologist review behavior noticeably differs between radiology subspecialties.
Purpose
To establish thresholds for contrast enhancement-based attenuation (CM) and iodine concentration (IOD) for the quantitative evaluation of enhancement in renal lesions on single-phase split-filter dual-energy CT (tbDECT), and to combine both measurements in a machine learning algorithm to potentially improve performance.
Material and methods
126 patients with incidental renal cysts (both hypo- and hyperdense cysts) or high suspicion for renal cell carcinoma (312 lesions in total) undergoing abdominal, portal venous phase tbDECT were initially included in this retrospective study. The gold standard was pathological confirmation or follow-up imaging (MRI or multiphasic CT). CM, IOD, and ROI size were recorded. Thresholds for CM and IOD were identified using the Youden index of the empirical ROC curves. A decision tree classifier (DTC) and a random forest classifier (RFC) were trained. Sensitivities, specificities, and AUCs were compared using the McNemar and DeLong tests.
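Threshold selection via the Youden index (J = sensitivity + specificity − 1) can be sketched as follows; the iodine values and labels below are invented for illustration, and the helper function is ours, not the study’s implementation:

```python
# Pick the cut-off maximizing Youden's J over the empirical ROC.

def youden_threshold(values, labels):
    """labels: True = enhancing lesion, False = non-enhancing."""
    pos = [v for v, l in zip(values, labels) if l]
    neg = [v for v, l in zip(values, labels) if not l]
    best_t, best_j = None, -1.0
    for t in sorted(set(values)):           # candidate thresholds
        sens = sum(v >= t for v in pos) / len(pos)
        spec = sum(v < t for v in neg) / len(neg)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# hypothetical iodine concentrations (mg/ml) with enhancement labels
iod = [0.2, 0.4, 0.5, 0.9, 1.0, 1.3, 1.8, 2.4]
enh = [False, False, False, False, True, True, True, True]
print(youden_threshold(iod, enh))  # (1.0, 1.0): perfect split at 1.0 mg/ml
```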
Results
The final study cohort comprised 40 enhancing and 113 non-enhancing renal lesions. Optimal thresholds for quantitative iodine measurements and contrast enhancement-based attenuation were 1.0 ± 0.0 mg/ml and 23.6 ± 0.3 HU, respectively. Single DECT parameters (IOD, CM) showed similar overall performance, with an AUC of 0.894 and 0.858 (p = 0.541) (sensitivity 90% and 80%, specificity 88% and 92%, respectively). While overall performance for the DTC (AUC 0.944) was higher than for the RFC (AUC 0.886), neither this difference (p = 0.409) nor the comparison to CM (p = 0.243) and IOD (p = 0.353) was statistically significant.
Conclusions
Enhancement in incidental renal lesions on single-phase tbDECT can be classified with up to 87.5% sensitivity and 94.6% specificity. Algorithms combining DECT parameters did not increase overall performance.
Artificial intelligence can assist in cardiac image interpretation. Here, we achieved a substantial reduction in the time required to read a cardiovascular magnetic resonance (CMR) study to estimate left atrial (LA) volume without compromising accuracy or reliability. Rather than deploying a fully automatic black box, we propose to incorporate the automated LA volumetry into a human-centric, interactive image-analysis process.
Atri-U, an automated data analysis pipeline for long-axis cardiac cine images, computes the atrial volume by: (i) detecting the end-systolic frame, (ii) outlining the endocardial borders of the LA, and (iii) localizing the mitral annular hinge points and constructing the longitudinal atrial diameters, equivalent to the usual workup done by clinicians. In every step human interaction is possible, such that the results provided by the algorithm can be accepted, corrected, or redone from scratch. Atri-U was trained and evaluated retrospectively on a sample of 300 patients and then applied to a consecutive clinical sample of 150 patients with various heart conditions. The agreement of the indexed LA volume between Atri-U and two experts was similar to the inter-rater agreement between clinicians (average overestimation of 0.8 mL/m² with upper and lower limits of agreement of −7.5 and 5.8 mL/m², respectively). An expert cardiologist blinded to the origin of the annotations rated the outputs produced by Atri-U as acceptable in 97% of cases for step (i), 94% for step (ii), and 95% for step (iii), which was slightly lower than the acceptance rate of the outputs produced by a human expert radiologist in the same cases (92%, 100%, and 100%, respectively). The assistance of Atri-U led to an expected reduction in reading time of 66%, from 105 to 34 s, in our in-house clinical setting.
Our proposal enables automated calculation of the maximum LA volume approaching human accuracy and precision. The optional user interaction is possible at each processing step. As such, the assisted process sped up the routine CMR workflow by providing accurate, precise, and validated measurement results.
Purpose
Thoracic aortic (TA) dilatation (TAD) is a risk factor for acute aortic syndrome and must therefore be noted in every CT report. However, the complex anatomy of the thoracic aorta impedes TAD detection. We investigated the performance of a deep learning (DL) prototype, built to measure TA diameters, as a secondary reading tool in a large-scale cohort.
Material and methods
Consecutive contrast-enhanced (CE) and non-CE chest CT exams with “normal” TA diameters according to their radiology reports were included. The DL prototype (AIRad, Siemens Healthineers, Germany) measured the TA at nine locations according to AHA guidelines. Dilatation was defined as >45 mm at the aortic sinus, sinotubular junction (STJ), ascending aorta (AA), and proximal arch, and as >40 mm from the mid arch to the abdominal aorta. A cardiovascular radiologist reviewed all cases with TAD according to AIRad. Multivariable logistic regression (MLR) was used to identify factors (demographics and scan parameters) associated with TAD classification by AIRad.
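The location-dependent dilatation rule above can be sketched as a small helper; the location keys and measurements are our own illustrative assumptions, not AIRad’s interface:

```python
# Dilatation rule: > 45 mm at sinus, STJ, ascending aorta and proximal
# arch; > 40 mm from the mid arch onward.

PROXIMAL = {"sinus", "stj", "ascending", "proximal_arch"}

def is_dilated(location: str, diameter_mm: float) -> bool:
    limit = 45.0 if location in PROXIMAL else 40.0
    return diameter_mm > limit

# hypothetical per-location diameters for one exam
measurements = {"ascending": 46.2, "mid_arch": 38.5, "descending": 41.0}
flagged = {loc: d for loc, d in measurements.items() if is_dilated(loc, d)}
print(flagged)  # {'ascending': 46.2, 'descending': 41.0}
```

Note the thresholds are strict inequalities here, so a 45.0-mm ascending aorta would not be flagged; whether AIRad treats the boundary inclusively is not stated in the abstract.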
Results
18,243 CT scans (45.7% female) were successfully analyzed by AIRad. Mean age was 62.3 ± 15.9 years, and 12,092 (66.3%) were CE scans. AIRad confirmed normal diameters in 17,239 exams (94.5%) and reported TAD in 1,004/18,243 exams (5.5%). Review confirmed the TAD classification in 452/1,004 exams (45.0%; 2.5% of all exams); 552 cases were false-positive, but identification was easily possible using the visual outputs of AIRad. MLR revealed that the following factors were significantly associated with correct TAD classification by AIRad: TAD reported at the AA (odds ratio [OR]: 1.12, p < 0.001) and at the STJ (OR: 1.09, p = 0.002), TAD found at more than one location (OR: 1.42, p = 0.008), CE exams (OR: 2.1–3.1, p < 0.05), male sex (OR: 2.4, p = 0.003), and higher BMI (OR: 1.05, p = 0.01). Overall, 17,691/18,243 (97.0%) exams were correctly classified.
Conclusions
AIRad correctly assessed the presence or absence of TAD in 17,691 exams (97%), including 452 cases with previously missed TAD, independent of the contrast protocol. These findings suggest its usefulness as a secondary reading tool that can improve report quality and efficiency.