In this study, we systematically investigate the impact of class imbalance on the classification performance of convolutional neural networks (CNNs) and compare frequently used methods to address the issue. Class imbalance is a common problem that has been comprehensively studied in classical machine learning, yet very limited systematic research is available in the context of deep learning. In our study, we use three benchmark datasets of increasing complexity, MNIST, CIFAR-10 and ImageNet, to investigate the effects of imbalance on classification and perform an extensive comparison of several methods to address the issue: oversampling, undersampling, two-phase training, and thresholding that compensates for prior class probabilities. Our main evaluation metric is the area under the receiver operating characteristic curve (ROC AUC) adjusted to multi-class tasks, since the overall accuracy metric is associated with notable difficulties in the context of imbalanced data. Based on results from our experiments we conclude that (i) the effect of class imbalance on classification performance is detrimental; (ii) the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling; (iii) oversampling should be applied to the level that completely eliminates the imbalance, whereas the optimal undersampling ratio depends on the extent of imbalance; (iv) as opposed to some classical machine learning models, oversampling does not cause overfitting of CNNs; (v) thresholding should be applied to compensate for prior class probabilities when the overall number of properly classified cases is of interest.
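Two of the methods compared above, oversampling to full balance and thresholding that compensates for prior class probabilities, can be illustrated with a minimal sketch. The function names, the toy labels, and the example probabilities are hypothetical and chosen only for illustration; the sketch assumes random duplication of minority-class examples and division of each class output by its training-set prior, and it is not the implementation used in the study.

```python
import random
from collections import Counter


def oversample_to_balance(labels, seed=0):
    """Return example indices in which every class appears as often as the
    largest class, by randomly duplicating minority-class indices."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        # Draw extra copies until the class reaches the target size.
        balanced.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    return balanced


def threshold_by_priors(probs, priors):
    """Divide each class output by its (assumed) training prior and
    return the index of the largest adjusted score."""
    adjusted = [p / q for p, q in zip(probs, priors)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy imbalanced label set: 8 examples of class 0, 2 of class 1.
labels = [0] * 8 + [1] * 2
balanced = oversample_to_balance(labels)
counts = Counter(labels[i] for i in balanced)

# A network output favouring the majority class flips after prior correction.
priors = [0.8, 0.2]
pred = threshold_by_priors([0.7, 0.3], priors)
```

In this toy case the unadjusted argmax would pick class 0, while the prior-compensated decision picks class 1, because 0.3 / 0.2 exceeds 0.7 / 0.8.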
Deep learning has shown tremendous potential in the task of object detection in images. However, a common challenge with this task is when only a limited number of images containing the object of interest are available. This is a particular issue in cancer screening, such as digital breast tomosynthesis (DBT), where less than 1% of cases contain cancer. In this study, we propose a method to train an inpainting generative adversarial network to be used for cancer detection using only images that do not contain cancer. During inference, we removed a part of the image and used the network to complete the removed part. A significant error in completing an image part was considered an indication that such location is unexpected and thus abnormal. A large dataset of DBT images used in this study was collected at Duke University. It consisted of 19,230 reconstructed volumes from 4348 patients. Cancerous masses and architectural distortions were marked with bounding boxes by radiologists. Our experiments showed that the locations containing cancer were associated with a notably higher completion error than the non-cancer locations (mean error ratio of 2.77). All data used in this study has been made publicly available by the authors.
Breast cancer screening is among the most common radiological tasks, with more than 39 million examinations performed each year. While it has been among the most studied medical imaging applications of artificial intelligence, the development and evaluation of algorithms are hindered by the lack of well-annotated, large-scale publicly available data sets.
To curate, annotate, and make publicly available a large-scale data set of digital breast tomosynthesis (DBT) images to facilitate the development and evaluation of artificial intelligence algorithms for breast cancer screening; to develop a baseline deep learning model for breast cancer detection; and to test this model using the data set to serve as a baseline for future research.
In this diagnostic study, 16 802 DBT examinations with at least 1 reconstruction view available, performed between August 26, 2014, and January 29, 2018, were obtained from Duke Health System and analyzed. From the initial cohort, examinations were divided into 4 groups and split into training and test sets for the development and evaluation of a deep learning model. Images with foreign objects or spot compression views were excluded. Data analysis was conducted from January 2018 to October 2020.
Screening DBT.
The detection algorithm was evaluated with breast-based free-response receiver operating characteristic curve and sensitivity at 2 false positives per volume.
The curated data set contained 22 032 reconstructed DBT volumes that belonged to 5610 studies from 5060 patients with a mean (SD) age of 55 (11) years and 5059 (100.0%) women. This included 4 groups of studies: (1) 5129 (91.4%) normal studies; (2) 280 (5.0%) actionable studies, for which additional imaging was needed but no biopsy was performed; (3) 112 (2.0%) benign biopsied studies; and (4) 89 studies (1.6%) with cancer. Our data set included masses and architectural distortions that were annotated by 2 experienced radiologists. Our deep learning model reached breast-based sensitivity of 65% (39 of 60; 95% CI, 56%-74%) at 2 false positives per DBT volume on a test set of 460 examinations from 418 patients.
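The operating point reported here, breast-based sensitivity at 2 false positives per volume, can be read off a free-response curve roughly as in the sketch below. It assumes that each detection has already been matched to an annotated breast (or marked as a false positive) and that a breast counts as detected once any of its lesions is hit; the function name, the input layout, and the toy detections are hypothetical and are not taken from the study.

```python
def sensitivity_at_fp_rate(detections, n_positive_breasts, n_volumes,
                           max_fp_per_volume=2.0):
    """detections: list of (score, breast_id), where breast_id identifies the
    lesion-containing breast that was hit, or is None for a false positive.
    Returns the highest sensitivity reachable while the number of false
    positives per volume stays at or below max_fp_per_volume."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hit, fp, best = set(), 0, 0.0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Detections sharing a score enter together at the same threshold.
        while i < len(ranked) and ranked[i][0] == score:
            breast = ranked[i][1]
            if breast is None:
                fp += 1
            else:
                hit.add(breast)
            i += 1
        if fp / n_volumes > max_fp_per_volume:
            break
        best = len(hit) / n_positive_breasts
    return best


# Toy example: 2 volumes, 3 breasts with lesions ("A", "B", "C").
toy = [(0.9, "A"), (0.8, None), (0.7, None), (0.6, "B"),
       (0.5, None), (0.4, None), (0.3, None), (0.2, "C")]
sens = sensitivity_at_fp_rate(toy, n_positive_breasts=3, n_volumes=2)
```

In the toy data, breast "C" is only found after a fifth false positive, which exceeds 2 per volume, so two of the three breasts count as detected.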
The large, diverse, and curated data set presented in this study could facilitate the development and evaluation of artificial intelligence algorithms for breast cancer screening by providing data for training as well as a common set of cases for model validation. The performance of the model developed in this study showed that the task remains challenging; its performance could serve as a baseline for future model development.
Recent analysis identified distinct genomic subtypes of lower-grade glioma tumors which are associated with shape features. In this study, we propose a fully automatic way to quantify tumor imaging characteristics using deep learning-based segmentation and test whether these characteristics are predictive of tumor genomic subtypes.
We used preoperative imaging and genomic data of 110 patients from 5 institutions with lower-grade gliomas from The Cancer Genome Atlas. Based on automatic deep learning segmentations, we extracted three features which quantify two-dimensional and three-dimensional characteristics of the tumors. Genomic data for the analyzed cohort of patients consisted of previously identified genomic clusters based on IDH mutation and 1p/19q co-deletion, DNA methylation, gene expression, DNA copy number, and microRNA expression. To analyze the relationship between the imaging features and genomic clusters, we conducted the Fisher exact test for 10 hypotheses for each pair of imaging feature and genomic subtype. To account for multiple hypothesis testing, we applied a Bonferroni correction. P-values lower than 0.005 were considered statistically significant.
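The significance threshold above follows from the Bonferroni correction: with 10 hypotheses, a threshold of 0.005 corresponds to a family-wise level of 0.05 (the 0.05 is inferred from those two numbers, not stated in the text). The sketch below shows the correction together with a two-sided Fisher exact test restricted to 2×2 tables, which is a simplification since the study's contingency tables may be larger. The function names, the example table, and the p-values are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 10)   # 10 hypotheses, as in the text
p_example = fisher_exact_2x2(3, 1, 1, 3)     # hypothetical 2x2 table
hypothetical_p_values = [0.0002, 0.004, 0.02]
significant = [p < threshold for p in hypothetical_p_values]
```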
We found the strongest association between RNASeq clusters and the bounding ellipsoid volume ratio (p < 0.0002) and between RNASeq clusters and margin fluctuation (p < 0.005). In addition, we identified associations between bounding ellipsoid volume ratio and all tested molecular subtypes (p < 0.02) as well as between angular standard deviation and RNASeq cluster (p < 0.02). In terms of automatic tumor segmentation that was used to generate the quantitative image characteristics, our deep learning algorithm achieved a mean Dice coefficient of 82%, which is comparable to human performance.
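For reference, the Dice coefficient used to score the segmentation is twice the overlap between the predicted and reference masks divided by the sum of their sizes. A minimal sketch on flattened binary masks follows; the function name and the toy masks are hypothetical, and returning 1.0 for two empty masks is a convention assumed here.

```python
def dice_coefficient(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match (assumed convention).
    return 1.0 if total == 0 else 2 * overlap / total


# Toy masks: 3 predicted voxels, 3 reference voxels, 2 in common.
pred_mask = [1, 1, 1, 0, 0]
true_mask = [0, 1, 1, 1, 0]
score = dice_coefficient(pred_mask, true_mask)
```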
•Quantification of tumor shape features using deep learning segmentation.
•Genomic subtypes of lower-grade gliomas predicted from magnetic resonance images.
•LGG RNASeq cluster R2 is associated with tumors of notably higher irregularity.
Deep learning is a branch of artificial intelligence where networks of simple interconnected units are used to extract patterns from data in order to solve complex problems. Deep‐learning algorithms have shown groundbreaking performance in a variety of sophisticated tasks, especially those related to images. They have often matched or exceeded human performance. Since the medical field of radiology mainly relies on extracting useful information from images, it is a very natural application area for deep learning, and research in this area has rapidly grown in recent years. In this article, we discuss the general context of radiology and opportunities for application of deep‐learning algorithms. We also introduce basic concepts of deep learning, including convolutional neural networks. Then, we present a survey of the research in deep learning applied to radiology. We organize the studies by the types of specific tasks that they attempt to solve and review a broad range of deep‐learning algorithms being utilized. Finally, we briefly discuss opportunities and challenges for incorporating deep learning in the radiology practice of the future.
Level of Evidence: 3
Technical Efficacy: Stage 1
J. Magn. Reson. Imaging 2019;49:939–954.
In this paper, we showed that deep learning is a highly promising tool that can be applied in the interpretation of thyroid ultrasound and decision making process for thyroid nodules.
The purpose is to apply a previously validated deep learning algorithm to a new thyroid nodule ultrasound image dataset and compare its performance with that of radiologists.
A prior study presented an algorithm that detects thyroid nodules and then classifies them as malignant or benign from two ultrasound images. A multi-task deep convolutional neural network was trained on 1278 nodules and originally tested on 99 separate nodules. The results were comparable with those of radiologists. The algorithm was further tested on 378 nodules imaged with ultrasound machines from manufacturers and of product types different from those of the training cases. Four experienced radiologists were asked to evaluate the nodules for comparison with deep learning.
The area under the curve (AUC) of the deep learning algorithm and of each of the four radiologists was calculated with parametric, binormal estimation. For the deep learning algorithm, the AUC was 0.69 (95% CI: 0.64–0.75). The AUCs of the radiologists were 0.63 (95% CI: 0.59–0.67), 0.66 (95% CI: 0.61–0.71), 0.65 (95% CI: 0.60–0.70), and 0.63 (95% CI: 0.58–0.67).
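As a point of reference for how an AUC summarizes reader or algorithm scores, the sketch below computes the nonparametric (Mann-Whitney) AUC: the fraction of malignant/benign pairs in which the malignant case receives the higher score, with ties counted as one half. This is a different estimator from the parametric, binormal estimation used in the study and is shown only as a simple illustration; the function name and the scores are hypothetical.

```python
def empirical_auc(scores_malignant, scores_benign):
    """Nonparametric AUC: probability that a malignant case is scored
    higher than a benign case, counting ties as 0.5."""
    wins = sum((m > b) + 0.5 * (m == b)
               for m in scores_malignant
               for b in scores_benign)
    return wins / (len(scores_malignant) * len(scores_benign))


# Hypothetical malignancy scores for three malignant and three benign nodules.
auc = empirical_auc([0.9, 0.6, 0.4], [0.7, 0.3, 0.4])
```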
In the new testing dataset, the deep learning algorithm achieved performance similar to that of all four radiologists. The relative performance difference between the algorithm and the radiologists was not significantly affected by the difference in ultrasound scanner.
•In a new US dataset, our DL algorithm achieved performance similar to that of radiologists in thyroid nodule classification.
•The new dataset consists of 378 nodules; over 95% are from manufacturers that account for less than 6% of the training dataset.
•The new dataset has a much higher percentage of malignancy (39%) than both the training dataset and the previous testing set.
•Differences between some US scanner types have a significant impact on the performance of both radiologists and the DL algorithm.
•Differences between US scanner types do not have a significant impact on the performance difference between radiologists and the DL algorithm.