The goals of this review paper on deep learning (DL) in medical imaging and radiation therapy are to (a) summarize what has been achieved to date; (b) identify common and unique challenges, and strategies that researchers have taken to address these challenges; and (c) identify some of the promising avenues for the future both in terms of applications as well as technical innovations. We introduce the general principles of DL and convolutional neural networks, survey five major areas of application of DL in medical imaging and radiation therapy, identify common themes, discuss methods for dataset expansion, and conclude by summarizing lessons learned, remaining challenges, and future directions.
The performance of a classifier is largely dependent on the size and representativeness of data used for its training. In circumstances where accumulation and/or labeling of training samples is difficult or expensive, such as medical applications, data augmentation can potentially be used to alleviate the limitations of small datasets. We have previously developed an image blending tool that allows users to modify or supplement an existing CT or mammography dataset by seamlessly inserting a lesion extracted from a source image into a target image. This tool also provides the option to apply various types of transformations to different properties of the lesion prior to its insertion into a new location. In this study, we used this tool to create synthetic samples that appear realistic in chest CT. We then augmented different size training sets with these artificial samples, and investigated the effect of the augmentation on training various classifiers for the detection of lung nodules. Our results indicate that the proposed lesion insertion method can improve classifier performance for small training datasets, and thereby help reduce the need to acquire and label actual patient data.
Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so the clinical meaning of these scores is unclear. Calibration of classifier scores to a meaningful scale such as the probability of disease is potentially useful when such scores are used by a physician. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability of disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability of disease scale without affecting their discrimination performance. With a finite dataset to train the calibration function, it is important to accompany the probability estimate with its confidence interval. Our simulations indicate that, when the dataset used to find the calibration transformation is also used to estimate calibration performance, a resubstitution bias exists for performance metrics that involve the truth states. However, the bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).
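The claim that calibration leaves discrimination performance unchanged can be illustrated with a minimal sketch. It assumes a simple logistic calibration curve fitted by Newton's method, which is only one possible parametric choice and is not necessarily the parametric method studied in this work. The function names, the simulated score distributions, and the sample sizes are all hypothetical. Because the fitted mapping is strictly increasing, the rank order of the scores, and therefore the AUC, is preserved.

```python
import numpy as np

def auc(scores, labels):
    """Mann-Whitney estimate of the area under the ROC curve."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def fit_logistic_calibration(scores, labels, n_iter=50):
    """Fit p = sigmoid(slope * z + intercept), z = standardized score.

    Assumption: a logistic curve is used purely for illustration.
    """
    mu, sd = scores.mean(), scores.std()
    z = (scores - mu) / sd
    X = np.column_stack([z, np.ones_like(z)])
    w = np.zeros(2)
    for _ in range(n_iter):  # Newton's method on the log-likelihood
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - labels)
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + 1e-8 * np.eye(2)
        w = w - np.linalg.solve(hess, grad)
    return {"slope": w[0], "intercept": w[1], "mu": mu, "sd": sd}

def calibrate(scores, cal):
    """Map arbitrary-scale scores to estimated probabilities of disease."""
    z = (scores - cal["mu"]) / cal["sd"]
    return 1.0 / (1.0 + np.exp(-(cal["slope"] * z + cal["intercept"])))

# Hypothetical scores on an arbitrary scale, 200 cases per class.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
scores = np.concatenate([rng.normal(40, 10, 200), rng.normal(55, 10, 200)])

cal = fit_logistic_calibration(scores, labels)
probs = calibrate(scores, cal)
auc_raw, auc_cal = auc(scores, labels), auc(probs, labels)
```

Note that this sketch fits and evaluates on the same data, so any calibration metric computed from `probs` and `labels` would be subject to the resubstitution bias described above.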
Purpose
Multiview two‐dimensional (2D) convolutional neural networks (CNNs) and three‐dimensional (3D) CNNs have been successfully used for analyzing volumetric data in many state‐of‐the‐art medical imaging applications. We propose an alternative modular framework that analyzes volumetric data with an approach that is analogous to radiologists’ interpretation, and apply the framework to reduce false positives that are generated in computer‐aided detection (CADe) systems for pulmonary nodules in thoracic computed tomography (CT) scans.
Methods
In our approach, a deep network consisting of 2D CNNs first processes slices individually. The features extracted in this stage are then passed to a recurrent neural network (RNN), thereby modeling consecutive slices as a sequence of temporal data and capturing the contextual information across all three dimensions in the volume of interest. Outputs of the RNN layer are weighted before the final fully connected layer, enabling the network to scale the importance of different slices within a volume of interest in an end‐to‐end training framework.
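The data flow described above (per-slice features, a recurrent pass over the slices, weighting of the recurrent outputs, then a final fully connected layer) can be sketched as a forward pass in NumPy. This is an illustration under several assumptions, not the architecture used in the study: the 2D CNN is replaced by precomputed random feature vectors, the recurrent layer is a plain tanh RNN, the slice weighting is a softmax over learned scores, and all parameter names and dimensions are hypothetical.

```python
import numpy as np

def forward(slice_features, params):
    """Score one volume of interest from its per-slice feature vectors.

    slice_features: array of shape (n_slices, d_feat), a stand-in for
    the output of a 2D CNN applied to each slice (assumption).
    """
    h = np.zeros(params["Wh"].shape[0])
    outputs = []
    for x in slice_features:  # treat consecutive slices as a sequence
        h = np.tanh(params["Wx"] @ x + params["Wh"] @ h + params["b"])
        outputs.append(h)
    H = np.stack(outputs)  # (n_slices, d_hidden)

    # Weight each slice's RNN output; softmax weighting is an assumption.
    logits = H @ params["v"]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    pooled = weights @ H  # weighted sum over slices

    # Final fully connected layer with a sigmoid output.
    score = 1.0 / (1.0 + np.exp(-(params["Wo"] @ pooled + params["bo"])))
    return float(score), weights

# Hypothetical sizes: 9 slices, 32 features per slice, 16 hidden units.
rng = np.random.default_rng(0)
n_slices, d_feat, d_hidden = 9, 32, 16
params = {
    "Wx": rng.normal(0, 0.1, (d_hidden, d_feat)),
    "Wh": rng.normal(0, 0.1, (d_hidden, d_hidden)),
    "b": np.zeros(d_hidden),
    "v": rng.normal(0, 0.1, d_hidden),
    "Wo": rng.normal(0, 0.1, d_hidden),
    "bo": 0.0,
}
score, weights = forward(rng.normal(size=(n_slices, d_feat)), params)
```

In an end-to-end setting all of these parameters, together with the 2D CNN, would be trained jointly; here they are random and the output only demonstrates shapes and data flow.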
Results
We validated the proposed architecture on the false positive reduction track of the lung nodule analysis (LUNA) challenge for pulmonary nodule detection in chest CT scans, and obtained competitive results compared to 3D CNNs. Our results show that the proposed approach can encode the 3D information in volumetric data effectively by achieving a sensitivity >0.8 with just 1/8 false positives per scan.
Conclusions
Our experimental results demonstrate the effectiveness of temporal analysis of volumetric images for the application of false positive reduction in chest CT scans and show that state‐of‐the‐art 2D architectures from the literature can be directly applied to analyzing volumetric medical data. As newer and better 2D architectures are being developed at a much faster rate compared to 3D architectures, our approach makes it easy to obtain state‐of‐the‐art performance on volumetric data using new 2D architectures.
In pathology, immunohistochemical (IHC) staining of tissue sections is regularly used to diagnose and grade malignant tumors. Typically, IHC stain interpretation is rendered by a trained pathologist using a manual method, which consists of counting each positively- and negatively-stained cell under a microscope. The manual enumeration suffers from poor reproducibility even in the hands of expert pathologists. To facilitate this process, we propose a novel method to create artificial datasets with known ground truth, which allows us to analyze the recall, precision, accuracy, and intra- and inter-observer variability in a systematic manner, enabling us to compare different computer analysis approaches. Our method employs a conditional Generative Adversarial Network that uses a database of Ki67 stained tissues of breast cancer patients to generate synthetic digital slides. Our experiments show that synthetic images are indistinguishable from real images. Six readers (three pathologists and three image analysts) tried to differentiate 15 real from 15 synthetic images and the probability that the average reader would be able to correctly classify an image as synthetic or real more than 50% of the time was only 44.7%.
In a practical classifier design problem, the true population is generally unknown and the available sample is finite-sized. A common approach is to use a resampling technique to estimate the performance of the classifier that will be trained with the available sample. We conducted a Monte Carlo simulation study to compare the ability of different resampling techniques in training the classifier and predicting its performance under the constraint of a finite-sized sample. The true population for the two classes was assumed to be multivariate normal distributions with known covariance matrices. Finite sets of sample vectors were drawn from the population. The true performance of the classifier is defined as the area under the receiver operating characteristic curve (AUC) when the classifier designed with the specific sample is applied to the true population. We investigated methods based on the Fukunaga–Hayes and the leave-one-out techniques, as well as three different types of bootstrap methods, namely, the ordinary, 0.632, and 0.632+ bootstrap. Fisher’s linear discriminant analysis was used as the classifier. The dimensionality of the feature space was varied from 3 to 15. The sample size $n_2$ from the positive class was varied between 25 and 60, while the number of cases from the negative class was either equal to $n_2$ or $3n_2$. Each experiment was performed with an independent dataset randomly drawn from the true population. Using a total of 1000 experiments for each simulation condition, we compared the bias, the variance, and the root-mean-squared error (RMSE) of the AUC estimated using the different resampling techniques relative to the true AUC (obtained from training on a finite dataset and testing on the population). Our results indicated that, under the study conditions, there can be a large difference in the RMSE obtained using different resampling methods, especially when the feature space dimensionality is relatively large and the sample size is small. Under these conditions, the 0.632 and 0.632+ bootstrap methods have the lowest RMSE, indicating that the difference between the estimated and the true performances obtained using the 0.632 and 0.632+ bootstrap will be statistically smaller than those obtained using the other three resampling methods. Of the three bootstrap methods, the 0.632+ bootstrap provides the lowest bias. Although this investigation is performed under some specific conditions, it reveals important trends for the problem of classifier performance prediction under the constraint of a limited dataset.
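As a rough illustration of the 0.632 bootstrap applied to AUC, the sketch below combines the resubstitution AUC with the average out-of-bag AUC using the usual 0.368/0.632 weights. It is a minimal sketch under stated assumptions, not the simulation code of the study: the function names, the stratified resampling, the number of bootstrap replicates, and the simulated data (5 features, 30 cases per class, identity covariance) are all hypothetical choices, and the 0.632+ correction is deliberately omitted.

```python
import numpy as np

def auc(scores, labels):
    """Mann-Whitney estimate of the area under the ROC curve."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def fit_lda(X, y):
    """Fisher's linear discriminant: weights = pooled_cov^-1 (m1 - m0)."""
    X0, X1 = X[y == 0], X[y == 1]
    pooled = ((len(X0) - 1) * np.cov(X0, rowvar=False)
              + (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X) - 2)
    return np.linalg.solve(pooled, X1.mean(axis=0) - X0.mean(axis=0))

def bootstrap_632_auc(X, y, n_boot=200, seed=0):
    """Return (resubstitution AUC, out-of-bag AUC, 0.632 estimate)."""
    rng = np.random.default_rng(seed)
    auc_resub = auc(X @ fit_lda(X, y), y)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    oob_aucs = []
    for _ in range(n_boot):
        # Resample each class separately (stratification is an assumption).
        boot = np.concatenate([rng.choice(idx0, len(idx0)),
                               rng.choice(idx1, len(idx1))])
        oob = np.setdiff1d(np.arange(len(y)), boot)
        if len(np.unique(y[oob])) < 2:
            continue  # AUC is undefined without both classes
        w = fit_lda(X[boot], y[boot])
        oob_aucs.append(auc(X[oob] @ w, y[oob]))
    auc_oob = float(np.mean(oob_aucs))
    return auc_resub, auc_oob, 0.368 * auc_resub + 0.632 * auc_oob

# Hypothetical sample: two multivariate normal classes in 5 dimensions.
rng = np.random.default_rng(1)
n_per_class, dim = 30, 5
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, dim)),
               rng.normal(0.8, 1.0, (n_per_class, dim))])
y = np.repeat([0, 1], n_per_class)
auc_resub, auc_oob, auc_632 = bootstrap_632_auc(X, y)
```

The resubstitution AUC tends to be optimistic and the out-of-bag AUC pessimistic, which is the motivation for weighting the two; the 0.632+ variant additionally adapts the weights to the degree of overfitting.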
Digital tomosynthesis mammography (DTM) is a promising new modality for breast cancer detection. In DTM, projection-view images are acquired at a limited number of angles over a limited angular range and the imaged volume is reconstructed from the two-dimensional projections, thus providing three-dimensional structural information of the breast tissue. In this work, we investigated three representative reconstruction methods for this limited-angle cone-beam tomographic problem, including the backprojection (BP) method, the simultaneous algebraic reconstruction technique (SART) and the maximum likelihood method with the convex algorithm (ML-convex). The SART and ML-convex methods were both initialized with BP results to achieve efficient reconstruction. A second generation GE prototype tomosynthesis mammography system with a stationary digital detector was used for image acquisition. Projection-view images were acquired from 21 angles in 3° increments over a ±30° angular range. We used an American College of Radiology phantom and designed three additional phantoms to evaluate the image quality and reconstruction artifacts. In addition to visual comparison of the reconstructed images of different phantom sets, we employed the contrast-to-noise ratio (CNR), a line profile of features, an artifact spread function (ASF), a relative noise power spectrum (NPS), and a line object spread function (LOSF) to quantitatively evaluate the reconstruction results. It was found that for the phantoms with homogeneous background, the BP method resulted in less noisy tomosynthesized images and higher CNR values for masses than the SART and ML-convex methods. However, the two iterative methods provided greater contrast enhancement for both masses and calcification, sharper LOSF, and reduced interplane blurring and artifacts with better ASF behaviors for masses. For a contrast-detail phantom with heterogeneous tissue-mimicking background, the BP method had strong blurring artifacts along the x-ray source motion direction that obscured the contrast-detail objects, while the other two methods can remove the superimposed breast structures and significantly improve object conspicuity. With a properly selected relaxation parameter, the SART method with one iteration can provide tomosynthesized images comparable to those obtained from the ML-convex method with seven iterations, when BP results were used as initialization for both methods.
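To make the role of the relaxation parameter and of a single SART iteration more concrete, the sketch below applies the standard SART update, one projection view at a time, to a toy linear system. It is not the reconstruction code of the study: the random system matrix stands in for a real cone-beam projector, the starting image is all zeros rather than a BP result (a BP image could be passed as `x0`), and the function name, the relaxation value of 0.5, and the problem sizes are hypothetical. The first line also checks the acquisition geometry stated above: 3° steps over ±30° give 21 views.

```python
import numpy as np

# 21 projection angles in 3-degree increments over a +/-30 degree range.
angles_deg = np.arange(-30, 31, 3)

def sart(A, b, view_of_row, x0, relax=0.5, n_iter=1):
    """SART: update the image after each projection view.

    A: system matrix (rays x voxels), b: measured projections,
    view_of_row: view index of each ray, x0: starting image.
    One iteration is one pass through all views.
    """
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        for v in np.unique(view_of_row):
            rows = view_of_row == v
            A_v = A[rows]
            row_sums = A_v.sum(axis=1)  # per-ray normalization
            col_sums = A_v.sum(axis=0)  # per-voxel normalization
            resid = (b[rows] - A_v @ x) / np.where(row_sums > 0, row_sums, 1.0)
            x += relax * (A_v.T @ resid) / np.where(col_sums > 0, col_sums, 1.0)
    return x

# Toy problem (assumption): 8 rays per view, 16 voxels, noiseless data.
rng = np.random.default_rng(0)
n_rays, n_vox = 8, 16
A = rng.random((len(angles_deg) * n_rays, n_vox))
view_of_row = np.repeat(np.arange(len(angles_deg)), n_rays)
x_true = rng.random(n_vox)
b = A @ x_true

x0 = np.zeros(n_vox)
x1 = sart(A, b, view_of_row, x0, relax=0.5, n_iter=1)
res0 = np.linalg.norm(b - A @ x0)
res1 = np.linalg.norm(b - A @ x1)
```

Comparing `res1` with `res0` shows how much of the data mismatch a single pass removes in this toy setting; it says nothing about image quality on real tomosynthesis data, where the choice of relaxation parameter trades convergence speed against noise.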