Abstract
In diagnostic medicine, the true disease status of a patient is often represented on an ordinal scale, for example, cancer stage (0, I, II, III, or IV) or coronary artery disease severity measured using the Coronary Artery Disease Reporting and Data System (CAD-RADS) scale (none, minimal, mild, moderate, severe, or occluded). With advances in quantitation of diagnostic images and in artificial intelligence (AI), both supervised and unsupervised algorithms are being developed to help physicians correctly grade disease. Most of the diagnostic accuracy literature deals with binary disease status (disease present or absent); however, tests diagnosing ordinal-scaled diseases should not be reduced to a binary status just to simplify diagnostic accuracy testing. In this paper, we propose different characterizations of ordinal-scale accuracy for different clinical use scenarios, along with methods for comparing tests. In the simplest scenario, just the proportion of correct grades is considered; other scenarios address the magnitude and direction of misgrading; and at the other extreme, a weighted accuracy measure with weights based on the relative costs of different types of misgrading is presented. The various scenarios are illustrated using a coronary artery disease example where the accuracy of AI algorithms in providing patients with the correct CAD-RADS grade is assessed.
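The scenarios described in this abstract (proportion correct, direction and magnitude of misgrading, and a weighted accuracy) can be illustrated with a minimal Python sketch. The grade coding (0 to 5 for the six CAD-RADS categories), the example data, and the linear weights are all assumptions made here for illustration; the paper bases its weights on the relative costs of misgrading, which are not given in the abstract.

```python
def ordinal_accuracy(truth, pred, n_grades):
    """Summarize agreement between true and assigned ordinal grades.

    Grades are assumed to be coded as integers 0 .. n_grades - 1.
    """
    if len(truth) != len(pred) or not truth:
        raise ValueError("truth and pred must be non-empty and the same length")
    n = len(truth)
    pairs = list(zip(truth, pred))

    # Simplest scenario: proportion of exactly correct grades.
    exact = sum(t == p for t, p in pairs) / n

    # Direction of misgrading: assigned grade above or below the true grade.
    overgraded = sum(p > t for t, p in pairs) / n
    undergraded = sum(p < t for t, p in pairs) / n

    # Magnitude of misgrading: mean absolute distance in grade steps.
    mean_abs_error = sum(abs(p - t) for t, p in pairs) / n

    # Weighted accuracy with HYPOTHETICAL linear weights: full credit for an
    # exact grade, decreasing linearly to zero at the maximum possible error.
    weighted = sum(1 - abs(p - t) / (n_grades - 1) for t, p in pairs) / n

    return {
        "exact": exact,
        "overgraded": overgraded,
        "undergraded": undergraded,
        "mean_abs_error": mean_abs_error,
        "weighted": weighted,
    }


# Made-up example: 8 patients, six grades coded 0..5.
true_grades = [0, 1, 2, 3, 4, 5, 2, 3]
ai_grades = [0, 1, 3, 3, 4, 4, 2, 1]
summary = ordinal_accuracy(true_grades, ai_grades, n_grades=6)
print(summary)
```

Cost-based weights would replace the linear expression with a lookup in a cost table indexed by true and assigned grade, which allows undergrading and overgrading to be penalized differently.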
The design and analysis of multireader multicase (MRMC) studies are quite challenging. These studies differ from most medical studies because they need a reference standard and sampling from two populations (ie, reader and patient populations). They are quite expensive to conduct, requiring a good deal of readers' time for image interpretation. One common problem is the use of imperfect reference standards, often correlated with the test or tests being evaluated. Another common issue is oversimplification of the multidimensional MRMC data. In this study, the fundamentals of MRMC study design and analysis are reviewed. The goal is to provide investigators with a guide to the fundamentals of MRMC design and analysis, with references to more detailed discussions. In addition, readers are updated on newer areas of research, including correction for studies with multiple diagnostic accuracy end points and adjustment for location bias.
Sensitivity and specificity are the basic measures of accuracy of a diagnostic test; however, they depend on the cut point used to define "positive" and "negative" test results. As the cut point shifts, sensitivity and specificity shift. The receiver operating characteristic (ROC) curve is a plot of the sensitivity of a test versus its false-positive rate for all possible cut points. The advantages of the ROC curve as a means of defining the accuracy of a test, construction of the ROC, and identification of the optimal cut point on the ROC curve are discussed. Several summary measures of the accuracy of a test, including the commonly used percentage of correct diagnoses and area under the ROC curve, are described and compared. Two examples of ROC curve application in radiologic research are presented.
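As a rough illustration of how an empirical ROC curve is built, the Python sketch below sweeps every observed test value as a cut point, records the (false-positive rate, sensitivity) pair at each, and integrates with the trapezoidal rule. The scores and labels are invented, and the rule "positive if score >= cut point" is an assumption; the article itself may construct or smooth the curve differently.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, sensitivity) pairs for every cut point.

    A case is called positive when its score is >= the cut point.
    labels: 1 = disease present, 0 = disease absent.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one diseased and one non-diseased case")

    points = [(0.0, 0.0)]  # cut point above every score: nothing called positive
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up 5-point confidence ratings for 4 diseased and 4 non-diseased cases.
ratings = [1, 2, 3, 4, 5, 2, 3, 4]
disease = [0, 0, 0, 1, 1, 0, 1, 1]
curve = roc_points(ratings, disease)
auc = auc_trapezoid(curve)
print(curve, auc)
```

Each point on the returned curve corresponds to one choice of cut point, which makes the trade-off between sensitivity and false-positive rate explicit.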
Physiological properties of tumors can be measured both in vivo and noninvasively by diffusion‐weighted imaging and dynamic contrast‐enhanced magnetic resonance imaging. Although these techniques have been used for more than two decades to study tumor diffusion, perfusion, and/or permeability, methods and studies on how to reduce measurement error and bias in the derived imaging metrics are still lacking in the literature. This is of paramount importance because the objective is to translate these quantitative imaging biomarkers (QIBs) into clinical trials, and ultimately into clinical practice. Standardization of the image acquisition using appropriate phantoms is the first step from a technical performance standpoint. The next step is to assess whether the imaging metrics have clinical value and meet the requirements for being a QIB as defined by the Radiological Society of North America's Quantitative Imaging Biomarkers Alliance (QIBA). The goal and mission of QIBA and the National Cancer Institute Quantitative Imaging Network (QIN) initiatives are to provide technical performance standards (QIBA profiles) and QIN tools for producing reliable QIBs for use in the clinical imaging community. Some of QIBA's development of quantitative diffusion‐weighted imaging and dynamic contrast‐enhanced QIB profiles has been hampered by the lack of literature for repeatability and reproducibility of the derived QIBs. The available research on this topic is scant and is not in sync with improvements or upgrades in MRI technology over the years. This review focuses on the need for QIBs in oncology applications and emphasizes the importance of the assessment of their reproducibility and repeatability.
Level of Evidence: 5
Technical Efficacy Stage: 1
J. Magn. Reson. Imaging 2019;49:e101–e121.
Although investigators in the imaging community have been active in developing and evaluating quantitative imaging biomarkers (QIBs), the development and implementation of QIBs have been hampered by the inconsistent or incorrect use of terminology or methods for technical performance and statistical concepts. Technical performance is an assessment of how a test performs in reference objects or subjects under controlled conditions. In this article, some of the relevant statistical concepts are reviewed, methods that can be used for evaluating and comparing QIBs are described, and some of the technical performance issues related to imaging biomarkers are discussed. More consistent and correct use of terminology and study design principles will improve clinical research, advance regulatory science, and foster better care for patients who undergo imaging studies.
Purpose To determine the linearity, bias, and precision of hepatic proton density fat fraction (PDFF) measurements by using magnetic resonance (MR) imaging across different field strengths, imager manufacturers, and reconstruction methods. Materials and Methods This meta-analysis was performed in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. A systematic literature search identified studies that evaluated the linearity and/or bias of hepatic PDFF measurements by using MR imaging (hereafter, MR imaging-PDFF) against PDFF measurements by using colocalized MR spectroscopy (hereafter, MR spectroscopy-PDFF) or the precision of MR imaging-PDFF. The quality of each study was evaluated by using the Quality Assessment of Studies of Diagnostic Accuracy 2 tool. De-identified original data sets from the selected studies were pooled. Linearity was evaluated by using linear regression between MR imaging-PDFF and MR spectroscopy-PDFF measurements. Bias, defined as the mean difference between MR imaging-PDFF and MR spectroscopy-PDFF measurements, was evaluated by using Bland-Altman analysis. Precision, defined as the agreement between repeated MR imaging-PDFF measurements, was evaluated by using a linear mixed-effects model, with field strength, imager manufacturer, reconstruction method, and region of interest as random effects. Results Twenty-three studies (1679 participants) were selected for linearity and bias analyses and 11 studies (425 participants) were selected for precision analyses. MR imaging-PDFF was linear with MR spectroscopy-PDFF (R² = 0.96). Regression slope (0.97; P < .001) and mean Bland-Altman bias (-0.13%; 95% limits of agreement: -3.95%, 3.40%) indicated minimal underestimation by using MR imaging-PDFF. MR imaging-PDFF was precise at the region-of-interest level, with repeatability and reproducibility coefficients of 2.99% and 4.12%, respectively. Field strength, imager manufacturer, and reconstruction method each had minimal effects on reproducibility. Conclusion MR imaging-PDFF has excellent linearity, bias, and precision across different field strengths, imager manufacturers, and reconstruction methods.
RSNA, 2017. Online supplemental material is available for this article. An earlier incorrect version of this article appeared online. This article was corrected on October 2, 2017.
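The bias definition used in the preceding abstract (mean difference between MR imaging-PDFF and MR spectroscopy-PDFF, with 95% limits of agreement) can be sketched in a few lines of Python. This is the textbook single-sample Bland-Altman calculation on invented paired values; it is not the pooled, multi-study analysis the meta-analysis performed, and the function name is hypothetical.

```python
import statistics


def bland_altman(test_values, reference_values):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias   = mean of (test - reference)
    limits = bias -/+ 1.96 * SD of the differences (95% limits of agreement)
    """
    if len(test_values) != len(reference_values) or len(test_values) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [t - r for t, r in zip(test_values, reference_values)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Invented PDFF values in percent, for illustration only.
mri_pdff = [5.0, 10.0, 15.0, 20.0]
mrs_pdff = [5.5, 10.0, 15.5, 20.0]
bias, loa_low, loa_high = bland_altman(mri_pdff, mrs_pdff)
print(bias, loa_low, loa_high)
```

A negative bias in this convention means the test (MR imaging) reads lower than the reference (MR spectroscopy) on average.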
Quantitative imaging biomarkers (QIBs) are becoming increasingly adopted into clinical practice to monitor changes in patients' conditions. The repeatability coefficient (RC) is the clinical cut-point used to discern between changes in a biomarker's measurements due to measurement error and changes that exceed measurement error, thus indicating real change in the patient. Imaging biomarkers have characteristics that make them difficult for estimating the repeatability coefficient, including nonconstant error, non-Gaussian distributions, and measurement error that must be estimated from small studies.
We conducted a Monte Carlo simulation study to investigate how well three statistical methods for estimating the repeatability coefficient perform under five settings common for QIBs.
When the measurement error is constant and replicates are normally distributed, all of the statistical methods perform well. When the measurement error is proportional to the true value, approaches that use the log transformation or coefficient of variation perform similarly. For other common settings, none of the methods for estimating the repeatability coefficient perform adequately.
Many of the common approaches to estimating the repeatability coefficient perform well for only limited scenarios. The optimal approach depends strongly on the pattern of the within-subject variability; thus, a precision profile is critical in evaluating the technical performance of QIBs. Asymmetric bounds for detecting regression vs progression can be implemented and should be used when clinically appropriate.
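For the simplest setting named above (constant measurement error, normally distributed replicates), a commonly used estimator is RC = 1.96 × √2 × wSD, where wSD is the within-subject standard deviation. The Python sketch below implements that estimator on invented test-retest data. It is only one conventional approach, not necessarily one of the three methods compared in the simulation study, and it would not be appropriate when error grows with the true value.

```python
import math
import statistics


def repeatability_coefficient(replicates):
    """Estimate RC = 1.96 * sqrt(2) * wSD from repeated measurements.

    replicates: one sequence of repeated measurements per subject
    (at least two per subject). Assumes constant within-subject variance.
    """
    if not replicates or any(len(r) < 2 for r in replicates):
        raise ValueError("each subject needs at least two replicates")
    # Within-subject variance: average of the per-subject sample variances.
    within_var = statistics.mean(statistics.variance(r) for r in replicates)
    return 1.96 * math.sqrt(2 * within_var)


# Invented test-retest pairs for four subjects.
test_retest = [(10, 12), (20, 19), (15, 15), (30, 33)]
rc = repeatability_coefficient(test_retest)
print(rc)
```

Under these assumptions, a difference between two measurements on the same patient that exceeds the RC in absolute value would be read as a change beyond measurement error with 95% confidence.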
Purpose To perform a meta-analysis to generate an estimate of the repeatability coefficient (RC) for magnetic resonance (MR) elastography of the liver. Materials and Methods A systematic search of databases was performed for publications on MR elastography during the 10-year period between 2006 and 2015. The identified studies were screened independently and were verified reciprocally by all authors. Two reviewers independently determined the percentage RC and effective sample size from each article. A forest plot was constructed of the percentage RC estimates from the 12 studies. Bootstrap 95% confidence intervals (CIs) were constructed for the summary percentage RCs. Results Twelve studies comprising 274 patients met the eligibility criteria and were included for analysis. A flow diagram of studies included according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines was prepared for the inclusion and exclusion criteria. All studies included in the meta-analysis fulfilled four or more of the seven categories of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-2. The estimated summary RC was 22% (95% CI: 16.1%, 28.2%). The three main sources for this heterogeneity were the trained versus untrained operator drawing contours to choose regions of interest, the time between two replicate examinations, and, finally, the field strength of the MR imaging unit. The RC estimates tended to be higher for studies that did not use a well-trained operator, those with 1.5-T field strength imaging units, and those with longer time intervals between examinations. Conclusion The meta-analysis results provide the basis for the following draft longitudinal Quantitative Imaging Biomarkers Alliance MR elastography claim: A measured change in hepatic stiffness of 22% or greater, at the same site and with use of the same equipment and acquisition sequence, indicates that a true change in stiffness has occurred with 95% confidence.
RSNA, 2017.
The Quantitative Imaging Biomarkers Alliance (QIBA) Profile for fluorodeoxyglucose (FDG) PET/CT imaging was created by QIBA to both characterize and reduce the variability of standardized uptake values (SUVs). The Profile provides two complementary claims on the precision of SUV measurements. First, tumor glycolytic activity as reflected by the maximum SUV (SUVmax) is measurable from FDG PET/CT with a within-subject coefficient of variation of 10%-12%. Second, a measured increase in SUVmax of 39% or more, or a decrease of 28% or more, indicates that a true change has occurred with 95% confidence. Two applicable use cases are clinical trials and following individual patients in clinical practice. Other components of the Profile address the protocols and conformance standards considered necessary to achieve the performance claim. The Profile is intended for use by a broad audience; applications can range from discovery science through clinical trials to clinical practice. The goal of this report is to provide a rationale and overview of the FDG PET/CT Profile claims as well as its context, and to outline future needs and potential developments.
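One way to see how a within-subject coefficient of variation (wCV) can lead to asymmetric change thresholds is the Python sketch below. It assumes that repeated measurements are log-normally distributed, so that the repeatability coefficient is symmetric on the log scale and becomes asymmetric when transformed back to percent change. That distributional assumption and the function name are introduced here for illustration; the Profile's own derivation may differ in detail.

```python
import math


def asymmetric_change_bounds(wcv):
    """Percent-change thresholds for a true change at 95% confidence.

    Assumes log-normal measurement error with within-subject coefficient
    of variation `wcv` (e.g. 0.12 for 12%).
    Returns (increase_threshold, decrease_threshold) as fractions.
    """
    if wcv <= 0:
        raise ValueError("wcv must be positive")
    # SD on the natural-log scale that corresponds to the given wCV.
    sigma_log = math.sqrt(math.log(1 + wcv ** 2))
    # Repeatability coefficient on the log scale.
    rc_log = 1.96 * math.sqrt(2) * sigma_log
    increase = math.exp(rc_log) - 1   # back-transformed upper bound
    decrease = 1 - math.exp(-rc_log)  # back-transformed lower bound
    return increase, decrease


up, down = asymmetric_change_bounds(0.12)
print(f"increase threshold: {up:.1%}, decrease threshold: {down:.1%}")
```

With a wCV of 12%, this calculation gives thresholds that round to the 39% increase and 28% decrease quoted in the claim.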