Abstract. Objectives: To collect reasons for selecting the methods for meta-analysis of diagnostic accuracy from authors of systematic reviews and improve guidance on recommended methods. Study Design and Setting: Online survey of authors of recently published meta-analyses of diagnostic accuracy. Results: We identified 100 eligible reviews, of which 40 had used more advanced methods of meta-analysis (hierarchical random-effects approach), 52 more traditional methods (summary receiver operating characteristic curve based on linear regression or a univariate approach), and 8 combined both. Fifty-nine authors responded to the survey; 29 (49%) authors had used advanced methods, 25 (42%) authors traditional methods, and 5 (9%) authors combined traditional and advanced methods. Most authors who had used advanced methods reported doing so because they believed that these methods are currently recommended (n = 27; 93%). Most authors who had used traditional methods also reported doing so because they believed that these methods are currently recommended (n = 18; 75%) or easy to understand (n = 18; 75%). Conclusion: Although more advanced methods for meta-analysis are recommended by The Cochrane Collaboration, both authors using these methods and those using more traditional methods responded that the methods they used were currently recommended. Clearer and more widespread dissemination of guidelines on recommended methods for meta-analysis of test accuracy data is needed.
The clinical implications of SARS-CoV-2 infection are highly variable. Some people with SARS-CoV-2 infection remain asymptomatic, whilst the infection can cause mild to moderate COVID-19 and COVID-19 pneumonia in others. This can lead to some people requiring intensive care support and, in some cases, to death, especially in older adults. Symptoms such as fever, cough, or loss of smell or taste, and signs such as oxygen saturation are the first and most readily available diagnostic information. Such information could be used to either rule out COVID-19, or select patients for further testing. This is an update of this review, the first version of which was published in July 2020.
To assess the diagnostic accuracy of signs and symptoms to determine if a person presenting in primary care or to hospital outpatient settings, such as the emergency department or dedicated COVID-19 clinics, has COVID-19.
For this review iteration we undertook electronic searches up to 15 July 2020 in the Cochrane COVID-19 Study Register and the University of Bern living search database. In addition, we checked repositories of COVID-19 publications. We did not apply any language restrictions.
Studies were eligible if they included patients with clinically suspected COVID-19, or if they recruited known cases with COVID-19 and controls without COVID-19. Studies were eligible when they recruited patients presenting to primary care or hospital outpatient settings. Studies in hospitalised patients were only included if symptoms and signs were recorded on admission or at presentation. Studies including patients who contracted SARS-CoV-2 infection while admitted to hospital were not eligible. The minimum eligible sample size of studies was 10 participants. All signs and symptoms were eligible for this review, including individual signs and symptoms or combinations. We accepted a range of reference standards.
Pairs of review authors independently selected all studies, at both title and abstract stage and full-text stage. They resolved any disagreements by discussion with a third review author. Two review authors independently extracted data and resolved disagreements by discussion with a third review author. Two review authors independently assessed risk of bias using the Quality Assessment tool for Diagnostic Accuracy Studies (QUADAS-2) checklist. We presented sensitivity and specificity in paired forest plots, in receiver operating characteristic space and in dumbbell plots. We estimated summary parameters using a bivariate random-effects meta-analysis whenever five or more primary studies were available, and whenever heterogeneity across studies was deemed acceptable.
We identified 44 studies including 26,884 participants in total. Prevalence of COVID-19 varied from 3% to 71% with a median of 21%. There were three studies from primary care settings (1824 participants), nine studies from outpatient testing centres (10,717 participants), 12 studies performed in hospital outpatient wards (5061 participants), seven studies in hospitalised patients (1048 participants), 10 studies in the emergency department (3173 participants), and three studies in which the setting was not specified (5061 participants). The studies did not clearly distinguish mild from severe COVID-19, so we present the results for all disease severities together. Fifteen studies had a high risk of bias for selection of participants because inclusion in the studies depended on the applicable testing and referral protocols, which included many of the signs and symptoms under study in this review. This may have especially influenced the sensitivity of those features used in referral protocols, such as fever and cough. Five studies only included participants with pneumonia on imaging, suggesting that this is a highly selected population. In an additional 12 studies, we were unable to assess the risk for selection bias. This makes it very difficult to judge the validity of the diagnostic accuracy of the signs and symptoms from these included studies. The applicability of the results of this review update improved in comparison with the original review. A greater proportion of studies included participants who presented to outpatient settings, which is where the majority of clinical assessments for COVID-19 take place. However, still none of the studies presented any data on children separately, and only one focused specifically on older adults. We found data on 84 signs and symptoms. Results were highly variable across studies. Most had very low sensitivity and high specificity. 
Only cough (25 studies) and fever (7 studies) had a pooled sensitivity of at least 50% but specificities were moderate to low. Cough had a sensitivity of 67.4% (95% confidence interval (CI) 59.8% to 74.1%) and specificity of 35.0% (95% CI 28.7% to 41.9%). Fever had a sensitivity of 53.8% (95% CI 35.0% to 71.7%) and a specificity of 67.4% (95% CI 53.3% to 78.9%). The pooled positive likelihood ratio of cough was only 1.04 (95% CI 0.97 to 1.11) and that of fever 1.65 (95% CI 1.41 to 1.93). Anosmia alone (11 studies), ageusia alone (6 studies), and anosmia or ageusia (6 studies) had sensitivities below 50% but specificities over 90%. Anosmia had a pooled sensitivity of 28.0% (95% CI 17.7% to 41.3%) and a specificity of 93.4% (95% CI 88.3% to 96.4%). Ageusia had a pooled sensitivity of 24.8% (95% CI 12.4% to 43.5%) and a specificity of 91.4% (95% CI 81.3% to 96.3%). Anosmia or ageusia had a pooled sensitivity of 41.0% (95% CI 27.0% to 56.6%) and a specificity of 90.5% (95% CI 81.2% to 95.4%). The pooled positive likelihood ratios of anosmia alone and anosmia or ageusia were 4.25 (95% CI 3.17 to 5.71) and 4.31 (95% CI 3.00 to 6.18) respectively, which is just below our arbitrary definition of a 'red flag', that is, a positive likelihood ratio of at least 5. The pooled positive likelihood ratio of ageusia alone was only 2.88 (95% CI 2.02 to 4.09). Only two studies assessed combinations of different signs and symptoms, mostly combining fever and cough with other symptoms. These combinations had a specificity above 80%, but at the cost of very low sensitivity (< 30%).
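The relationship between the pooled sensitivities, specificities and positive likelihood ratios reported above can be illustrated with a short calculation. The sketch below is not the bivariate random-effects model used in the review; it only applies the standard definition LR+ = sensitivity / (1 - specificity) to the pooled point estimates, so small rounding differences from the reported values are expected and no confidence intervals are produced. The function and variable names are illustrative.

```python
# Positive likelihood ratio from pooled point estimates (illustrative only).
# This is a plug-in calculation, NOT the review's bivariate meta-analysis.

RED_FLAG_THRESHOLD = 5.0  # the review's "red flag" definition: LR+ of at least 5

# (sensitivity, specificity) as proportions, taken from the pooled results above
POOLED = {
    "cough": (0.674, 0.350),
    "fever": (0.538, 0.674),
    "anosmia": (0.280, 0.934),
    "ageusia": (0.248, 0.914),
    "anosmia or ageusia": (0.410, 0.905),
}


def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Return LR+ = sensitivity / (1 - specificity) for proportions in [0, 1]."""
    if not (0.0 <= sensitivity <= 1.0) or not (0.0 <= specificity < 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in [0, 1)")
    return sensitivity / (1.0 - specificity)


def summarize(pooled=POOLED):
    """Map each feature to its plug-in LR+ and whether it reaches the red-flag threshold."""
    result = {}
    for name, (sens, spec) in pooled.items():
        lr_pos = positive_likelihood_ratio(sens, spec)
        result[name] = (lr_pos, lr_pos >= RED_FLAG_THRESHOLD)
    return result


if __name__ == "__main__":
    for feature, (lr_pos, is_red_flag) in summarize().items():
        print(f"{feature}: LR+ = {lr_pos:.2f}, red flag = {is_red_flag}")
```

Under this approximation none of the listed features reaches the threshold of 5, which is consistent with the review's statement that anosmia, alone or combined with ageusia, falls just below it.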
The majority of individual signs and symptoms included in this review appear to have very poor diagnostic accuracy, although this should be interpreted in the context of selection bias and heterogeneity between studies. Based on currently available data, neither absence nor presence of signs or symptoms is accurate enough to rule in or rule out COVID-19. The presence of anosmia or ageusia may be useful as a red flag for COVID-19. The presence of fever or cough, given their high sensitivities, may also be useful to identify people for further testing. Prospective studies in an unselected population presenting to primary care or hospital outpatient settings, examining combinations of signs and symptoms to evaluate the syndromic presentation of COVID-19, are still urgently needed. Results from such studies could inform subsequent management decisions.
The large and increasing number of new studies published each year is making literature identification in systematic reviews ever more time-consuming and costly. Technological assistance has been suggested as an alternative to the conventional, manual study identification to mitigate the cost, but previous literature has mainly evaluated methods in terms of recall (search sensitivity) and workload reduction. There is a need to also evaluate whether screening prioritization methods lead to the same results and conclusions as exhaustive manual screening. In this study, we examined the impact of one screening prioritization method based on active learning on sensitivity and specificity estimates in systematic reviews of diagnostic test accuracy.
We simulated the screening process in 48 Cochrane reviews of diagnostic test accuracy and re-ran 400 meta-analyses based on at least 3 studies. We compared screening prioritization (with technological assistance) and screening in randomized order (standard practice without technological assistance). We examined whether the screening could have been stopped before identifying all relevant studies while still producing reliable summary estimates. For all meta-analyses, we also examined the relationship between the number of relevant studies and the reliability of the final estimates.
The main meta-analysis in each systematic review could have been performed after screening an average of 30% of the candidate articles (range 0.07% to 100%). No systematic review would have required screening more than 2308 studies, whereas manual screening would have required screening up to 43,363 studies. Despite an average 70% recall, the estimation error would have been 1.3% on average, compared to an average 2% estimation error expected when replicating summary estimate calculations.
Screening prioritization coupled with stopping criteria in diagnostic test accuracy reviews can reliably detect when the screening process has identified a sufficient number of studies to perform the main meta-analysis with an accuracy within pre-specified tolerance limits. However, many of the systematic reviews did not identify enough studies for the meta-analyses to be accurate within a 2% limit, even with exhaustive manual screening, i.e., using current practice.
A systematic and extensive search for as many eligible studies as possible is essential in any systematic review. When searching for diagnostic test accuracy (DTA) studies in bibliographic databases, it is recommended that terms for disease (target condition) are combined with terms for the diagnostic test (index test). Researchers have developed methodological filters to try to increase the precision of these searches. These consist of text words and database indexing terms and would be added to the target condition and index test searches. Efficiently identifying reports of DTA studies presents challenges because the methods are often not well reported in their titles and abstracts, suitable indexing terms may not be available and relevant indexing terms do not seem to be consistently assigned. A consequence of using search filters to identify records for diagnostic reviews is that relevant studies might be missed, while the number of irrelevant studies that need to be assessed may not be reduced. The current guidance for Cochrane DTA reviews recommends against the addition of a methodological search filter to target condition and index test search, as the only search approach.
To systematically review empirical studies that report the development or evaluation, or both, of methodological search filters designed to retrieve DTA studies in MEDLINE and EMBASE.
We searched MEDLINE (1950 to week 1 November 2012); EMBASE (1980 to 2012 Week 48); the Cochrane Methodology Register (Issue 3, 2012); ISI Web of Science (11 January 2013); PsycINFO (13 March 2013); Library and Information Science Abstracts (LISA) (31 May 2010); and Library, Information Science & Technology Abstracts (LISTA) (13 March 2013). We undertook citation searches on Web of Science, checked the reference lists of relevant studies, and searched the Search Filters Resource website of the InterTASC Information Specialists' Sub-Group (ISSG).
Studies reporting the development or evaluation, or both, of a MEDLINE or EMBASE search filter aimed at retrieving DTA studies, which reported a measure of the filter's performance were eligible.
The main outcome was a measure of filter performance, such as sensitivity or precision. We extracted data on the identification of the reference set (including the gold standard and, if used, the non-gold standard records), how the reference set was used and any limitations, the identification and combination of the search terms in the filters, internal and external validity testing, the number of filters evaluated, the date the study was conducted, the date the searches were completed, and the databases and search interfaces used. Where 2 x 2 data were available on filter performance, we used these to calculate sensitivity, specificity, precision and Number Needed to Read (NNR), and 95% confidence intervals (CIs). We compared the performance of a filter as reported by the original development study and any subsequent studies that evaluated the same filter.
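As a rough illustration of how the performance measures named above follow from 2 x 2 data, the sketch below computes sensitivity, specificity, precision and Number Needed to Read (NNR, the reciprocal of precision) for a search filter. The review does not state which confidence interval method was used; the Wilson score interval here is an assumption, and the counts in the usage example are invented for illustration, not taken from any included study.

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (assumed method; z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def filter_performance(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Search-filter performance from a 2 x 2 table.

    tp: relevant records retrieved      fp: irrelevant records retrieved
    fn: relevant records missed         tn: irrelevant records not retrieved
    """
    if tp <= 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one retrieved relevant record and non-empty rows")
    precision = tp / (tp + fp)
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_ci(tn, tn + fp),
        "precision": precision,
        "precision_ci": wilson_ci(tp, tp + fp),
        "nnr": 1.0 / precision,  # records to read per relevant record found
    }


if __name__ == "__main__":
    # Hypothetical filter: retrieves 1000 records, 90 of the 100 relevant ones.
    perf = filter_performance(tp=90, fp=910, fn=10, tn=8990)
    for key, value in perf.items():
        print(key, value)
```

With these invented counts the filter would have 90% sensitivity and 9% precision, so it would sit at the boundary of the sensitivity threshold and just below the precision threshold that the review defined a priori as acceptable (> 90% and > 10%).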
Nineteen studies were included, reporting on 57 MEDLINE filters and 13 EMBASE filters. Thirty MEDLINE and four EMBASE filters were tested in an evaluation study where the performance of one or more filters was tested against one or more gold standards. The reported outcome measures varied. Some studies reported specificity as well as sensitivity if a reference set containing non-gold standard records in addition to gold standard records was used. In some cases, the original development study did not report any performance data on the filters. Original performance from the development study was not available for 17 filters that were subsequently tested in evaluation studies. All 19 studies reported the sensitivity of the filters that they developed or evaluated, nine studies reported the specificities and 14 studies reported the precision. No filter which had original performance data from its development study, and was subsequently tested in an evaluation study, had what we defined a priori as acceptable sensitivity (> 90%) and precision (> 10%). In studies that developed MEDLINE filters that were evaluated in another study (n = 13), the sensitivity ranged from 55% to 100% (median 86%) and specificity from 73% to 98% (median 95%). Estimates of performance were lower in eight studies that evaluated the same 13 MEDLINE filters, with sensitivities ranging from 14% to 100% (median 73%) and specificities ranging from 15% to 96% (median 81%). Precision ranged from 1.1% to 40% (median 9.5%) in studies that developed MEDLINE filters and from 0.2% to 16.7% (median 4%) in studies that evaluated these filters. A similar range of specificities and precision was reported amongst the evaluation studies for MEDLINE filters without an original performance measure.
Sensitivities ranged from 31% to 100% (median 71%), specificity ranged from 13% to 90% (median 55.5%) and precision from 1.0% to 11.0% (median 3.35%). For the EMBASE filters, the original sensitivities reported in two development studies ranged from 74% to 100% (median 90%) for three filters, and precision ranged from 1.2% to 17.6% (median 3.7%). Evaluation studies of these filters had sensitivities from 72% to 97% (median 86%) and precision from 1.2% to 9% (median 3.7%). The performance of EMBASE search filters in development and evaluation studies was more alike than the performance of MEDLINE filters in development and evaluation studies. None of the EMBASE filters in either type of study had a sensitivity above 90% and precision above 10%.
None of the current methodological filters designed to identify reports of primary DTA studies in MEDLINE or EMBASE combine sufficiently high sensitivity, required for systematic reviews, with a reasonable degree of precision. This finding supports the current recommendation in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy that the combination of methodological filter search terms with terms for the index test and target condition should not be used as the only approach when conducting formal searches to inform systematic reviews of DTA.
Most prognostic models for primary sclerosing cholangitis (PSC) are based on patients referred to tertiary care and may not be applicable for the majority of patients with PSC. The aim of this study was to construct and externally validate a novel, broadly applicable prognostic model for transplant-free survival in PSC, based on a large, predominantly population-based cohort using readily available variables.
The derivation cohort consisted of 692 patients with PSC from the Netherlands, the validation cohort of 264 patients with PSC from the UK. Clinical and biochemical variables were collected retrospectively. We derived the prognostic index from a multivariable Cox regression model in which predictors were selected and parameters were estimated using the least absolute shrinkage and selection operator. The composite end point of PSC-related death and liver transplantation was used. To quantify the model's predictive value, we calculated the C-statistic as discrimination index and established its calibration accuracy by comparing predicted curves with Kaplan-Meier estimates.
The final model included the variables: PSC subtype, age at PSC diagnosis, albumin, platelets, aspartate aminotransferase, alkaline phosphatase and bilirubin. The C-statistic was 0.68 (95% CI 0.51 to 0.85). Calibration was satisfactory. The model was robust in the sense that the C-statistic did not change when prediction was based on biochemical variables collected at follow-up.
The Amsterdam-Oxford model for PSC showed adequate performance in estimating PSC-related death and/or liver transplant in a predominantly population-based setting. The transplant-free survival probability can be recalculated when updated biochemical values are available.
(1) To identify and classify comparative diagnostic test accuracy (DTA) study designs; (2) to describe study design labels used by authors of comparative DTA studies.
We performed a methodological review of 100 comparative DTA studies published between 2015 and 2017, randomly sampled from studies included in 238 comparative DTA systematic reviews indexed in MEDLINE in 2017. From each study report, we extracted six design elements characterizing participant flow and the labels used by authors.
We identified a total of 46 unique combinations of study design features in our sample, based on six design elements characterizing participant flow. We classified the studies into five study design categories based on how participants were allocated to receive each index test: ‘fully paired’ (n=79), ‘partially paired, random subset’ (n=0), ‘partially paired, nonrandom subset’ (n=2), ‘unpaired randomized’ (n=1) and ‘unpaired nonrandomized’ (n=3). The allocation method used in 15 studies was unclear. Sixty-one studies reported, in total, 29 unique study design labels but only four labels referred to specific design features of comparative studies.
Our classification scheme can help systematic review authors define study eligibility criteria, assess risk of bias, and communicate the strength of the evidence. A standardized labelling scheme could be developed to facilitate communication of specific design features.
An assessment of the validity of individual diagnostic accuracy studies in systematic reviews is necessary to guide the analysis and the interpretation of results. Such an assessment is performed for each included study and typically reported at the study level. As studies may differ in sample size and disease prevalence, with larger studies contributing more to the meta-analysis, such a study-level report does not always reflect the risk of bias in the total body of evidence. We aimed to develop improved methods of presenting the risk of bias in the available evidence on diagnostic accuracy of medical tests in systematic reviews, reflecting the relative contribution of the study to the body of evidence in the review.
We applied alternative methods to represent evaluations with the Quality Assessment of Diagnostic Accuracy Studies tool (QUADAS-2), weighting studies according to their relative contribution to the total sample size or their relative effective sample size. We used these methods in four existing systematic reviews of diagnostic accuracy studies, including 9, 13, 22, and 32 studies, respectively.
The risk-of-bias summaries for each domain of the QUADAS-2 checklist changed in all four sets of studies after replacing unit weights for the studies with relative sample sizes or with the relative effective sample size. As an example, the risk of bias was high in the patient selection domain in 31% of the studies in one review, unclear in 23% and low in 46% of studies. Weighting studies according to the relative sample size changed the corresponding proportions to 4%, 4%, and 92%, respectively. The difference between the two weighting methods was small and more noticeable when the reviews included a smaller number of studies with a wider range of sample sizes.
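The weighting idea described above can be sketched in a few lines. The example below compares a unit-weight summary (each study counts once) with a summary weighted by relative sample size for a single QUADAS-2 domain. The judgements and sample sizes are invented: they were chosen only so that 13 hypothetical studies reproduce the proportions quoted in the example (31%, 23%, 46% unweighted; 4%, 4%, 92% weighted) and do not come from the review in question. The function name is illustrative.

```python
from collections import defaultdict


def risk_of_bias_summary(judgements, weights=None):
    """Share of the evidence in each risk-of-bias category for one domain.

    judgements: one label per study, e.g. "high", "unclear", "low".
    weights:    one non-negative weight per study; None means unit weights.
                Any weighting scheme can be passed in (for example sample
                sizes, or effective sample sizes computed elsewhere).
    """
    if weights is None:
        weights = [1.0] * len(judgements)
    if len(weights) != len(judgements):
        raise ValueError("need exactly one weight per study")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    shares = defaultdict(float)
    for label, weight in zip(judgements, weights):
        shares[label] += weight / total
    return dict(shares)


# Hypothetical domain judgements for 13 studies (4 high, 3 unclear, 6 low).
JUDGEMENTS = ["high"] * 4 + ["unclear"] * 3 + ["low"] * 6
# Hypothetical sample sizes: small studies at high/unclear risk, large ones at low risk.
SAMPLE_SIZES = [20, 20, 20, 20, 25, 25, 30, 1000, 400, 200, 120, 80, 40]

if __name__ == "__main__":
    unweighted = risk_of_bias_summary(JUDGEMENTS)
    weighted = risk_of_bias_summary(JUDGEMENTS, SAMPLE_SIZES)
    for label in ("high", "unclear", "low"):
        print(f"{label}: {unweighted[label]:.0%} unweighted, {weighted[label]:.0%} weighted")
```

The shift in this toy example arises because the few large studies all carry a low-risk judgement, which is the kind of situation where a study-level tally and a weighted summary would be expected to diverge most.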
We present an alternative way of presenting the results of risk-of-bias assessments in systematic reviews of diagnostic accuracy studies. Weighting studies according to their relative sample size or their relative effective sample size can be used as more informative summaries of the risk of bias in the total body of available evidence.
The value of a medical test depends on the context in which it might be used. Ideally, questions, results and conclusions of a diagnostic test accuracy (DTA) systematic review should be presented in light of this context. There is increasing acceptance of the value of knowing the impact a test can have on downstream consequences such as costs, implications for further testing and treatment options; however, there is currently no explicit guidance on how to address this. Authors of a Cochrane diagnostic review have recently been asked to include the clinical pathway in which a test may be used. We aimed to evaluate how authors were developing their clinical pathways in the light of this.
We searched the Cochrane Database of Systematic Reviews for all published DTA reviews. We included only those reviews that included a clinical pathway. We developed a checklist, based on the guidance in the Cochrane Handbook for DTA review authors. To this, we added a number of additional descriptors. We checked if the included pathways fulfilled these descriptors as defined by our checklist.
We found 47 reviews, of which 33 (73%) contained aspects pertaining to a clinical pathway. The 33 reviews addressed the clinical pathway differently, both in content and format. Of these, 21 provided a textual description and 12 included visual and textual descriptions. There was considerable variation in how comprehensively review authors adhered to our checklist. Eighteen reviews (51%) linked the index test results to downstream clinical management actions and patient consequences, but only eight went on to differentially report on the consequences for false negative results and nine on the consequences for false positive results.
There is substantial variation in the clinical pathway descriptions in Cochrane systematic reviews of test accuracy. Most reviews do not link misclassifications (i.e. false negatives and false positives) to downstream patient consequences. Review authors could benefit from more explicit guidance on how to create such pathways, which in turn can help guide them in their evidence selection and appraisal of the evidence in the context of downstream consequences of testing.
Background
Mucosal Leishmaniasis (ML), a neglected tropical disease caused by Leishmania parasites, impairs the quality of life of under-resourced populations in South America. If not treated promptly, this disease progresses to facial deformities and death. The low sensitivity of microscopy results and the unavailability of other accurate tests hamper the diagnosis. As clinical criteria are readily available in any setting, these may be combined in a syndromic algorithm, which in turn can be used as a diagnostic tool. We explore potential clinical criteria for a syndromic diagnostic algorithm for ML in rural healthcare settings in South America.
Methodology/Principal findings
The protocol for this systematic review was pre-registered in PROSPERO with the number: CRD42017074148. In patients with ML, described in case series identified through a systematic retrieval process, we explored the cumulative ML detection rates of clinical criteria. Participants: all patients with active mucosal disease from an endemic area in South America. Any original, non-treatment study was eligible, and case reports were excluded. PUBMED, EMBASE, Web of Science, SCIELO, and LILACS databases were searched without restrictions. The risk of bias was assessed with the JBI checklist for case series. We included 10 full texts describing 192 ML patients. Male gender had the highest detection rate (88%), followed by ulcer of the nasal mucosa (77%), age >15 (69%), and symptom duration >4 months (63%).
Significance
Within this selection of patients, we found that male gender, ulcer of the nasal mucosa, age >15, and symptom duration >4 months led to the highest detection rates. However, higher detection naturally comes with a higher rate of false positives as well. As we only included ML patients, this could not be verified. Therefore, the criteria that we found to be most promising should be validated in a well-designed prospective study.