Psychology's Renaissance
Nelson, Leif D; Simmons, Joseph; Simonsohn, Uri
Annual review of psychology, 01/2018, Volume 69, Issue 1
Journal Article
Peer reviewed
Open access
In 2010-2012, a few largely coincidental events led experimental psychologists to realize that their approach to collecting, analyzing, and reporting data made it too easy to publish false-positive findings. This sparked a period of methodological reflection that we review here and call Psychology's Renaissance. We begin by describing how psychologists' concerns with publication bias shifted from worrying about file-drawered studies to worrying about p-hacked analyses. We then review the methodological changes that psychologists have proposed and, in some cases, embraced. In describing how the renaissance has unfolded, we attempt to describe different points of view fairly but not neutrally, so as to identify the most promising paths forward. In so doing, we champion disclosure and preregistration, express skepticism about most statistical solutions to publication bias, take positions on the analysis and interpretation of replication failures, and contend that meta-analytical thinking increases the prevalence of false positives. Our general thesis is that the scientific practices of experimental psychologists have improved dramatically.
It is widely acknowledged that the biomedical literature suffers from a surfeit of false positive results. Part of the reason for this is the persistence of the myth that observation of p < 0.05 is sufficient justification to claim that you have made a discovery. It is hopeless to expect users to change their reliance on p-values unless they are offered an alternative way of judging the reliability of their conclusions. If the alternative method is to have a chance of being adopted widely, it will have to be easy to understand and to calculate. One such proposal is based on calculation of false positive risk (FPR). It is suggested that p-values and confidence intervals should continue to be given, but that they should be supplemented by a single additional number that conveys the strength of the evidence better than the p-value. This number could be the minimum FPR (that calculated on the assumption of a prior probability of 0.5, the largest value that can be assumed in the absence of hard prior data). Alternatively, one could specify the prior probability that it would be necessary to believe in order to achieve an FPR of, say, 0.05.
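The abstract above does not give a formula for the FPR, so the following is only a minimal sketch of the idea using the textbook Bayes'-rule version, which conditions on "p < alpha" rather than on the exact p-value observed. The function names, the significance level, and the assumed power of 0.8 are illustrative assumptions, not values from the abstract; a calculation conditioned on the observed p-value would generally give different (often larger) numbers.

```python
def false_positive_risk(alpha, power, prior):
    """Share of 'significant' results that are false positives.

    alpha: significance threshold (probability of p < alpha under the null)
    power: probability of p < alpha when a real effect exists (assumed)
    prior: prior probability that a real effect exists
    """
    false_pos = alpha * (1.0 - prior)   # null is true, test is significant
    true_pos = power * prior            # effect is real, test is significant
    return false_pos / (false_pos + true_pos)


def prior_needed(alpha, power, target_fpr):
    """Prior probability of a real effect required to reach target_fpr.

    Obtained by solving false_positive_risk(...) == target_fpr for the prior.
    """
    num = alpha * (1.0 - target_fpr)
    return num / (num + target_fpr * power)


if __name__ == "__main__":
    # FPR at the most favourable prior mentioned in the abstract (0.5).
    print(false_positive_risk(alpha=0.05, power=0.8, prior=0.5))
    # Prior one would have to believe in to reach an FPR of 0.05.
    print(prior_needed(alpha=0.05, power=0.8, target_fpr=0.05))
```

Under these assumptions, even a prior of 0.5 leaves the FPR slightly above 0.05, which is the abstract's point that p < 0.05 alone is weaker evidence than it appears.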
Cluster failure
Eklund, Anders; Nichols, Thomas E.; Knutsson, Hans
Proceedings of the National Academy of Sciences - PNAS, 07/2016, Volume 113, Issue 28
Journal Article
Peer reviewed
Open access
The most widely used task functional magnetic resonance imaging (fMRI) analyses use parametric statistical methods that depend on a variety of assumptions. In this work, we use real resting-state data and a total of 3 million random task group analyses to compute empirical familywise error rates for the fMRI software packages SPM, FSL, and AFNI, as well as a nonparametric permutation method. For a nominal familywise error rate of 5%, the parametric statistical methods are shown to be conservative for voxelwise inference and invalid for clusterwise inference. Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape. By comparison, the nonparametric permutation test is found to produce nominal results for voxelwise as well as clusterwise inference. These findings speak to the need of validating the statistical methods being used in the field of neuroimaging.
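To make the nonparametric alternative concrete, here is a minimal sketch of a one-sample sign-flipping permutation test with a maximum-statistic correction for the familywise error rate. It is not the implementation evaluated in the study: it works voxelwise rather than clusterwise, uses the group mean instead of a t-statistic, and the function name and data layout (one list of voxel values per subject) are assumptions made for illustration.

```python
import random
from statistics import mean


def max_stat_sign_flip_test(data, n_perm=2000, seed=0):
    """Return familywise-error-corrected p-values, one per voxel.

    data: list of subjects, each a list of per-voxel contrast values.
    Under the null hypothesis of a symmetric distribution around zero,
    flipping the sign of a whole subject leaves the distribution unchanged.
    """
    rng = random.Random(seed)
    n_vox = len(data[0])

    def abs_voxel_means(rows):
        return [abs(mean(row[v] for row in rows)) for v in range(n_vox)]

    observed = abs_voxel_means(data)

    # Null distribution of the maximum statistic across all voxels.
    null_max = []
    for _ in range(n_perm):
        signs = [rng.choice((-1, 1)) for _ in data]
        flipped = [[s * x for x in row] for s, row in zip(signs, data)]
        null_max.append(max(abs_voxel_means(flipped)))

    # Corrected p-value: how often the null maximum reaches the observed value.
    return [(1 + sum(m >= obs for m in null_max)) / (n_perm + 1)
            for obs in observed]


if __name__ == "__main__":
    # Toy data: 12 subjects, 2 voxels; only the first voxel has an effect.
    toy = [[3.0 + 0.1 * i, (-1) ** i * 0.5] for i in range(12)]
    print(max_stat_sign_flip_test(toy))
```

Because the threshold comes from the permuted data themselves, no Gaussian shape has to be assumed for the spatial autocorrelation, which is the property the abstract credits for the nominal error rates.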
Recent reports of inflated false-positive rates (FPRs) in FMRI group analysis tools by Eklund and associates in 2016 have become a large topic within (and outside) neuroimaging. They concluded that existing parametric methods for determining statistically significant clusters had greatly inflated FPRs ("up to 70%," mainly due to the faulty assumption that the noise spatial autocorrelation function is Gaussian shaped and stationary), calling into question potentially "countless" previous results; in contrast, nonparametric methods, such as their approach, accurately reflected nominal 5% FPRs. They also stated that AFNI showed "particularly high" FPRs compared to other software, largely due to a bug in 3dClustSim. We comment on these points using their own results and figures and by repeating some of their simulations. Briefly, while parametric methods show some FPR inflation in those tests (and assumptions of Gaussian-shaped spatial smoothness also appear to be generally incorrect), their emphasis on reporting the single worst result from thousands of simulation cases greatly exaggerated the scale of the problem. Importantly, FPR statistics depend on "task" paradigm and voxelwise p value threshold; as such, we show how results of their study provide useful suggestions for FMRI study design and analysis, rather than simply a catastrophic downgrading of the field's earlier results. Regarding AFNI (which we maintain), 3dClustSim's bug effect was greatly overstated: their own results show that AFNI results were not "particularly" worse than others. We describe further updates in AFNI for characterizing spatial smoothness more appropriately (greatly reducing FPRs, although some remain >5%); in addition, we outline two newly implemented permutation/randomization-based approaches producing FPRs clustered much more tightly about 5% for voxelwise p ≤ 0.01.
Environmental DNA (eDNA) metabarcoding is increasingly used to study the present and past biodiversity. eDNA analyses often rely on amplification of very small quantities or degraded DNA. To avoid missing detection of taxa that are actually present (false negatives), multiple extractions and amplifications of the same samples are often performed. However, the level of replication needed for reliable estimates of the presence/absence patterns remains an unaddressed topic. Furthermore, degraded DNA and PCR/sequencing errors might produce false positives. We used simulations and empirical data to evaluate the level of replication required for accurate detection of targeted taxa in different contexts and to assess the performance of methods used to reduce the risk of false detections. Furthermore, we evaluated whether statistical approaches developed to estimate occupancy in the presence of observational errors can successfully estimate true prevalence, detection probability and false‐positive rates. Replications reduced the rate of false negatives; the optimal level of replication was strongly dependent on the detection probability of taxa. Occupancy models successfully estimated true prevalence, detection probability and false‐positive rates, but their performance increased with the number of replicates. At least eight PCR replicates should be performed if detection probability is not high, such as in ancient DNA studies. Multiple DNA extractions from the same sample yielded consistent results; in some cases, collecting multiple samples from the same locality allowed detecting more species. The optimal level of replication for accurate species detection strongly varies among studies and could be explicitly estimated to improve the reliability of results.
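The dependence of the required replication on detection probability can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that PCR replicates are independent, share one detection probability, and never produce false positives; these are simplifications that the study's simulations and occupancy models do not rely on. The function names and the example probability of 0.3 are hypothetical.

```python
def prob_detected(p_detect, n_replicates):
    """Probability that a taxon that is present is detected at least once."""
    return 1.0 - (1.0 - p_detect) ** n_replicates


def replicates_needed(p_detect, target=0.95, max_replicates=1000):
    """Smallest number of replicates whose detection probability reaches target."""
    if not 0.0 < p_detect <= 1.0:
        raise ValueError("p_detect must be in (0, 1]")
    for n in range(1, max_replicates + 1):
        if prob_detected(p_detect, n) >= target:
            return n
    raise ValueError("target not reachable within max_replicates")


if __name__ == "__main__":
    # Illustrative per-replicate detection probability (not from the study).
    p = 0.3
    print(prob_detected(p, 8))
    print(replicates_needed(p, target=0.95))
```

With a per-replicate detection probability this low, eight replicates still fall slightly short of a 95% chance of detection under these assumptions, which is consistent with the abstract's wording of "at least eight".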
Objective: False positive reduction is one of the most crucial components in an automated pulmonary nodule detection system, which plays an important role in lung cancer diagnosis and early treatment. The objective of this paper is to effectively address the challenges in this task and therefore to accurately discriminate the true nodules from a large number of candidates. Methods: We propose a novel method employing three-dimensional (3-D) convolutional neural networks (CNNs) for false positive reduction in automated pulmonary nodule detection from volumetric computed tomography (CT) scans. Compared with its 2-D counterparts, the 3-D CNNs can encode richer spatial information and extract more representative features via their hierarchical architecture trained with 3-D samples. More importantly, we further propose a simple yet effective strategy to encode multilevel contextual information to meet the challenges coming with the large variations and hard mimics of pulmonary nodules. Results: The proposed framework has been extensively validated in the LUNA16 challenge held in conjunction with ISBI 2016, where we achieved the highest competition performance metric (CPM) score in the false positive reduction track. Conclusion: Experimental results demonstrated the importance and effectiveness of integrating multilevel contextual information into 3-D CNN framework for automated pulmonary nodule detection in volumetric CT data. Significance: While our method is tailored for pulmonary nodule detection, the proposed framework is general and can be easily extended to many other 3-D object detection tasks from volumetric medical images, where the targeting objects have large variations and are accompanied by a number of hard mimics.
We set out to investigate the interference factors that led to false-positive novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) IgM detection results using gold immunochromatography assay (GICA) and enzyme-linked immunosorbent assay (ELISA) and the corresponding solutions. GICA and ELISA were used to detect SARS-CoV-2 IgM in 86 serum samples, including 5 influenza A virus (Flu A) IgM-positive sera, 5 influenza B virus (Flu B) IgM-positive sera, 5 IgM-positive sera, 5 IgM-positive sera, 6 sera of HIV infection patients, 36 rheumatoid factor IgM (RF-IgM)-positive sera, 5 sera from hypertensive patients, 5 sera from diabetes mellitus patients, and 14 sera from novel coronavirus infection disease 19 (COVID-19) patients. The interference factors causing false-positive reactivity with the two methods were analyzed, and the urea dissociation test was employed to dissociate the SARS-CoV-2 IgM-positive serum using the best dissociation concentration. The two methods detected positive SARS-CoV-2 IgM in 22 mid-to-high-level-RF-IgM-positive sera and 14 sera from COVID-19 patients; the other 50 sera were negative. At a urea dissociation concentration of 6 mol/liter, SARS-CoV-2 IgM results were positive in 1 mid-to-high-level-RF-IgM-positive serum and in 14 COVID-19 patient sera detected using GICA. At a urea dissociation concentration of 4 mol/liter and with affinity index (AI) levels lower than 0.371 set to negative, SARS-CoV-2 IgM results were positive in 3 mid-to-high-level-RF-IgM-positive sera and in 14 COVID-19 patient sera detected using ELISA. The presence of RF-IgM at mid-to-high levels could lead to false-positive reactivity of SARS-CoV-2 IgM detected using GICA and ELISA, and urea dissociation tests would be helpful in reducing SARS-CoV-2 IgM false-positive results.
Background
False‐positive screening results are an inevitable and commonly recognized disadvantage of mammographic screening. This study estimated the cumulative probability of experiencing a first false‐positive screening result in women attending 10 biennial screening rounds in BreastScreen Norway, which targets women aged 50 to 69 years.
Methods
This retrospective cohort study analyzed screening outcomes from 421,545 women who underwent 1,894,523 screening examinations during 1995‐2019. Empirical data were used to calculate the cumulative risk of experiencing a first false‐positive screening result and a first false‐positive screening result that involved an invasive procedure over 10 screening rounds. Logistic regression was used to evaluate the effect of adjusting for irregular attendance, age at screening, and number of screens attended.
Results
The cumulative risk of experiencing a first false‐positive screening result was 18.04% (95% confidence interval [CI], 18.00%‐18.07%). It was 5.01% (95% CI, 5.01%‐5.02%) for experiencing a false‐positive screening result that involved an invasive procedure. Adjusting for irregular attendance or age at screening did not appreciably affect these estimates. After adjustments for the number of screens attended, the cumulative risk of a first false‐positive screening result was 18.28% (95% CI, 18.24%‐18.32%), and the risk of a false‐positive screening result including an invasive procedure was 5.11% (95% CI, 5.11%‐5.22%). This suggested that there was minimal bias from dependent censoring.
Conclusions
Nearly 1 in 5 women will experience a false‐positive screening result if they attend 10 biennial screening rounds in BreastScreen Norway. One in 20 will experience a false‐positive screening result with an invasive procedure.
Lay Summary
A false‐positive screening result occurs when a woman attending mammographic screening is called back for further assessment because of suspicious findings, but the assessment does not detect breast cancer.
Further assessment includes additional imaging. Usually, it involves ultrasound, and sometimes, it involves a biopsy.
This study has evaluated the chance of experiencing a false‐positive screening result among women attending 10 screening examinations over 20 years in BreastScreen Norway.
Nearly 1 in 5 women will experience a false‐positive screening result over 10 screening rounds. One in 20 women will experience a false‐positive screening result involving a biopsy.
Using data from the population‐based breast screening program in Norway, this study finds that nearly 1 in 5 women attending 10 biennial screening rounds will experience a false‐positive screening result. One in 20 women will experience a false‐positive screening result that involves an invasive procedure.
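For intuition about how a small per-round risk accumulates to roughly 1 in 5 over 10 rounds, the following sketch uses a deliberately simplified model: a constant false‐positive probability per round and independence between rounds. The study itself used empirical screening data rather than this model, so the function names and the implied per-round figure are illustrative assumptions only.

```python
def cumulative_risk(per_round_risk, rounds):
    """Probability of at least one false positive over the given rounds."""
    return 1.0 - (1.0 - per_round_risk) ** rounds


def implied_per_round_risk(cumulative, rounds):
    """Constant per-round risk that would produce the given cumulative risk."""
    return 1.0 - (1.0 - cumulative) ** (1.0 / rounds)


if __name__ == "__main__":
    # Cumulative risk reported in the abstract for 10 biennial rounds.
    reported = 0.1804
    per_round = implied_per_round_risk(reported, rounds=10)
    print(per_round)                        # roughly 2% per round
    print(cumulative_risk(per_round, 10))   # recovers the reported figure
```

Under these assumptions, a per-round risk of about 2% is enough to produce the reported cumulative risk, which is why cumulative figures are more informative for women deciding about long-term participation than single-round rates.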
Women tend to make a decision about participation in breast cancer screening and adhere to this for future invitations. Therefore, our study aimed to provide high‐quality information on cumulative risks of false‐positive (FP) recall and screen‐detected breast cancer over multiple screening examinations. Individual Dutch screening registry data (2005‐2018) were gathered on subsequent screening examinations of 92 902 women age 49 to 51 years in 2005. Survival analyses were used to calculate cumulative risks of a FP and a true‐positive (TP) result after seven examinations. Data from 66 472 women age 58 to 59 years were used to extrapolate to 11 examinations. Participation, detection and additional FP rates were calculated for women who previously received FP results compared to women with true negative (TN) results. After 7 examinations, the cumulative risk of a TP result was 3.7% and the cumulative risk of a FP result was 9.1%. After 11 examinations, this increased to 7.1% and 13.5%, respectively. Following a FP result, participation was lower (71%‐81%) than following a TN result (>90%). In women with a FP result, more TP results (factor 1.59 [95% CI: 1.44‐1.72]), more interval cancers (factor 1.66 [95% CI: 1.41‐1.91]) and more FP results (factor 1.96 [95% CI: 1.87‐2.05]) were found than in women with TN results. In conclusion, due to a low recall rate in the Netherlands, the cumulative risk of a FP recall is relatively low, while the cumulative risk of a TP result is comparable. Breast cancer diagnoses and FP results were more common in women with FP results than in women with TN results, while participation was lower.
What's new?
Population‐based breast cancer screening programmes reduce breast cancer mortality. However, presenting the potential risks over multiple screening examinations is crucial to enable women to make an informed choice about participation. In this breast cancer screening nationwide registry study using 13 years of follow‐up data from the Netherlands, the cumulative risk of a false‐positive recall was relatively low, while the cumulative risk of a true‐positive result was comparable to that in other European countries. The rates of screen‐detected and interval cancers and false‐positives were higher in women who had received false‐positive results than in women with true‐negative results, while their participation was lower.
Objectives: Procedures for controlling the false positive rate when performing many hypothesis tests are commonplace in health and medical studies. Such procedures, most notably the Bonferroni adjustment, suffer from the problem that error rate control cannot be localized to individual tests, and that these procedures do not distinguish between exploratory and/or data-driven testing vs. hypothesis-driven testing. Instead, procedures derived from limiting false discovery rates may be a more appealing method to control error rates in multiple tests. Study Design and Setting: Controlling the false positive rate can lead to philosophical inconsistencies that can negatively impact the practice of reporting statistically significant findings. We demonstrate that the false discovery rate approach can overcome these inconsistencies and illustrate its benefit through an application to two recent health studies. Results: The false discovery rate approach is more powerful than methods like the Bonferroni procedure that control false positive rates. Controlling the false discovery rate in a study that arguably consisted of scientifically driven hypotheses found nearly as many significant results as without any adjustment, whereas the Bonferroni procedure found no significant results. Conclusion: Although still unfamiliar to many health researchers, the use of false discovery rate control in the context of multiple testing can provide a solid basis for drawing conclusions about statistical significance.
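The contrast between the two kinds of error control can be sketched in a few lines. The abstract does not say which false discovery rate procedure was applied, so the Benjamini-Hochberg step-up procedure is used here as an assumption; the p-values in the example are invented for illustration and do not come from the two health studies.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject tests whose p-value is at most alpha divided by the number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])

    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from ten tests.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(sum(bonferroni(pvals)), "rejections with Bonferroni")
    print(sum(benjamini_hochberg(pvals)), "rejections with Benjamini-Hochberg")
```

The step-up rule compares each ordered p-value with a threshold that grows with its rank, so it can never reject fewer hypotheses than Bonferroni at the same alpha; this is the sense in which the abstract calls the approach more powerful.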