It has long been argued that we need to consider much more than an observed point estimate and a p-value to understand statistical results. One of the most persistent misconceptions about p-values is that they are necessarily calculated assuming a null hypothesis of no effect is true. Instead, p-values can and should be calculated for multiple hypothesized values for the effect size. For example, a p-value function allows us to visualize results continuously by examining how the p-value varies as we move across possible effect sizes. For more focused discussions, a 95% confidence interval shows the subset of possible effect sizes that have p-values larger than 0.05 as calculated from the same data and the same background statistical assumptions. In this sense a confidence interval can be taken as showing the effect sizes that are most compatible with the data, given the assumptions, and thus may be better termed a compatibility interval. The question that should then be asked is whether any or all of the effect sizes within the interval are substantial enough to be of practical importance.
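As a rough illustration of the p-value function described above, the following sketch computes two-sided p-values over a grid of hypothesized effect sizes and the matching 95% compatibility interval. It assumes a normally distributed estimator with known standard error; the estimate of 2.0, the standard error of 1.0, and all function names are hypothetical and not taken from the paper.

```python
from statistics import NormalDist

def p_value(estimate, se, hypothesized):
    """Two-sided p-value for one hypothesized effect size (normal approximation)."""
    z = (estimate - hypothesized) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def compatibility_interval(estimate, se, level=0.95):
    """Effect sizes whose two-sided p-value is larger than 1 - level."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * se, estimate + z_crit * se

# Hypothetical study result: point estimate and standard error (assumed values).
estimate, se = 2.0, 1.0

# p-value function evaluated on a grid of hypothesized effect sizes from -1.0 to 5.0.
grid = [x / 2 for x in range(-2, 11)]
curve = [(h, p_value(estimate, se, h)) for h in grid]

low, high = compatibility_interval(estimate, se)

for h, p in curve:
    marker = "inside" if low <= h <= high else "outside"
    print(f"hypothesized effect {h:5.1f}: p = {p:.3f} ({marker} 95% interval)")
```

Under these assumptions the p-value is largest at the point estimate and falls to 0.05 at the two interval limits, so the interval is simply the part of the curve above 0.05.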
The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
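The 'one third' figure for conflicting studies can be checked with a short calculation: if two independent studies each have power 0.8, the chance that exactly one of them is significant is 2 × 0.8 × 0.2 = 0.32. The sketch below computes this and cross-checks it with a small simulation. The simulation assumes two-sided z-tests at alpha = 0.05 on a standardized test statistic; the function names, the seed, and the number of simulated pairs are illustrative choices, not part of the paper.

```python
import random
from statistics import NormalDist

def prob_conflicting(power):
    """Probability that exactly one of two independent studies is significant."""
    return 2 * power * (1 - power)

def simulate_conflicts(power, n_pairs=20000, alpha=0.05, seed=1):
    """Simulate pairs of z-tests and return the share of conflicting pairs."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Shift of the test statistic that gives approximately the requested power
    # (the contribution of the opposite tail is negligible here).
    shift = z_crit + nd.inv_cdf(power)
    rng = random.Random(seed)
    conflicts = 0
    for _ in range(n_pairs):
        first = abs(rng.gauss(shift, 1.0)) > z_crit
        second = abs(rng.gauss(shift, 1.0)) > z_crit
        conflicts += first != second
    return conflicts / n_pairs

analytic = prob_conflicting(0.8)
simulated = simulate_conflicts(0.8)
print(f"analytic: {analytic:.2f}, simulated: {simulated:.2f}")
```

The same function shows that the conflict probability peaks at 0.5 when power is 50% and only approaches zero as power approaches 100%.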
As a consequence of climate warming, species usually shift their distribution towards higher latitudes or altitudes. Yet, it is unclear how different taxonomic groups may respond to climate warming over larger altitudinal ranges. Here, we used data from the national biodiversity monitoring program of Switzerland, collected over an altitudinal range of 2500 m. Within the short period of eight years (2003-2010), we found significant shifts in communities of vascular plants, butterflies and birds. At low altitudes, communities of all species groups changed towards warm-dwelling species, corresponding to an average uphill shift of 8 m, 38 m and 42 m in plant, butterfly and bird communities, respectively. However, rates of community changes decreased with altitude in plants and butterflies, while bird communities changed towards warm-dwelling species at all altitudes. We found no decrease in community variation with respect to temperature niches of species, suggesting that climate warming has not led to more homogenous communities. The different community changes depending on altitude could not be explained by different changes of air temperatures, since during the 16 years between 1995 and 2010, summer temperatures in Switzerland rose by about 0.07°C per year at all altitudes. We discuss that land-use changes or increased disturbances may have prevented alpine plant and butterfly communities from changing towards warm-dwelling species. However, the findings are also consistent with the hypothesis that unlike birds, many alpine plant species in a warming climate could find suitable habitats within just a few metres, due to the highly varied surface of alpine landscapes. Our results may thus support the idea that for plants and butterflies and on a short temporal scale, alpine landscapes are safer places than lowlands in a warming world.
The marsh frog (Pelophylax ridibundus s.l.) is the number one amphibian invader in Western Europe. In Switzerland, marsh frogs were introduced in the 1950–1960s and progressively colonized most of the northern parts of the country. We investigated this invasion using molecular tools. We mapped the cryptic presence of three monophyletic mitochondrial lineages (P. ridibundus, Pelophylax kurtmuelleri, and Pelophylax cf. bedriagae from southeastern Europe) consistent with registered importations by a local frog-leg industry. High nuclear diversity supports that invasive frogs probably originated from genetically rich import batches, and patterns of population differentiation confirm that multiple independent introduction sites were involved. Moreover, several lines of evidence suggest occasional hybridization with local hybridogenetic water frogs. This invasion emphasizes the issues of frequent amphibian releases and translocations at the international and regional scale for commercial and recreational purposes, and stresses the need for more adequate legislation, control, and information for the general public. Given the parallel invasion by exotic pool frogs (i.e. the Italian Pelophylax bergeri has replaced the local Pelophylax lessonae), the situation of water frogs in Switzerland is critical. The water frog complex provides an alarming symbol of the anthropogenic mark left on wildlife diversity and distributions.
Horses are gaining importance in European nature conservation management, for which usually so-called primitive breeds are favored due to their claimed robustness. An increasingly popular breed, the Konik horse, is often said to be the direct descendant of the alleged European wild horse, the Tarpan. However, both the direct descent of the Konik from European wild horses and the existence of the Tarpan as a wild species are highly debated. In this review, we scrutinized both contemporary research and historical sources and suggest that the Tarpan and the Konik as its direct descendant are manmade myths that hinder effective conservation management. We did not find evidence that the Tarpan was a wild horse rather than a feral horse. We did not find any evidence either for a closer connection between the Konik and any extinct wild horse than between other domestic breeds and wild horses. We discuss three perspectives on why the myth has become widely accepted and survived to this day: a historical-political, a biological-ecological, and an emotional perspective. It seems that the origin story of the Konik and its connection to the Tarpan was shaped by personal and political interests, including nationalistic ideas. These as well as general human emotions towards horses have influenced researchers and laypeople to keep the myth alive, which may have negatively affected contemporary nature conservation. Indeed, today’s Koniks originated from a small founder population of only six male lines that were selected according to their phenotypic traits, with the aim to rebreed the ‘wild Tarpan’. Strict breeding practices have led to high inbreeding levels in recent Konik populations, which may undermine nature conservation purposes. Therefore, we suggest that mythologized origin stories should not be an argument for selecting breeds of grazers for nature conservation.
Birds often have a peak of singing activity at dawn, and the timing of dawn song is species-specific. However, the start of singing at dawn may also depend on environmental factors. We investigated the effects of different environmental variables on the start of dawn singing in six common songbird species in the woodlands of the Swiss National Park. Moon phase, aspect, temperature and road noise had the most consistent effects across species: dawn singing started earlier after brighter and warmer nights, on more east-exposed slopes, and in areas with more road noise. On average, birds started to sing 2.8 min earlier in areas with high road noise level compared to areas without road noise, and 4.7 min earlier in east-exposed slopes compared to west-exposed slopes. Further, birds started to sing on average 5.0 min earlier after full moon compared to new moon nights, 1.2 min earlier after warmer compared to colder nights, and 2.5 min earlier at 2200 m than at 1500 m a.s.l. The effects of date were more species-specific: Alpine Tits started to sing on average 4.9 min later at the end compared to the beginning of the study period, whilst Song Thrushes started to sing 9.0 min earlier. Our findings are in line with the results of previous studies on the effects of road noise, nocturnal light, and partly on temperature. Our study shows that variation in environmental variables may influence the start of dawn singing in different ways, and that anthropogenic factors like road noise can affect bird behaviour even in a highly protected area.
A paradigm shift away from null hypothesis significance testing seems in progress. Based on simulations, we illustrate some of the underlying motivations. First, p‐values vary strongly from study to study, hence dichotomous inference using significance thresholds is usually unjustified. Second, ‘statistically significant’ results have overestimated effect sizes, a bias declining with increasing statistical power. Third, ‘statistically non‐significant’ results have underestimated effect sizes, and this bias gets stronger with higher statistical power. Fourth, the tested statistical hypotheses usually lack biological justification and are often uninformative. Despite these problems, a screen of 48 papers from the 2020 volume of the Journal of Evolutionary Biology exemplifies that significance testing is still used almost universally in evolutionary biology. All screened studies tested default null hypotheses of zero effect with the default significance threshold of p = 0.05, and none presented a pre‐specified alternative hypothesis, a pre‐study power calculation, or the probability of ‘false negatives’ (beta error rate). The results sections of the papers presented 49 significance tests on average (median 23, range 0–390). Of 41 studies that contained verbal descriptions of a ‘statistically non‐significant’ result, 26 (63%) falsely claimed the absence of an effect. We conclude that studies in ecology and evolutionary biology are mostly exploratory and descriptive. We should thus shift from claiming to ‘test’ specific hypotheses statistically to describing and discussing many hypotheses (possible true effect sizes) that are most compatible with our data, given our statistical model. We already have the means for doing so, because we routinely present compatibility (‘confidence’) intervals covering these hypotheses.
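The second and third points above can be illustrated with a minimal simulation. This is a sketch under stated assumptions and not the simulation used in the paper: each study yields a normally distributed estimate of a true effect of 1.0 with standard error 1.0 (which implies low power), and significance is judged by a two-sided z-test at 0.05. All names, values, and the seed are hypothetical.

```python
import random
from statistics import NormalDist, mean

def simulate_estimates(true_effect, se, n_studies, seed=1):
    """Draw one effect size estimate per simulated study."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, se) for _ in range(n_studies)]

def split_by_significance(estimates, se, alpha=0.05):
    """Separate estimates into 'significant' and 'non-significant' ones."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = [e for e in estimates if abs(e / se) > z_crit]
    nonsignificant = [e for e in estimates if abs(e / se) <= z_crit]
    return significant, nonsignificant

# Assumed scenario: true effect equals one standard error, so power is low.
true_effect, se = 1.0, 1.0
estimates = simulate_estimates(true_effect, se, n_studies=20000)
significant, nonsignificant = split_by_significance(estimates, se)

mean_all = mean(estimates)
mean_sig = mean(significant)
mean_nonsig = mean(nonsignificant)
power = len(significant) / len(estimates)

print(f"empirical power:           {power:.2f}")
print(f"mean of all estimates:     {mean_all:.2f}")
print(f"mean of significant:       {mean_sig:.2f}")
print(f"mean of non-significant:   {mean_nonsig:.2f}")
```

In this setup the average over all studies stays close to the true effect, while the average of the significant subset lies above it and the average of the non-significant subset lies below it. Changing `se` changes the power and hence the size of both biases.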
Inference in ecology and evolution is still mostly based on significance testing. We summarize problems with this approach and argue that it should be replaced by the adequate description of effect size estimates.
The browsing of wild ungulates can have profound effects on the structure and composition of forests. In the Swiss National Park, the density of wild ungulates, including red deer (Cervus elaphus), ibex (Capra ibex), and chamois (Rupicapra rupicapra), is exceptionally high due to strict protection and the absence of large predators. We examined count data of larch (Larix decidua), cembra pine (Pinus cembra), spruce (Picea abies), upright mountain pine (Pinus mugo subsp. uncinata), and mountain ash (Sorbus aucuparia) of four sampling years between 1991 and 2021, and modelled how topographic and location factors affected the probability of browsing on saplings of larch, cembra pine, and spruce. Despite the high density of wild ungulates, the number of saplings and young trees has increased over the past 30 years. The probability of browsing on saplings was highest for larch at a height of 10–40 cm and increased with increasing elevation. In our study area, open grasslands are mainly located above the tree line, which might explain the positive correlation between elevation and the probability of browsing. Further, the probability of browsing was related to exposition and slope, diversity of tree species, and disturbance by humans. It appears that in the investigated part of the Swiss National Park, the potential of the forest to regenerate has increased despite the high densities of wild ungulates.