Understanding and measuring sentence acceptability is of fundamental importance for linguists, but although many measures for doing so have been developed, relatively little is known about some of their psychometric properties. In this paper we evaluate within- and between-participant test-retest reliability on a wide range of measures of sentence acceptability. Doing so allows us to estimate how much of the variability within each measure is due to factors including participant-level individual differences, sample size, response styles, and item effects. The measures examined include Likert scales, two versions of forced-choice judgments, magnitude estimation, and a novel measure based on Thurstonian approaches in psychophysics. We reproduce previous findings of high between-participant reliability within and across measures, and extend these results to a generally high reliability within individual items and individual people. Our results indicate that Likert scales and the Thurstonian approach produce the most stable and reliable acceptability measures and do so with smaller sample sizes than the other measures. Moreover, their agreement with each other suggests that the limitation of a discrete Likert scale does not impose a significant degree of structure on the resulting acceptability judgments.
It is well known that people attempting to perform hypothesis testing show a positive test bias, preferring to request evidence that is consistent (rather than inconsistent) with their current hypothesis. Rather than viewing this as an irrational bias, information-theoretic accounts of hypothesis testing have argued that selecting tests likely to produce positive evidence is adaptive when most hypotheses are small (i.e., true of few entities in the world) and respond positively to very few queries. These accounts make the prediction that as hypotheses get larger, the relative utility of positive evidence will decrease; when hypotheses are large enough, negative evidence will become more useful than positive evidence. We test whether people are sensitive to this change in utility with an experiment inspired by the game "Battleship," in which people attempt to discover the correct arrangement of ships by asking for positive or negative evidence. As predicted, as hypotheses become larger, people request less positive evidence, and when hypotheses are large, requests for negative evidence are more likely than requests for positive evidence. Implications for the nature of the positive test bias are discussed.
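
The predicted crossover can be illustrated with a deliberately simplified sketch. It is not the model used in the paper: it assumes that a hypothesis covers a fixed fraction of a grid (the hypothetical parameter `coverage`), that each cell is equally likely to be covered, and that the value of a piece of evidence is its surprisal in bits. The function names `surprisal_bits` and `evidence_value` are illustrative only.

```python
import math


def surprisal_bits(probability):
    """Surprisal, in bits, of an outcome that occurs with the given probability."""
    return -math.log2(probability)


def evidence_value(coverage):
    """Return (positive, negative) surprisal for a hypothesis of the given size.

    Assumption (not from the paper): `coverage` is the fraction of cells the
    true hypothesis occupies, with 0 < coverage < 1. Being shown a covered
    cell is an outcome of probability `coverage`; being shown an uncovered
    cell is an outcome of probability `1 - coverage`.
    """
    positive = surprisal_bits(coverage)
    negative = surprisal_bits(1.0 - coverage)
    return positive, negative


if __name__ == "__main__":
    # Sweep from small (sparse) to large hypotheses.
    for coverage in (0.1, 0.3, 0.5, 0.7, 0.9):
        positive, negative = evidence_value(coverage)
        if positive > negative:
            preferred = "positive"
        elif negative > positive:
            preferred = "negative"
        else:
            preferred = "either"
        print(f"coverage={coverage:.1f}  positive={positive:.2f} bits  "
              f"negative={negative:.2f} bits  more informative: {preferred}")
```

Under these assumptions, positive evidence carries more bits while the hypothesis covers less than half of the grid, the two are equal at exactly half, and negative evidence carries more bits beyond that point, which mirrors the qualitative prediction described above.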