Contemporary Guidance for Stated Preference Studies
Johnston, Robert J.; Boyle, Kevin J.; Adamowicz, Wiktor (Vic); et al.
Journal of the Association of Environmental and Resource Economists, 06/2017, Volume 4, Number 2
Journal Article · Peer-reviewed · Open access
This article proposes contemporary best-practice recommendations for stated preference (SP) studies used to inform decision making, grounded in the accumulated body of peer-reviewed literature. These recommendations consider the use of SP methods to estimate both use and non-use (passive-use) values, and cover the broad SP domain, including contingent valuation and discrete choice experiments. We focus on applications to public goods in the context of the environment and human health but also consider ways in which the proposed recommendations might apply to other common areas of application. The recommendations recognize that SP results may be used and reused (benefit transfers) by governmental agencies and nongovernmental organizations, and that all such applications must be considered. The intended result is a set of guidelines for SP studies that is more comprehensive than that of the original National Oceanic and Atmospheric Administration (NOAA) Blue Ribbon Panel on contingent valuation, is more germane to contemporary applications, and reflects the two decades of research since that time. We also distinguish between practices for which accumulated research is sufficient to support recommendations and those for which greater uncertainty remains. The goal of this article is to raise the quality of SP studies used to support decision making and promote research that will further enhance the practice of these studies worldwide.
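The abstract describes methods only at a high level. Purely as an illustration of one workhorse SP model (not code from the article), the sketch below simulates a single-bounded referendum contingent valuation question and recovers mean willingness to pay (WTP) from a logit fit; under a linear utility model, mean/median WTP = -alpha/beta (Hanemann's formula). The bids, sample size, and latent WTP distribution are all invented.

```python
# Illustrative sketch: single-bounded referendum contingent valuation.
# Respondents answer yes/no to a randomly assigned bid; under a linear
# utility model, mean/median WTP = -alpha/beta. All numbers are made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
bid = rng.choice([5, 10, 20, 40, 80], size=n)   # hypothetical bid amounts
true_wtp = rng.normal(30, 15, size=n)           # latent willingness to pay
yes = (true_wtp > bid).astype(int)              # accept if WTP exceeds bid

X = sm.add_constant(bid.astype(float))          # columns: [1, bid]
fit = sm.Logit(yes, X).fit(disp=False)
alpha, beta = fit.params                        # bid coefficient is negative

print(f"estimated mean/median WTP: {-alpha / beta:.1f}")  # should be near 30
```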
There is considerable controversy in the literature about how to measure the knowledge of the general public. Much of the past work concerns political knowledge and has focused on two issues: whether don't know (DK) responses should be encouraged or discouraged and whether the items should take a multiple-choice or open-ended format. Similar questions have been raised about the best way to measure the public's knowledge of basic science facts, which the National Science Board (NSB) has monitored for more than forty years. The NSB battery consists of eleven items, ten of them true-false items. The introduction to the items encourages DK responses. We carried out an experiment that compared true-false and forced-choice versions of the items; the experiment also compared versions of the questions that discouraged DK responses (thereby increasing guessing), encouraged them (decreasing guessing), or simply omitted a DK option. With a few items, the forced-choice format was harder than the true-false (i.e., fewer respondents answered them correctly), but with others that format was easier, with no overall pattern. Similarly, the reliability, unidimensionality, and validity of scale scores did not differ by question format. By contrast, encouraging DKs improved the reliability, unidimensionality, and validity of the battery relative to the other two DK conditions. We present a model that shows that discouraging DKs (thereby encouraging guesses) improves the measurement of knowledge only when it increases educated guesses more than it increases blind guesses. That apparently is not true for the science knowledge items we examine.
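The model itself is not reproduced in the abstract, so the simulation below is only a minimal sketch of its core intuition, with invented parameters: scoring a DK at the chance expectation adds no guessing noise, forced blind guesses add pure noise, and forced guessing helps measurement only to the extent that guessing success rises with partial knowledge.

```python
# Minimal simulation (my assumptions, not the authors' model) of a
# ten-item true-false battery under three response regimes.
import numpy as np

rng = np.random.default_rng(0)
n_resp, n_items = 2000, 10
knowledge = rng.uniform(0, 1, n_resp)                       # latent knowledge
knows = rng.uniform(0, 1, (n_resp, n_items)) < knowledge[:, None]

def forced_score(p_guess):
    # Known items are answered correctly; the rest are guessed, with
    # per-respondent success probability p_guess.
    right = rng.uniform(0, 1, (n_resp, n_items)) < p_guess[:, None]
    return np.where(knows, True, right).sum(axis=1)

dk = knows.sum(axis=1) + 0.5 * (~knows).sum(axis=1)         # DK scored at chance
blind = forced_score(np.full(n_resp, 0.5))                  # blind guesses
educated = forced_score(0.5 + 0.4 * knowledge)              # partial knowledge helps

for label, s in [("DK encouraged", dk), ("blind guessing", blind),
                 ("educated guessing", educated)]:
    print(f"{label:>18}: corr with knowledge = {np.corrcoef(s, knowledge)[0, 1]:.3f}")
```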
Survey researchers have been investigating alternative approaches to reduce data collection costs while mitigating the risk of nonresponse bias or to produce more accurate estimates within the same budget. Responsive or adaptive design has been suggested as one means for doing this. Falling survey response rates and the need to find effective ways of implementing responsive design have focused attention on the relationship between response rates and nonresponse bias. In our article, we re-examine the data compiled by Groves and Peytcheva (2008) in their influential article and show there is an important between-study component of variance in addition to the within-study variance highlighted in the original analysis. We also show that theory implies that raising response rates can help reduce the nonresponse bias on average across the estimates within a study. We then propose a typology of response propensity models that helps explain the empirical findings, including the relatively weak relationship between nonresponse rates and nonresponse bias. Using these results, we explore when responsive design tools such as switching modes, giving monetary incentives, and increasing the level of effort are likely to be effective. We conclude with some comments on the use of responsive design and weighting to control nonresponse bias.
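The typology is not spelled out in the abstract, but the theory it draws on rests on a standard approximation: the bias of the respondent mean is roughly cov(p, y) / mean(p), where p is the response propensity. The sketch below (all numbers invented) simulates a survey variable and propensities driven by a common cause and shows the bias shrinking as added field effort raises response rates.

```python
# Sketch of the standard nonresponse-bias approximation,
# bias(ybar_r) ~ cov(p, y) / mean(p); the data-generating model is mine.
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
z = rng.normal(size=N)                      # common cause of y and propensity
y = 50 + 10 * z + rng.normal(0, 10, N)      # survey variable
base_logit = -1.5 + 0.8 * z                 # propensity depends on z

for effort in [0.0, 1.0, 2.0]:              # extra effort shifts all logits up
    p = 1 / (1 + np.exp(-(base_logit + effort)))
    responded = rng.uniform(size=N) < p
    bias = y[responded].mean() - y.mean()
    approx = np.cov(p, y)[0, 1] / p.mean()
    print(f"response rate {responded.mean():.2f}: "
          f"realized bias {bias:+.2f}, cov(p, y)/mean(p) {approx:+.2f}")
```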
All survey items reflect some conceptual framework that might or might not be accepted by subgroups with certain personal identities. For example, respondents with certain religious identities may reject the scientific framework of questions about the development of life and origins of the universe, since there are competing truth claims between religion and science on these topics. Since the late 1970s, the National Science Foundation has sponsored a series of surveys to gauge public attitudes toward and understanding of science and technology. Items that simultaneously measure knowledge and acceptance of two concepts (evolution and the "big bang") appear to raise measurement problems for a specific subgroup that rejects the premise of the items. This paper focuses on alternative versions of the survey questions that attempt to remove the effect of religious belief on answers to these items. We investigate two approaches for removing this confounding of knowledge and acceptance. One approach is to ask what scientists think rather than what the respondents believe; the other is to remove "hot-button" features of the question likely to trigger conflicts between the religious and scientific views. We also illustrate how psychometric methods (such as confirmatory factor analysis) can help sort out which version of the questions produces the most valid answers.
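No model specification is given in the abstract. As a simplified stand-in for the confirmatory analysis, the sketch below runs an exploratory factor analysis (scikit-learn) on simulated data in which "belief"-framed evolution and big-bang items load on a religious-belief factor as well as a knowledge factor, while neutrally framed items load on knowledge alone; the loadings and item structure are assumptions.

```python
# Simplified illustration: exploratory factor analysis as a stand-in for
# the CFA described. Item wording structure and loadings are invented.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
n = 3000
knowledge = rng.normal(size=n)                     # latent science knowledge
belief = rng.normal(size=n)                        # latent religious belief
noise = lambda: rng.normal(0, 0.6, size=n)

items = np.column_stack([
    0.8 * knowledge + noise(),                     # neutral knowledge item
    0.8 * knowledge + noise(),                     # neutral knowledge item
    0.5 * knowledge - 0.6 * belief + noise(),      # "belief"-framed evolution item
    0.5 * knowledge - 0.6 * belief + noise(),      # "belief"-framed big-bang item
])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(np.round(fa.components_.T, 2))               # loadings: rows = items
```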
The paper reviews the growing literature on responsive and adaptive designs for surveys. These designs encompass various methods for managing data collection, including front-loading potentially difficult cases, tailoring data collection strategies to different subgroups, prioritizing effort according to estimated response propensities, imposing stop rules for ending data collection, monitoring key survey estimates throughout the field period, using two-phase or multiphase sampling for following up non-respondents, and calculating indicators of non-response bias (such as the R-indicator) other than response rates to monitor and guide fieldwork. We give particular attention to efforts to evaluate these strategies experimentally or via simulations. Although the field seems to have embraced these new tools, most of the evaluation studies suggest they produce marginal reductions in cost and non-response bias. It is clearly difficult to lower survey costs without reducing some aspect of survey quality. Other issues limiting the effectiveness of these designs include weakly predictive auxiliary variables, ineffective interventions, and slippage in the implementation of interventions in the field. These problems are not, however, unique to responsive or adaptive design. We give recommendations for improving such designs and for improving the management of data collection efforts in the current difficult environment for surveys.
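Of the indicators listed, the R-indicator has a simple closed form: R = 1 - 2 * SD(p-hat), where the standard deviation is taken over estimated response propensities; values near 1 mean the realized response is well balanced on the auxiliary variables. A minimal sketch, with an invented auxiliary model:

```python
# Sketch of the R-indicator, R = 1 - 2 * SD(estimated propensities).
# The auxiliary variables and propensity model below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 3))                          # frame/auxiliary variables
true_p = 1 / (1 + np.exp(-(-0.5 + X @ np.array([0.6, -0.3, 0.2]))))
responded = rng.uniform(size=n) < true_p

p_hat = LogisticRegression().fit(X, responded).predict_proba(X)[:, 1]
r_indicator = 1 - 2 * p_hat.std()
print(f"response rate {responded.mean():.2f}, R-indicator {r_indicator:.3f}")
```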
Using reinterview data from the PATH Reliability and Validity (PATH-RV) study, we examine the characteristics of questions and respondents that predict the reliability of the answers. In the PATH-RV study, 524 respondents completed an interview twice, five to twenty-four days apart. We coded a number of question characteristics and used them to predict the gross discrepancy rates (GDRs) and kappas for each question. We also investigated respondent characteristics associated with reliability. Finally, we fitted cross-classified models that simultaneously examined a range of respondent and question characteristics. Although the different models yielded somewhat different conclusions, in general, factual questions (especially demographic questions), shorter questions, questions that did not use scales, those with fewer response options, and those that asked about a noncentral topic produced more reliable answers than attitudinal questions, longer questions, questions using ordinal scales, those with more response options, and those asking about a central topic. One surprising finding was that items raising potential social desirability concerns yielded more reliable answers than items that did not raise such concerns. The respondent-level models and cross-classified models indicated that five adult respondent characteristics were associated with giving the same answer in both interviews: education, the Big Five trait of conscientiousness, tobacco use, sex, and income. Hispanic youths and non-Hispanic black youths were less likely to give the same answer in both interviews. The cross-classified model also found that more words were associated with less reliable answers. The results are mostly consistent with earlier findings but are nonetheless important because they are much less model-dependent than the earlier work. In addition, this study is the first to incorporate personality traits such as need for cognition and the Big Five personality factors and to examine the relationships among reliability, item nonresponse, and response latency.
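For readers unfamiliar with the two reinterview statistics, the sketch below computes a gross discrepancy rate (the share of respondents whose two answers disagree) and Cohen's kappa on simulated two-wave data; the error rate and sample size are arbitrary, not values from the study.

```python
# GDR and Cohen's kappa on simulated reinterview data (numbers invented).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(5)
truth = rng.choice([0, 1], size=524, p=[0.7, 0.3])       # true binary answers
flip = lambda a: np.where(rng.uniform(size=a.size) < 0.1, 1 - a, a)
wave1, wave2 = flip(truth), flip(truth)                  # 10% error per wave

gdr = (wave1 != wave2).mean()                            # gross discrepancy rate
kappa = cohen_kappa_score(wave1, wave2)                  # chance-corrected agreement
print(f"GDR = {gdr:.3f}, kappa = {kappa:.3f}")
```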
Comparing Methods for Assessing Reliability
Tourangeau, Roger; Sun, Hanyu; Yan, Ting
Journal of Survey Statistics and Methodology, 09/2021, Volume 9, Number 4
Journal Article · Open access
The usual method for assessing the reliability of survey data has been to conduct reinterviews a short interval (such as one to two weeks) after an initial interview and to use these data to estimate relatively simple statistics, such as gross difference rates (GDRs). More sophisticated approaches have also been used to estimate reliability. These include estimates from multitrait-multimethod experiments, models applied to longitudinal data, and latent class analyses. To our knowledge, no prior study has systematically compared these different methods for assessing reliability. The Population Assessment of Tobacco and Health Reliability and Validity (PATH-RV) Study, done on a national probability sample, assessed the reliability of answers to the Wave 4 questionnaire from the PATH Study. Respondents in the PATH-RV were interviewed twice, about two weeks apart. We examined whether the classic survey approach yielded different conclusions from the more sophisticated methods. We also examined two ex ante methods for assessing problems with survey questions, along with item nonresponse rates and response times, to see how strongly these related to the different reliability estimates. We found that kappa was highly correlated with both GDRs and over-time correlations, but the latter two statistics were less highly correlated, particularly for adult respondents; estimates from longitudinal analyses of the same items in the main PATH study were also highly correlated with the traditional reliability estimates. The latent class analysis results, based on fewer items, also showed a high level of agreement with the traditional measures. The other methods and indicators had at best weak relationships with the reliability estimates derived from the reinterview data. Although the Question Understanding Aid seems to tap a different factor from the other measures, for adult respondents it did predict item nonresponse and response latencies and thus may be a useful adjunct to the traditional measures.
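To make the comparison concrete, this sketch (simulated data, arbitrary error rates) computes GDR, kappa, and the over-time correlation item by item for a small battery and then correlates the three statistics across items, the style of analysis the abstract describes. Note that for binary items kappa and the over-time correlation coincide almost exactly, so larger divergences would be expected with multi-category items.

```python
# Per-item GDR, kappa, and over-time correlation on simulated reinterview
# data, then agreement among the three statistics across items.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(9)
n, n_items = 1000, 20
stats = []
for error_rate in rng.uniform(0.02, 0.30, n_items):   # items vary in reliability
    truth = rng.choice([0, 1], size=n)
    flip = lambda a: np.where(rng.uniform(size=n) < error_rate, 1 - a, a)
    w1, w2 = flip(truth), flip(truth)
    stats.append([(w1 != w2).mean(),                  # GDR
                  cohen_kappa_score(w1, w2),          # kappa
                  pearsonr(w1, w2)[0]])               # over-time correlation

gdr, kappa, corr = np.array(stats).T
print(f"corr(GDR, kappa) = {pearsonr(gdr, kappa)[0]:+.2f}")   # strongly negative
print(f"corr(kappa, r)   = {pearsonr(kappa, corr)[0]:+.2f}")  # near +1 for binary items
```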
Prepaid survey incentives are an attractive tool for attempting to combat declining response rates. However, researchers have expressed concern that the use of incentives may bring less motivated respondents into the sample, reducing response quality. In this paper, we explore whether a prepaid cash incentive in a telephone survey affected the prevalence of 11 indicators of response quality by comparing the answers of respondents who received a $5 incentive with those of respondents who did not receive one. Significant differences between the incentive and control groups are observed for only two of these 11 indicators; the respondents who received the incentive had less item nonresponse but spent less time per question. We find some evidence that these effects may be limited to the second half of the questionnaire. We also assess whether the effect of the incentive varied according to respondent characteristics, such as age, educational attainment, and household income, finding minimal evidence that this was the case. Overall, these findings should be reassuring for researchers considering the use of prepaid cash incentives.
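The group comparison described reduces to a test on a two-by-two table (incentive versus control by indicator present versus absent). A minimal sketch with made-up counts, not the study's data:

```python
# Chi-square test comparing the prevalence of one response-quality
# indicator (e.g., any item nonresponse) across incentive conditions.
# The counts below are hypothetical, not from the study.
from scipy.stats import chi2_contingency

#         indicator present, absent
table = [[60, 440],   # $5 incentive group
         [90, 410]]   # control group
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```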
We present the results of six experiments that demonstrate the impact of visual features of survey questions on the responses they elicit, the response process they initiate, or both. All six experiments were embedded in Web surveys. Experiments 1 and 2 investigate the effects of the placement of nonsubstantive response options (for example, "No opinion" and "Don't know" answer options) in relation to the substantive options. The results suggest that when these options are not differentiated visually (by a line or a space) from the substantive options, respondents may be misled about the midpoint of the scale; respondents seemed to use the visual rather than the conceptual midpoint of the scale as a reference point for responding. Experiment 3, which varied the spacing of the substantive options, showed a similar result. Responses were pushed in the direction of the visual midpoint when it fell to one side of the conceptual midpoint of the response scale. Experiment 4 examined the effects of varying whether the response options, which were arrayed vertically, followed a logical progression from top to bottom. Respondents answered more quickly when the options followed a logical order. Experiment 5 examined the effects of the placement of an unfamiliar item among a series of similar items. For example, one set of items asked respondents to say whether several makes and models of cars were expensive or not. The answers for the unfamiliar items depended on the items that were nearby on the list. Our last experiment varied whether a battery of related items was administered on a single screen, across two screens, or with each item on its own screen. The intercorrelations among the items were highest when they were all on the same screen. Respondents seem to apply interpretive heuristics in assigning meaning to visual cues in questionnaires. They see the visual midpoint of a scale as representing the typical or middle response; they expect options to be arrayed in a progression beginning with the leftmost or topmost item; and they expect items that are physically close to be related to each other conceptually.
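For the last experiment, the outcome is the intercorrelation among related items by screen layout. The sketch below simulates that comparison; modeling same-screen presentation as adding a small shared response component is my assumption, used only to reproduce the direction of the reported effect.

```python
# Mean inter-item correlation under two hypothetical screen layouts:
# same-screen presentation adds a shared response component.
import numpy as np

rng = np.random.default_rng(11)
n, k = 800, 6                                         # respondents, items

def mean_inter_item_r(common_weight):
    latent = rng.normal(size=(n, 1))                  # shared attitude
    screen = common_weight * rng.normal(size=(n, 1))  # same-screen carryover
    items = latent + screen + rng.normal(0, 1, (n, k))
    r = np.corrcoef(items, rowvar=False)
    return r[np.triu_indices(k, 1)].mean()

print(f"one item per screen: {mean_inter_item_r(0.0):.2f}")
print(f"all on one screen  : {mean_inter_item_r(0.7):.2f}")
```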