In some fields of research, psychologists are interested in effect sizes that are large enough to make a difference to people's subjective experience. Recently, an anchor-based method using a global rating of change was proposed as a way to quantify the smallest subjectively experienced difference: the smallest numerical difference in the outcome measure that, on average, corresponds to reported changes in people's subjective experience. According to the method, the construct of interest is measured on two occasions (Time 1 and Time 2). At Time 2, people also use an anchor-item to report how much they experienced a change in the construct. Participants are then categorized as those who stayed the same, those who changed a lot, and those who changed a little. The average change score for those who changed a little is the estimate of the smallest subjectively experienced difference. In the present study, I examined two aspects of the method's validity. First, I tested whether presenting the anchor-item before or after the Time 2 outcome measure influences the results. The results suggest that any influence of the anchor-position, if one exists, is likely to be small. Second, I examined the anchor-item's validity correlations when the delay between Time 1 and Time 2 was one day, to see whether the pattern resembles past research in which the delay was two or five days. The observed pattern of validity correlations was very similar. I note directions for future research.
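The estimation procedure described in this abstract can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the anchor response labels and the numerical scores are hypothetical.

```python
# Minimal sketch of the anchor-based method: the estimate of the
# smallest subjectively experienced difference is the mean Time2-Time1
# change score among participants who report changing "a little".

def smallest_subjective_difference(time1, time2, anchor):
    """Mean change score for participants whose anchor-item response
    indicates a small change (label "a little" is illustrative)."""
    changes = [t2 - t1 for t1, t2, a in zip(time1, time2, anchor)
               if a == "a little"]
    return sum(changes) / len(changes)

# Hypothetical outcome scores and anchor responses for five participants.
time1 = [10, 12, 8, 15, 9]
time2 = [12, 12, 11, 21, 10]
anchor = ["a little", "same", "a little", "a lot", "a little"]

print(smallest_subjective_difference(time1, time2, anchor))  # mean of 2, 3, 1
```

Participants who stayed the same or changed a lot are simply excluded from the average, mirroring the categorization step described above.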
A study was designed to validate, in a Spanish setting, the gold standard for the measurement of forgiveness, the Enright Forgiveness Inventory-30 (EFI-30). The EFI-30, a second measure of forgiveness, and a measure of social desirability were administered to a sample of 623 participants (66.8% female, 32.4% male, 0.8% did not report) with a mean age of 29 years (SD = 14.65). Results: A confirmatory factor analysis showed the best fit indexes (construct validity) for the seminal 6-factor structure (positive affect, negative affect, positive behavior, negative behavior, positive cognition, negative cognition) of the EFI-30 (factorial variance), a correlation with the measure of forgiveness (convergent validity), and a non-significant correlation with social desirability (discriminant validity). Keywords: Validity; Reliability; Spanish validation; Factorial structure; Forgiveness model.
Narrative competence has been defined as a bridge between oral and written language, given that it is acquired before children formally learn to read. This competence has been shown to be a relevant factor in reading comprehension and school learning. This study examines the narrative comprehension task proposed by Paris and Paris (2003) and adapted in Chile by Silva et al. (2014) to detect changes in the development of this skill in a purposive sample of 172 Chilean preschoolers aged 2-4 years (121) and 4-6 years (51), 52% of whom were girls. The children, who resided in areas with an average vulnerability index of 86%, attended 9 subsidized private schools in the Metropolitan Region (72% of the sample) and 6 municipal schools in the BioBío Region (29% of the sample). Results show the tool's structural validity, differential functioning by sex, and adequate internal consistency. Furthermore, the tool exhibits developmental sensitivity, yielding different results according to student age or educational level. The availability of instruments of this type makes it possible to identify children's progress in this domain and organize pedagogical work to enhance their learning in early childhood education.
This book examines test validity in the behavioral, social, and educational sciences by exploring three fundamental problems: measurement, causation, and meaning. Psychometric and philosophical perspectives receive attention, along with unresolved issues. The authors explore how measurement is conceived from both the classical and modern perspectives. The importance of understanding the underlying concepts, as well as the practical challenges of test construction and use, receives emphasis throughout. The book summarizes the current state of the field of test validity theory. Necessary background on test theory and statistics is presented as a conceptual overview where needed.
Each chapter begins with an overview of key material reviewed in previous chapters, concludes with a list of suggested readings, and features boxes with examples that connect theory to practice. These examples reflect actual situations that occurred in psychology, education, and other disciplines in the US and around the globe, bringing theory to life. Critical thinking questions related to the boxed material engage and challenge readers. A few examples include:
What is the difference between intelligence and IQ?
Can people disagree on issues of value but agree on issues of test validity?
Is it possible to ask the same question in two different languages?
The first part of the book contrasts theories of measurement as applied to the validity of behavioral science measures. The next part considers causal theories of measurement in relation to alternatives such as behavior domain sampling, and then unpacks the causal approach in terms of alternative theories of causation. The final section explores the meaning and interpretation of test scores as it applies to test validity. Each set of chapters opens with a review of the key theories and literature and concludes with a review of related open questions in test validity theory.
Researchers, practitioners,
In recent years, psychology has wrestled with the broader implications of disappointing rates of replication of previously demonstrated effects. This article proposes that many aspects of this pattern of results can be understood within the classic framework of four proposed forms of validity: statistical conclusion validity, internal validity, construct validity, and external validity. The article explains the conceptual logic for how differences in each type of validity across an original study and a subsequent replication attempt can lead to replication “failure.” Existing themes in the replication literature related to each type of validity are also highlighted. Furthermore, empirical evidence is considered for the role of each type of validity in non-replication. The article concludes with a discussion of broader implications of this classic validity framework for improving replication rates in psychological research.
In 1998, Greenwald, McGhee, and Schwartz proposed that the Implicit Association Test (IAT) measures individual differences in implicit social cognition. This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim. Most important, I show that few studies were able to test discriminant validity of the IAT as a measure of implicit constructs. I examine discriminant validity in several multimethod studies and find little or no evidence of discriminant validity. I also show that validity of the IAT as a measure of attitudes varies across constructs. Validity of the self-esteem IAT is low, but estimates vary across studies. About 20% of the variance in the race IAT reflects racial preferences. The highest validity is obtained for measuring political orientation with the IAT (64%). Most of this valid variance stems from a distinction between individuals with opposing attitudes, whereas reaction times contribute less than 10% of variance in the prediction of explicit attitude measures. In all domains, explicit measures are more valid than the IAT, but the IAT can be used as a measure of sensitive attitudes to reduce measurement error by using a multimethod measurement model.
The present study explores the plausibility of measuring personality indirectly through an artificial intelligence (AI) chatbot. This chatbot mines various textual features from users' free-text responses collected during an online conversation/interview and then uses machine learning algorithms to infer personality scores. We comprehensively examine the psychometric properties of the machine-inferred personality scores, including reliability (internal consistency, split-half, and test-retest), factorial validity, convergent and discriminant validity, and criterion-related validity. Participants were undergraduate students (n = 1,444) enrolled in a large southeastern public university in the United States who completed a self-report Big Five personality measure (IPIP-300) and engaged with an AI chatbot for approximately 20-30 min. In a subsample (n = 407), we obtained participants' cumulative grade point averages from the University Registrar and had their peers rate their college adjustment. In an additional sample (n = 61), we obtained test-retest data. Results indicated that machine-inferred personality scores (a) had overall acceptable reliability at both the domain and facet levels, (b) yielded a comparable factor structure to self-reported questionnaire-derived personality scores, (c) displayed good convergent validity but relatively poor discriminant validity (averaged convergent correlations = .48 vs. averaged machine-score correlations = .35 in the test sample), (d) showed low criterion-related validity, and (e) exhibited incremental validity over self-reported questionnaire-derived personality scores in some analyses. In addition, there was strong evidence for cross-sample generalizability of psychometric properties of machine scores. Theoretical implications, future research directions, and practical considerations are discussed.
Background: Dietary intake is an important determinant of health. Adjusting it can improve the overall risk for noncommunicable diseases, especially for patients with obesity. Dietary assessment tools are commonly used to measure self-reported habitual energy intake (EI) and the overall diet. However, these tools are affected by voluntary and involuntary misreporting. Therefore, a preference for technology-based approaches has emerged. SNAQ is an image-based food recognition app developed to remotely measure dietary intake with real-time transfer of data. Its validity in the measurement of EI has not yet been reported. The doubly labeled water (DLW) method is the gold standard for assessment of average daily energy expenditure (EE) in free-living conditions and is commonly used to test the validity of other dietary assessment tools. Methods: Dietary intake was recorded for seven days with SNAQ in 30 study participants with obesity and 30 without. Urine samples for DLW analysis were obtained with a two-point protocol. Participants captured before and after pictures of each dietary item. Pictures were automatically uploaded, dietary items and portion sizes were estimated by a trained deep-learning model, and energy, macro-, and micronutrients were calculated. Paired t-tests, mean differences, linear correlations, and Bland-Altman limits of agreement were computed. Results: EI was mostly underestimated by SNAQ when compared to the DLW method. However, underestimation of EI by SNAQ correlated with changes in body weight during the study week. Conclusions: We report low validity of this method for the measurement of energy intake in free-living, healthy-weight adult women. SNAQ underestimated total EI when compared to DLW estimations of EE. However, SNAQ seemed to perform better in participants with obesity. Further refinement of the use of SNAQ is needed for investigation of EI in free-living conditions.
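The Bland-Altman limits of agreement mentioned in the analysis can be sketched as follows. This is a generic illustration of the statistic, not the study's code; the kcal/day values are hypothetical.

```python
import statistics

def bland_altman_limits(x, y):
    """Mean paired difference (bias) and 95% limits of agreement,
    i.e. bias +/- 1.96 * SD of the paired differences."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical app-reported energy intake vs DLW-derived energy
# expenditure (kcal/day) for five participants.
app_ei = [1800, 2100, 1650, 1900, 2000]
dlw_ee = [2200, 2300, 2000, 2100, 2400]

bias, lower, upper = bland_altman_limits(app_ei, dlw_ee)
print(round(bias, 1), round(lower, 1), round(upper, 1))
```

A negative bias with limits entirely below zero, as in this toy data, corresponds to the systematic underestimation of EI relative to DLW reported above.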
In recent years, increasing attention has been paid to problems of external validity, specifically to methodological approaches for both quantitative generalizability and transportability of study results. However, most approaches to these issues have considered external validity separately from internal validity. Here we argue that considering either internal or external validity in isolation may be problematic. Further, we argue that a joint measure of the validity of an effect estimate with respect to a specific population of interest may be more useful: we call this proposed measure target validity. In this work, we introduce and formally define target bias as the total difference between the true causal effect in the target population and the estimated causal effect in the study sample, and target validity as the condition that target bias equals zero. We illustrate this measure with a series of examples and show how it may help us think more clearly about comparisons between experimental and nonexperimental research results. Specifically, we show that even perfect internal validity does not ensure that a causal effect will be unbiased in a specific target population.
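The verbal definition of target bias can be written compactly. The notation below is illustrative rather than the authors' own: write $\tau_{\text{target}}$ for the true causal effect in the target population and $\hat{\tau}_{\text{study}}$ for the causal effect estimated in the study sample.

```latex
\[
\text{target bias} = \tau_{\text{target}} - \hat{\tau}_{\text{study}},
\qquad
\text{target validity} \iff \text{target bias} = 0.
\]
```

On this definition, internal validity (unbiased estimation within the study sample) is necessary but not sufficient for target validity, since the sample's effect may differ from the target population's.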