Background: In recent years, the adaptation of tests from one culture to another has increased across all areas of assessment. We live in an increasingly multicultural and multilingual environment in which tests are used to support decision-making. The goal of this paper is to present the second edition of the International Test Commission (ITC) guidelines for adapting tests across cultures. Method: A task force of six international experts reviewed the original guidelines proposed by the International Test Commission, taking into account advances in the field since their initial formulation. Results: The new edition consists of twenty guidelines grouped into six sections: pre-condition, test development, confirmation, administration, score scales and interpretation, and documentation. The different sections are reviewed, and possible sources of error that can affect the translation and adaptation of tests are examined. Conclusions: Twenty guidelines are proposed to guide the translation and adaptation of tests across cultures. Future perspectives of the guidelines are discussed in relation to new developments in the field of psychological and educational assessment.
Adapting tests across cultures is a common practice that has increased in all evaluation areas in recent years. We live in an increasingly multicultural and multilingual world in which tests are used to support decision-making in educational, clinical, organizational, and other areas, so the adaptation of tests becomes a necessity. The main goal of this paper is to present the second edition of the guidelines of the International Test Commission (ITC) for adapting tests across cultures.
A task force of six international experts reviewed the original guidelines proposed by the International Test Commission, taking into account advances and developments in the field.
As a result of the revision, this new edition consists of twenty guidelines grouped into six sections: pre-condition, test development, confirmation, administration, score scales and interpretation, and documentation. The different sections are reviewed, and the possible sources of error influencing test translation and adaptation are analyzed.
Twenty guidelines are proposed for translating and adapting tests across cultures. Finally, we discuss future perspectives of the guidelines in relation to new developments in the field of psychological and educational assessment.
Background: To improve the quality of test translation and adaptation, and hence the comparability of scores across cultures, the International Test Commission (ITC) proposed a number of guidelines for the adaptation process. Although these guidelines are well-known, they are not implemented as often as they should be. One possible reason for this is the broad scope of the guidelines, which makes them difficult to apply in practice. The goal of this study was therefore to draw up an evaluative criterion checklist that would help test adapters to implement the ITC recommendations and which would serve as a model for assessing the quality of test adaptations. Method: Each ITC guideline was operationalized through a number of criteria. For each criterion, acceptable and excellent levels of accomplishment were proposed. The initial checklist was then reviewed by a panel of 12 experts in testing and test adaptation. The resulting checklist was applied to two different tests by two pairs of independent reviewers. Results: The final evaluative checklist consisted of 29 criteria covering all phases of test adaptation: planning, development, confirmation, administration, score interpretation, and documentation. Conclusions: We believe that the proposed evaluative checklist will help to improve the quality of test adaptation.
Adapting Educational and Psychological Tests for Cross-Cultural Assessment critically examines and advances new methods and practices for adapting tests for cross-cultural assessment and research. The International Test Commission (ITC) guidelines for test adaptation and conceptual and methodological issues in test adaptation are described in detail, and questions of ethics and concern for validity of test scores in cross-cultural contexts are carefully examined. Advances in test translation and adaptation methodology, including statistical identification of flawed test items, establishing equivalence of different language versions of a test, and methodologies for comparing tests in multiple languages, are reviewed and evaluated. The book also focuses on adapting ability, achievement, and personality tests for cross-cultural assessment in educational, industrial, and clinical settings.
This book furthers the ITC's mission of stimulating research on timely topics associated with assessment. It provides an excellent resource for courses in psychometric methods, test construction, and educational and/or psychological assessment, testing, and measurement. Written by internationally known scholars in psychometric methods and cross-cultural psychology, the collection of chapters should also provide essential information for educators and psychologists involved in cross-cultural assessment, as well as students aspiring to such careers.
Contents: Preface. Part I: Cross-Cultural Adaptation of Educational and Psychological Tests: Theoretical and Methodological Issues. R.K. Hambleton, Issues, Designs, and Technical Guidelines for Adapting Tests Into Multiple Languages and Cultures. F.J.R. van de Vijver, Y.H. Poortinga, Conceptual and Methodological Issues in Adapting Tests. T. Oakland, Selected Ethical Issues Relevant to Test Adaptations. S.G. Sireci, L. Patsula, R.K. Hambleton, Statistical Methods for Identifying Flaws in the Test Adaptation Process. S.G. Sireci, Using Bilinguals to Evaluate the Comparability of Different Language Versions of a Test. L.L. Cook, A.P. Schmitt-Cascallar, Establishing Score Comparability for Tests Given in Different Languages. L.L. Cook, A.P. Schmitt-Cascallar, C. Brown, Adapting Achievement and Aptitude Tests: A Review of Methodological Issues. Part II: Cross-Cultural Adaptation of Educational and Psychological Tests: Applications to Achievement, Aptitude, and Personality Tests. C.T. Fitzgerald, Test Adaptation in a Large-Scale Certification Program. C.Y. Maldonado, K.F. Geisinger, Conversion of the Wechsler Adult Intelligence Scale Into Spanish: An Early Test Adaptation Effort of Considerable Consequence. N.K. Tanzer, Developing Tests for Use in Multiple Languages and Cultures: A Plea for Simultaneous Development. F. Drasgow, T.M. Probst, The Psychometrics of Adaptation: Evaluating Measurement Equivalence Across Languages and Cultures. M. Beller, N. Gafni, P. Hanani, Constructing, Adapting, and Validating Admissions Tests in Multiple Languages: The Israeli Case. P.F. Merenda, Cross-Cultural Adaptation of Educational and Psychological Testing. C.D. Spielberger, M.S. Moscoso, T.M. Brunner, Cross-Cultural Assessment of Emotional States and Personality Traits.
Using PISA 2012 data, the present study explored profiles of mathematics anxiety (MA) among 15-year-old students from Finland, Korea, and the United States to determine the similarities and differences in MA across the three national samples by applying multi-group latent profile analysis (LPA). The major findings were that (a) three MA profiles were found in all three national samples, i.e., Low MA, Mid MA, and High MA, and (b) the percentages of students classified into each of the three MA profiles differed across the Finnish, Korean, and American samples, with the United States having the highest prevalence of High MA and Finland the lowest. Multi-group LPA also provided clear and useful latent profile separation. The High MA profile demonstrated significantly poorer mathematics performance and lower mathematics interest, self-efficacy, and self-concept than the Mid and Low MA profiles; the same differences appeared between the Mid and Low MA profiles. The implications of the findings seem clear: (1) it is possible that there is some relative level of universality in MA among 15-year-old students that is independent of cultural context; and (2) multi-group LPA could be a useful analytic tool for studying the classification and cultural differences of MA.
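Latent profile analysis assigns each respondent to an unobserved profile via a finite mixture model. As a minimal sketch (not the study's analysis), the snippet below fits a one-dimensional Gaussian mixture by EM; real multi-group LPA is multivariate and adds equality constraints across groups. All data and the `em_gmm_1d` helper are hypothetical.

```python
import math

def em_gmm_1d(x, k=2, iters=200):
    """Minimal EM for a one-dimensional Gaussian mixture, the statistical
    core of latent profile analysis. Returns mixing weights, means, and
    variances of the k profiles."""
    xs = sorted(x)
    # Deterministic spread-out initialization over the order statistics.
    mu = [xs[i * (len(xs) - 1) // max(k - 1, 1)] for i in range(k)]
    var = [1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each profile for each observation.
        resp = []
        for xi in x:
            w = [pi[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(xi - mu[j]) ** 2 / (2 * var[j]))
                 for j in range(k)]
            s = sum(w)
            resp.append([wj / s for wj in w])
        # M-step: re-estimate mixing weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(x)
            mu[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var[j] = max(sum(r[j] * (xi - mu[j]) ** 2
                             for r, xi in zip(resp, x)) / nj, 1e-6)
    return pi, mu, var

# Hypothetical standardized anxiety scores forming two separated profiles.
scores = [-3.1, -3.05, -3.0, -2.95, -2.9, 2.9, 2.95, 3.0, 3.05, 3.1]
pi, mu, var = em_gmm_1d(scores, k=2)
```

With well-separated data like this, the recovered means land near the two cluster centers and the mixing weights near 0.5 each; classification then follows from the E-step responsibilities.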
Item response theory (IRT) has become a popular methodological framework for modeling response data from assessments in education and health; however, its use is not widespread among psychologists. This paper aims to provide a didactic application of IRT and to highlight some of its advantages for psychological test development. IRT was applied to two scales (a positive and a negative affect scale) of a self-report test. Respondents were 853 university students (57% women) between the ages of 17 and 35 who answered the scales. IRT analyses revealed that the positive affect scale has items with moderate discrimination that measure respondents below the average score more effectively. The negative affect scale also presented items with moderate discrimination that evaluate respondents across the trait continuum, although with much less precision. Some features of IRT are used to show how such results can improve the measurement of the scales. The authors illustrate and emphasize how knowledge of the features of IRT may allow test makers to refine and increase the validity and reliability of other psychological measures.
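The statements about discrimination and where a scale measures best come directly from the item response function and its Fisher information. A minimal sketch under the two-parameter logistic (2PL) model, with illustrative (not estimated) parameters:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve:
    P(theta) = 1 / (1 + exp(-a * (theta - b))),
    where a is discrimination and b is difficulty (location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P).
    It peaks at theta = b, so an item located below the trait mean
    measures low-scoring respondents most precisely."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# A moderately discriminating item located below the trait mean,
# matching the pattern the abstract describes for the positive scale.
a, b = 1.2, -0.8
for theta in (-2.0, -0.8, 0.0, 2.0):
    print(f"theta={theta:+.1f}  P={icc_2pl(theta, a, b):.3f}  "
          f"I={item_information(theta, a, b):.3f}")
```

Summing item information across a scale gives the test information curve, which is how the abstract's claim of "much less precision" for the negative affect scale would be quantified.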
In item response theory (IRT) models, assessing model-data fit is an essential step in IRT calibration. While no general agreement has ever been reached on the best methods or approaches for detecting misfit, a perhaps more important observation from the research findings is that studies rarely evaluate IRT misfit by focusing on its practical consequences. This study investigated the practical consequences of IRT model misfit for equating performance and the classification of examinees into performance categories in a simulation study that mimics a typical large-scale statewide assessment program with mixed-format test data. The simulation varied three factors: choice of IRT model, amount of growth/change in examinees' abilities between two adjacent administration years, and choice of IRT scaling method. Findings indicated that the extent of significant consequences of model misfit varied with the choice of model and IRT scaling method. In comparison with the mean/sigma (MS) and Stocking-Lord characteristic curve (SL) methods, separate calibration with linking and the fixed common item parameter (FCIP) procedure was more sensitive to model misfit and more robust against various amounts of ability shift between two adjacent administrations regardless of model fit. SL was generally the least sensitive to model misfit in recovering the equating conversion, and MS was the least robust against ability shifts in recovering the equating conversion when a substantial degree of misfit was present. The key messages from the study are that practical ways are available to study model fit, and that model fit or misfit can have consequences that should be considered when choosing an IRT model.
Not only does the study address the consequences of IRT model misfit, but it also aims to help researchers and practitioners find practical ways to study model fit and to investigate the validity of particular IRT models for a specified purpose, so that the successful use of IRT models is realized and their applications with educational and psychological test data are improved.
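Of the scaling methods compared above, mean/sigma (MS) is the simplest to show concretely: it derives linear linking constants from the common items' difficulty estimates in two separate calibrations. A minimal sketch with hypothetical difficulty values (the study's actual data and conditions are simulated and not reproduced here):

```python
import statistics

def mean_sigma(b_new, b_old):
    """Mean/sigma linking: find A, B such that A*b + B maps the new
    form's common-item difficulties onto the old form's scale, by
    matching the mean and standard deviation of the two sets."""
    A = statistics.stdev(b_old) / statistics.stdev(b_new)
    B = statistics.mean(b_old) - A * statistics.mean(b_new)
    return A, B

def transform(a, b, A, B):
    """Apply the linking constants to one item's 2PL parameters:
    b* = A*b + B, a* = a / A."""
    return a / A, A * b + B

# Hypothetical common-item difficulties from two separate calibrations.
b_new = [-1.0, -0.2, 0.4, 1.1]
b_old = [-0.8, 0.1, 0.8, 1.6]
A, B = mean_sigma(b_new, b_old)
print(f"A={A:.3f}  B={B:.3f}")
```

Because MS matches only the first two moments of the difficulty estimates, it is more exposed to misfit and ability shift than characteristic-curve methods such as Stocking-Lord, which match entire test characteristic curves.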
Repeatedly using items in high-stakes testing programs gives test takers a chance to gain knowledge of particular items in advance of test administration. A predictive checking method is proposed to detect whether a person uses preknowledge of repeatedly used items (i.e., possibly compromised items) by using information from secure items that have zero or very low exposure rates. Responses to the secure items are first used to estimate a person's proficiency distribution, and then the corresponding predictive distribution for the person's responses to the possibly compromised items is constructed. The use of preknowledge is identified by comparing the observed responses to the predictive distribution. Different estimation methods for obtaining a person's proficiency distribution and different choices of test statistic in predictive checking are considered. A simulation study was conducted to evaluate the empirical Type I error and power of the proposed method. The simulation results suggested that the Type I error of this method is well controlled, and that the method is effective in detecting preknowledge when a large proportion of items are compromised, even with a short secure section. An empirical example is also presented to demonstrate its practical use.
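The logic of predictive checking can be sketched in a few lines: estimate proficiency from the secure items, simulate scores on the flagged items under that estimate, and see how extreme the observed score is. This is a simplified sketch, not the paper's method: it uses a 2PL point estimate and a number-correct statistic, whereas the paper works with a full proficiency distribution and compares several test statistics. All item parameters are hypothetical.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def theta_mle_grid(responses, items):
    """Crude grid-search MLE of proficiency from the secure section.
    items is a list of (a, b) pairs matching the responses."""
    def loglik(t):
        ll = 0.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(t, a, b)
            ll += math.log(p) if u else math.log(1.0 - p)
        return ll
    return max((g / 10.0 for g in range(-40, 41)), key=loglik)

def predictive_pvalue(theta_hat, flagged_items, observed_score,
                      reps=5000, seed=7):
    """Monte Carlo predictive p-value: how often a person of ability
    theta_hat would reach observed_score on the possibly compromised
    items. A very small value flags likely preknowledge."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sim = sum(rng.random() < p_correct(theta_hat, a, b)
                  for a, b in flagged_items)
        hits += sim >= observed_score
    return hits / reps

# A low-ability pattern on five secure items, followed by a perfect
# score on ten possibly compromised items of average difficulty.
secure_items = [(1.0, 0.0)] * 5
theta_hat = theta_mle_grid([0, 0, 0, 0, 1], secure_items)
pval = predictive_pvalue(theta_hat, [(1.0, 0.0)] * 10, observed_score=10)
```

Here the perfect score on the flagged items is wildly inconsistent with the ability implied by the secure section, so the predictive p-value is essentially zero and the pattern would be flagged.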
Background: The construction and evaluation of item banks to measure unidimensional constructs of health-related quality of life (HRQOL) is a fundamental objective of the Patient-Reported Outcomes Measurement Information System (PROMIS) project. Objectives: Item banks will be used as the foundation for developing short-form instruments and enabling computerized adaptive testing. The PROMIS Steering Committee selected 5 HRQOL domains for initial focus: physical functioning, fatigue, pain, emotional distress, and social role participation. This report provides an overview of the methods used in the PROMIS item analyses and the proposed calibration of item banks. Analyses: Analyses include evaluation of data quality (eg, logic and range checking, spread of response distribution within an item), descriptive statistics (eg, frequencies, means), item response theory model assumptions (unidimensionality, local independence, monotonicity), model fit, differential item functioning, and item calibration for banking. Recommendations: Key analytic issues are summarized, and recommendations are provided for future evaluations of item banks in HRQOL assessment.
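Of the IRT assumptions listed, monotonicity is the easiest to check nonparametrically: an item's mean response should not decrease as the rest score (total score minus that item) increases. A minimal sketch with toy dichotomous data; the helpers are illustrative, not PROMIS code.

```python
from collections import defaultdict

def rest_score_means(item_responses, rest_scores):
    """Mean response to one item within each rest-score group (total
    score minus the studied item). Under a monotone IRT model these
    group means should be non-decreasing in the rest score."""
    groups = defaultdict(list)
    for u, r in zip(item_responses, rest_scores):
        groups[r].append(u)
    return [(r, sum(g) / len(g)) for r, g in sorted(groups.items())]

def is_manifest_monotone(means, tol=0.0):
    """True if successive rest-score means never drop by more than tol."""
    return all(m2 >= m1 - tol for (_, m1), (_, m2) in zip(means, means[1:]))

# Toy data: one item's 0/1 responses and each person's rest score.
item = [0, 0, 0, 1, 1, 1, 1]
rest = [0, 1, 2, 2, 3, 4, 4]
means = rest_score_means(item, rest)
print(means, is_manifest_monotone(means))
```

In practice a positive `tol` (or a formal test such as those in Mokken scale analysis) guards against flagging sampling noise in small rest-score groups as violations.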