A critical, but often neglected, component of any large-scale assessment program is the reporting of test results. In the past decade, a body of evidence has been compiled that raises concerns over the ways in which these results are reported to and understood by their intended audiences. In this study, current approaches for reporting student-level results on large-scale assessments were investigated. Recent student test score reports and interpretive guides from 11 states, three U.S. commercial testing companies, and two Canadian provinces were reviewed. On the basis of past score-reporting research, testing standards, and the requirements of the No Child Left Behind Act of 2001, a number of promising and potentially problematic features of these reports and guides are identified, and recommendations are offered to help enhance future score-reporting designs and to inform future research in this important area.
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard-setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities, while other judges' ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.
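The internal-consistency index described above can be sketched in code. This is hypothetical scaffolding, not the authors' actual procedure: `conditional_p_correct` estimates each item's empirical conditional probability from examinees whose ability estimates fall near the cut score, and the judge-level index is simply the correlation between one judge's Angoff ratings and those probabilities. Function names, the ability band, and the data layout are all invented for illustration.

```python
import numpy as np

def conditional_p_correct(responses, ability, cut, band=0.25):
    """Empirical conditional probability of success per item,
    estimated from examinees whose ability estimate lies within
    `band` of the cut score. `responses` is examinees x items (0/1)."""
    near = np.abs(ability - cut) <= band
    return responses[near].mean(axis=0)

def judge_consistency(angoff_ratings, responses, ability, cut):
    """Correlate one judge's per-item Angoff ratings with the
    empirical conditional probabilities: an internal-consistency
    index for that judge (near 1 = highly consistent, near 0 = none)."""
    p = conditional_p_correct(responses, ability, cut)
    return np.corrcoef(angoff_ratings, p)[0, 1]
```

A judge whose ratings track the borderline examinees' actual success rates scores near 1 on this index; a judge whose ratings are unrelated to them scores near 0, mirroring the range of results reported above.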
The goal of the present study is to develop a questionnaire, with sound psychometric properties and up-to-date norms, to evaluate burnout syndrome in Spain. The operational definition of burnout proposed by Maslach and Jackson is used to define three dimensions (Emotional Exhaustion, Depersonalization, and Personal Accomplishment). A total of 2,403 Spanish national police officers participated. Evidence of construct validity was checked through cross-validation (showing a good fit of the three-factor model to the data). Using the MBI, NEO-FFI, and CECAD, evidence of convergent and criterion validity was gathered (showing relations similar to those reported in other research). The discrimination, mean, standard deviation, and standard error of the mean of the items composing the various dimensions were analyzed. Both Cronbach's alpha coefficient and the conditional standard error of measurement (CSEM) were calculated for each dimension. The results showed good internal consistency (all α values > .85). Finally, the questionnaire was scaled using T scores. The psychometric properties reported here support the use of this new questionnaire for evaluating burnout in Spanish police officers.
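As a concrete illustration of the internal-consistency analysis reported above, here is a minimal Cronbach's alpha computation: a generic sketch, not the authors' code, with invented variable names.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)          # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of respondents' sums
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

Applied per dimension, a result above .85 would match the internal-consistency threshold the study reports for all three burnout dimensions.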
...remarks about the articles in this special issue are provided in order to set a context for readers.

II. International Test Commission guidelines for translating and adapting tests

In 1992 the International Test Commission (ITC) began a project to prepare guidelines for translating and adapting tests and psychological instruments, and for establishing score equivalence across language and/or cultural groups. Zumbo provides convincing evidence that item-level bias can still be present when structural equation modeling of the test in two languages reveals an equivalent factorial structure. Since it is the scores from a test or instrument that are ultimately used to achieve the intended purpose, the scores may be contaminated by item-level bias and, ultimately, valid inferences from the test scores become problematic. Stephen Sireci from the University of Massachusetts in the USA and Avi Allalouf from the National Institute for Testing and Evaluation in Israel continue the theme introduced by Zumbo. ...steps include adaptations being prepared by separate translators, with the products of these translators then judged by other translators, ultimately producing a single adaptation that represents the best translation possible from multiple translators. Because several participating countries deviated at the national level from the prescribed procedures, Grisay was able to analyse and compare the effectiveness of different procedures. While language testing is not the focus of each article, the methodological advances and the practical work described will surely be relevant and interesting to researchers working in the language testing field. ...as Stansfield suggests in his contribution to this special issue of Language Testing, it may be highly appropriate that the specific expertise of language testers be involved to some degree in any test translation/adaptation activity.
Purpose: The purposes of this study were to apply a bifactor model for the determination of test dimensionality and a multidimensional CAT using computer simulations of real data for the assessment of a new global physical health measure for children with cerebral palsy (CP). Methods: Parent respondents of 306 children with cerebral palsy were recruited from four pediatric rehabilitation hospitals and outpatient clinics. We compared confirmatory factor analysis results across four models: (1) one-factor unidimensional; (2) two-factor multidimensional (MIRT); (3) bifactor MIRT with fixed slopes; and (4) bifactor MIRT with varied slopes. We tested whether the general and content (fatigue and pain) person score estimates could discriminate across severity and types of CP, whether score estimates from a simulated CAT were similar to estimates based on the total item bank, and whether they correlated as expected with external measures. Results: Confirmatory factor analysis suggested separate pain and fatigue sub-factors; all 37 items were retained in the analyses. From the bifactor MIRT model with fixed slopes, the full item bank scores discriminated across levels of severity and types of CP, and compared favorably to external instruments. CAT scores based on 10- and 15-item versions accurately captured the global physical health scores. Conclusions: The bifactor MIRT CAT application, especially the 10- and 15-item versions, yielded accurate global physical health scores that discriminated across known severity groups and types of CP, and correlated as expected with concurrent measures. The CATs have potential for collecting complex data on the physical health of children with CP in an efficient manner.
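The post-hoc CAT simulation logic (administer the most informative remaining item, score it, update the ability estimate) can be sketched for the simpler unidimensional 2PL case. This is a generic illustration, not the study's bifactor MIRT implementation; all item parameters, names, and design choices below are invented.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function: P(correct | ability theta),
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def simulate_cat(true_theta, a, b, n_items=10, rng=None):
    """Post-hoc CAT: repeatedly administer the most informative unused
    item, draw a model-consistent response, and update theta by EAP."""
    rng = rng if rng is not None else np.random.default_rng(0)
    grid = np.linspace(-4, 4, 161)
    posterior = np.exp(-0.5 * grid**2)       # standard normal prior on ability
    used, theta = [], 0.0
    for _ in range(n_items):
        p = p_2pl(theta, a, b)
        info = a**2 * p * (1.0 - p)          # 2PL Fisher information per item
        info[used] = -np.inf                 # never reuse an item
        j = int(np.argmax(info))
        used.append(j)
        correct = rng.random() < p_2pl(true_theta, a[j], b[j])
        pj = p_2pl(grid, a[j], b[j])
        posterior *= pj if correct else (1.0 - pj)
        theta = float((grid * posterior).sum() / posterior.sum())  # EAP estimate
    return theta, used
```

Running such a simulation over many respondents and comparing the short-form estimates to full-bank estimates is the generic version of the 10- and 15-item accuracy check described above.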
Throughout the 40-year history of standardized patient (SP) assessments and OSCEs, there have been numerous advancements, including many that involve scoring the simulated clinical encounters. While there is no clear agreement on how examinees' performance should be documented or scored in an encounter, there is a consensus that several well-chosen SP encounters are required to produce reliable examinee scores. There also continues to be some debate as to who should do the scoring on an SP-based assessment. While logistics and cost will certainly play a role, it is probably best to use the person who is most familiar with the domain being assessed. In some instances this will be the SP; in others, an outside observer or content expert. Finally, with the growing use of OSCEs for summative purposes (e.g. certification, licensure), special attention must be paid to fairness issues. Since the same test form cannot be used day after day, examinee scores must be 'equated', taking into account the psychometric properties of scores from individual cases and individual SPs. To date, the CSA has been one of the highest-volume, high-stakes, standardized patient assessments to be developed and successfully administered. In 2003 alone, over 11,500 IMGs were tested. The early conceptual framework for this assessment was synthesized from the research endeavours of several notable individuals, including, amongst many others, Harden et al. 1975, Swanson & Stillman, 1990, Newble & Swanson, 1988, Vu et al. 1992 and Colliver, 1995. The early prototype administrations of the CSA, including many operational research studies, were supported and guided by Dr Friedman Ben-David, Friedman et al. 1991, 1993, Stillman et al. 1992, and Sutnick et al. 1993, 1995.
Objective: To develop and evaluate a prototype measure (OA-DISABILITY-CAT) for osteoarthritis research using item response theory (IRT) and computer-adaptive test (CAT) methodologies. Study Design and Setting: We constructed an item bank consisting of 33 activities commonly affected by lower extremity (LE) osteoarthritis. A sample of 323 adults with LE osteoarthritis reported their degree of limitation in performing everyday activities, and completed the Health Assessment Questionnaire-II (HAQ-II). We used confirmatory factor analyses to assess scale unidimensionality and IRT methods to calibrate the items and examine the fit of the data. Using CAT simulation analyses, we examined the performance of OA-DISABILITY-CATs of different lengths compared with the full item bank and the HAQ-II. Results: One distinct disability domain was identified. The 10-item OA-DISABILITY-CAT demonstrated a high degree of accuracy compared with the full item bank (r = 0.99). The item bank and the HAQ-II scales covered a similar estimated scoring range. In terms of reliability, 95% of OA-DISABILITY reliability estimates were over 0.83 vs. 0.60 for the HAQ-II. Except at the highest scores, the 10-item OA-DISABILITY-CAT demonstrated superior precision to the HAQ-II. Conclusion: The prototype OA-DISABILITY-CAT demonstrated promising measurement properties compared with the HAQ-II, and is recommended for use in LE osteoarthritis research.
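Reliability and precision figures like those above derive from IRT test information: at ability level theta, the conditional standard error is 1/sqrt(I(theta)), and an IRT analogue of reliability at that level is 1 - SE^2, assuming abilities are scaled to unit population variance. A minimal 2PL sketch (illustrative parameters only; not the study's calibration):

```python
import numpy as np

def conditional_se(theta, a, b):
    """Conditional standard error of a 2PL item bank at ability theta:
    SE(theta) = 1 / sqrt(total Fisher information at theta)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    info = (a**2 * p * (1.0 - p)).sum()
    return 1.0 / np.sqrt(info)

def conditional_reliability(theta, a, b):
    """IRT analogue of score-level reliability, assuming the ability
    metric has unit population variance."""
    return 1.0 - conditional_se(theta, a, b) ** 2
```

For example, a hypothetical bank of 20 items all with a = 1 and b = 0 yields information 5.0 at theta = 0, so SE is about 0.45 and conditional reliability 0.80; precision drops toward the extremes of the scale, matching the pattern of weaker precision at the highest scores noted above.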
The equal-ability-distribution assumption associated with the equivalent-groups equating design was investigated in the context of a selection test for admission to higher education. The purpose was to assess the consequences for test-takers of receiving improperly high or low scores compared to their peers, and to find strong empirical evidence of potential violations of the assumption. Test-takers' scores on anchor items from two subtests were estimated using information about test-taker performance on the regular subtests. The results indicated that anchor-item performance varied substantially across groups, in both means and spreads, so the equal-ability-distribution assumption could be questioned. The estimated differences between cohorts of test-takers are also large enough to affect actual admissions decisions. Consequently, more caution is needed when applying the equivalent-groups design in the equating of tests. Assuming equally able groups is convenient, but it can lead to systematic bias in equated test scores, with potentially severe implications for test-takers; this study provides a demonstration of that point.
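For context, the equivalent-groups design equates two forms by assuming the groups taking them are equally able, so any difference between the score distributions is attributed to form difficulty. A minimal linear-equating sketch under that assumption (a generic illustration, not the study's method):

```python
import numpy as np

def linear_equate(x_scores, y_scores):
    """Linear equating under the equivalent-groups design: map form-X
    scores onto the form-Y scale by matching the means and SDs of the
    two (assumed equally able) groups' score distributions."""
    mx, sx = np.mean(x_scores), np.std(x_scores, ddof=1)
    my, sy = np.mean(y_scores), np.std(y_scores, ddof=1)
    def to_y_scale(x):
        return my + sy * (np.asarray(x, dtype=float) - mx) / sx
    return to_y_scale
```

The sketch makes the risk concrete: if the group taking form Y is genuinely more able, its higher mean is misread as form X being harder, and every converted score is systematically shifted — the bias the study warns about.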
The specific aims of this study were to (1) examine the psychometric properties (unidimensionality, differential item functioning, scale coverage) of an item bank of upper-extremity skills for children and adolescents with cerebral palsy (CP); (2) evaluate a simulated computer-adaptive test (CAT) using this item bank; (3) examine the concurrent validity of the CAT with the Pediatric Outcomes Data Collection Instrument (PODCI) upper-extremity core scale; and (4) determine the discriminant validity of the simulated CAT with Manual Ability Classification System (MACS) levels and CP type (i.e. diplegia, hemiplegia, or quadriplegia). Parents (n=180) of children and adolescents with CP (spastic diplegia 49%, hemiplegia 22%, or quadriplegia 28%), consisting of 102 males and 78 females with a mean age of 10 years 6 months (SD 4y 1mo, range 2–21y) and MACS levels I through V, participated in calibration of an item pool and completed the PODCI. Confirmatory factor analyses supported a unidimensional model using 49 of the 53 upper-extremity items. Simulated CATs of 5, 10, and 15 items demonstrated excellent accuracy (intraclass correlation coefficients [ICCs] >0.93) with the full item bank, had high correlations with the PODCI upper-extremity core scale score (ICC 0.79), and discriminated among MACS levels. The simulated CATs demonstrated excellent overall content coverage over a wide age span and severity of upper-extremity involvement. The future development and refinement of CATs for parent report of physical function in children and adolescents with CP is supported by our work.
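The accuracy criterion above is an intraclass correlation between simulated-CAT scores and full-bank scores. One common absolute-agreement form, ICC(2,1), can be computed from a two-way ANOVA decomposition; this is a generic sketch, and the specific ICC variant the study used is an assumption here, not stated in the abstract.

```python
import numpy as np

def icc_2_1(x, y):
    """Two-way random, absolute-agreement, single-measure ICC(2,1)
    for two score sets on the same subjects (e.g. CAT vs full bank)."""
    data = np.column_stack([x, y])
    n, k = data.shape
    gm = data.mean()
    ss_rows = k * ((data.mean(axis=1) - gm) ** 2).sum()   # between subjects
    ss_cols = n * ((data.mean(axis=0) - gm) ** 2).sum()   # between measures
    ss_err = ((data - gm) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Unlike a plain Pearson correlation, this form penalizes systematic offsets between the two score sets, which is why it suits an accuracy (agreement) claim such as ICCs above 0.93.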
In summary, readers are encouraged to read carefully the IRT articles in this special issue. They provide theoretical as well as practical arguments in favor of using IRT models in the health outcomes measurement field. At the same time, readers should be aware that the IRT field is complex and software is limited and, in my judgment, not very user-friendly (although some packages are better than others). In addition, sample sizes will often need to be larger than in classical measurement, at least with the more general IRT models, and applications are rarely straightforward. Considerable practical experience is needed to ensure successful applications of IRT in the development and validation of instruments for health outcomes measurement.