Building an Initial Validity Argument for Binary and Analytic Rating Scales for an EFL Classroom Writing Assessment: Evidence from Many-Facets Rasch Measurement

Apichat Khamboonruang

Abstract

Although much research has compared the functioning of analytic and holistic rating scales, little research has compared binary rating scales with other scale types. This quantitative study set out to conduct a preliminary comparative validation of binary and analytic rating scales intended for formative paragraph writing assessment in a Thai EFL university classroom context. Specifically, the study applied an argument-based validation approach to build an initial validity argument for the rating scales, with emphasis on the evaluation, generalization, and explanation inferences, and employed a many-facets Rasch measurement (MFRM) approach to investigate the psychometric functioning of the rating scales as initial validity evidence. Three trained teacher raters applied the rating scales to rate the same set of 51 opinion paragraphs written by English-major students, and the resulting scores were analysed following MFRM procedures. Overall, the MFRM results revealed that (1) the rating scales largely generated accurate writing scores, supporting the evaluation inference; (2) the raters were self-consistent in applying the rating scales, contributing to the generalization inference; (3) the rating scales sufficiently captured the defined writing construct, substantiating the explanation inference; and (4) the binary rating scale showed more desirable psychometric properties than the analytic rating scale. These findings support the appropriate functioning of both rating scales and a reasonable initial validity argument for them, and highlight the greater potential of the binary rating scale to mitigate rater inconsistency and cognitive load in formative classroom assessment.
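
For reference, MFRM analyses of this kind are typically based on the many-facet rating scale model (Linacre, 1989; Eckes, 2015). A minimal sketch of that model, assuming the three facets described above (examinees, raters, and rating criteria), is:

\[
\ln\!\left(\frac{P_{nrik}}{P_{nri(k-1)}}\right) = \theta_n - \alpha_r - \delta_i - \tau_k
\]

where \(P_{nrik}\) is the probability that rater \(r\) awards examinee \(n\) a score in category \(k\) on criterion \(i\), \(\theta_n\) is examinee ability, \(\alpha_r\) is rater severity, \(\delta_i\) is criterion difficulty, and \(\tau_k\) is the threshold at which categories \(k\) and \(k-1\) are equally probable. The specific model variant (e.g., rating scale vs. partial credit parameterisation) used in the study is not stated in this abstract.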

Article Details

How to Cite
Khamboonruang, A. (2022). Building an Initial Validity Argument for Binary and Analytic Rating Scales for an EFL Classroom Writing Assessment: Evidence from Many-Facets Rasch Measurement. REFLections, 29(3), 675–699. https://doi.org/10.61508/refl.v29i3.262690
Section
Research articles

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. American Educational Research Association.

Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–74. https://doi.org/10.1080/15434300903464418

Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293. https://doi.org/10.1080/0969594X.2010.526585

Chapelle, C. A. (2021). Validity in language assessment. In P. Winke & T. Brunfaut (Eds.), The Routledge handbook of language testing (pp. 11–20). Routledge.

Chapelle, C. A., & Voss, E. (2021). Validity argument in language testing: Case studies of validation research. Cambridge University Press.

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the Test of English as a Foreign Language. Routledge.

Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9(3), 270–292. https://doi.org/10.1080/15434303.2011.649381

Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.

Eckes, T. (2019). Many-facet Rasch measurement: Implications for rater-mediated language assessment. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment volume I: Fundamental techniques (pp. 152–175). Routledge.

Engelhard, G., Jr., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge.

Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5–29. https://doi.org/10.1177/0265532209359514

Ghalib, T. K., & Al-Hattami, A. A. (2015). Holistic versus analytic evaluation of EFL writing: A case study. English Language Teaching, 8(7), 225–236. https://doi.org/10.5539/elt.v8n7p225

Han, T. (2017). Scores assigned by inexpert EFL raters to different quality EFL compositions, and the raters’ decision-making behaviors. International Journal of Progressive Education, 13(1), 136–152.

Harsch, C., & Martin, G. (2013). Comparing holistic and analytic scoring methods: Issues of validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3), 281–307. https://doi.org/10.1080/0969594X.2012.742422

Isbell, D. R. (2017). Assessing C2 writing ability on the Certificate of English Language Proficiency: Rater and examinee age effects. Assessing Writing, 34, 37–49. https://doi.org/10.1016/j.asw.2017.08.004

Jeong, H. (2017). Narrative and expository genre effects on students, raters, and performance criteria. Assessing Writing, 31, 113–125. https://doi.org/10.1016/j.asw.2016.08.006

Jeong, H. (2019). Writing scale effects on raters: An exploratory study. Language Testing in Asia, 9(20), 1–19. https://doi.org/10.1186/s40468-019-0097-4

Jiuliang, L. (2014). Examining genre effects on test takers’ summary writing performance. Assessing Writing, 22, 75–90. https://doi.org/10.1016/j.asw.2014.08.003

Jönsson, A., Balan, A., & Hartell, E. (2021). Analytic or holistic? A study about how to increase the agreement in teachers’ grading. Assessment in Education: Principles, Policy & Practice, 28(3), 212–227. https://doi.org/10.1080/0969594X.2021.1884041

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000

Kane, M. T. (2021). Articulating a validity argument. In G. Fulcher & L. Harding (Eds.), The Routledge handbook of language testing (2nd ed., pp. 32–47). Routledge.

Khamboonruang, A. (2020). Development and validation of a diagnostic rating scale for formative assessment in a Thai EFL university writing classroom: A mixed methods study [Doctoral dissertation, The University of Melbourne]. Minerva Access. http://hdl.handle.net/11343/252672

Kim, Y.-H. (2010). An argument-based validity inquiry into the empirically-derived descriptor-based diagnostic (EDD) assessment in ESL academic writing [Doctoral dissertation, University of Toronto]. TSpace. https://hdl.handle.net/1807/24786

Knoch, U. (2016). Validation of writing assessment. In C. A. Chapelle (Ed.), Encyclopedia of applied linguistics (pp. 1–6). Blackwell Publishing Ltd. https://doi.org/10.1002/9781405198431.wbeal1480

Knoch, U. (2021). Assessing writing. In G. Fulcher & L. Harding (Eds.), The Routledge handbook of language testing (2nd ed., pp. 236–253). Routledge.

Knoch, U., & Chapelle, C. A. (2018). Validation of rating processes within an argument-based framework. Language Testing, 35(4), 477–499. https://doi.org/10.1177/0265532217710049

Knoch, U., Deygers, B., & Khamboonruang, A. (2021). Revisiting rating scale development for rater-mediated language performance assessments: Modelling construct and contextual choices made by scale developers. Language Testing, 38(4), 602–626. https://doi.org/10.1177/0265532221994052

Knoch, U., Fairbairn, J., & Jin, Y. (2021). Scoring second language spoken and written performance: Issues, options, and directions. Equinox.

Lamprianou, I., Tsagari, D., & Kyriakou, N. (2021). The longitudinal stability of rating characteristics in an EFL examination: Methodological and substantive considerations. Language Testing, 38(2), 273–301. https://doi.org/10.1177/0265532220940960

Li, W. (2022). Scoring rubric reliability and internal validity in rater-mediated EFL writing assessment: Insights from many-facet Rasch measurement. Reading and Writing. https://doi.org/10.1007/s11145-022-10279-1

Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.

Linacre, J. M. (2004). Optimizing rating scale category effectiveness. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement: Theory, models and applications (pp. 258–278). JAM Press.

Linacre, J. M. (2022). Facets computer program for many-facet Rasch measurement, version 3.84.0. Winsteps.com.

Lukácsi, Z. (2021). Developing a level-specific checklist for assessing EFL writing. Language Testing, 38(1), 86–105. https://doi.org/10.1177/0265532220916703

Mahshanian, A., Eslami, A., & Ketabi, S. (2017). Raters’ fatigue and their comments during scoring writing essays: A case of Iranian EFL learners. Indonesian Journal of Applied Linguistics, 7(2), 302–314. https://doi.org/10.17509/ijal.v7i2.8347

Mendoza, A., & Knoch, U. (2018). Examining the validity of an analytic rating scale for a Spanish test for academic purposes using the argument-based approach to validation. Assessing Writing, 35, 41–55. https://doi.org/10.1016/j.asw.2017.12.003

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.

Park, H., & Yan, X. (2019). An investigation into rater performance with a holistic scale and a binary, analytic scale on an ESL writing placement test. Papers in Language Testing and Assessment, 8(2), 34–64.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. The Danish Institute of Educational Research.

Şahan, Ö., & Razı, S. (2020). Do experience and text quality matter for raters’ decision-making behaviors? Language Testing, 37(3), 311–332. https://doi.org/10.1177/0265532219900228

Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49(1), 3–12. https://doi.org/10.1093/elt/49.1.3

Wagner, M. (2015). The centrality of cognitively diagnostic assessment for advancing secondary school ESL students’ writing: A mixed methods study [Doctoral dissertation, University of Toronto]. TSpace. https://hdl.handle.net/1807/69530

Weigle, S. C. (2002). Assessing writing. Cambridge University Press.

Wiseman, C. S. (2012). A comparison of the performance of analytic vs. holistic scoring rubrics to assess L2 writing. Iranian Journal of Language Testing, 2(1), 59–92.

Yan, X., & Chuang, P.-L. (2022). How do raters learn to rate? Many-facet Rasch modeling of rater performance over the course of a rater certification program. Language Testing. https://doi.org/10.1177/02655322221074913

Zhu, Y., Fung, A. S. L., & Yang, L. (2021). A methodologically improved study on raters’ personality and rating severity in writing assessment. SAGE Open, 11(2), 1–16. https://doi.org/10.1177/21582440211009476