Comparing Estimated and Real Item Difficulty Using Multi-Facet Rasch Analysis
Year 2023, Volume 14, Issue 4, pp. 440-454, 31.12.2023
Ayfer Sayın, Sebahat Gören
Abstract
This study aimed to compare item difficulty estimated from expert judgment with real item difficulty estimated from response data. Because some high-stakes tests are not pre-tested for security reasons, and because teachers estimate item difficulty when constructing classroom assessments, it is necessary to examine how accurately experts predict item difficulty. For this study, we developed a 12-item test modeled on the Turkish teacher certification exam. Item difficulty was estimated separately from the responses of 1165 students and from the judgments of 12 experts, and the two sets of estimates were compared. The study revealed that the experts estimated item difficulty well for items of moderate difficulty; however, they tended to underestimate the difficulty of items categorized as medium-easy.
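The core comparison can be sketched in a few lines of code. The example below is a simplified illustration, not the study's multi-facet Rasch analysis: it converts each item's proportion-correct into a logit-scale difficulty and compares it with the difficulty implied by averaged expert ratings. The response matrix and expert estimates are hypothetical placeholders generated for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 1165 examinees x 12 items of 0/1 responses,
# and 12 experts' estimated proportion-correct for each item.
n_examinees, n_items, n_experts = 1165, 12, 12
true_p = np.linspace(0.35, 0.85, n_items)  # assumed item easiness, for simulation only
responses = (rng.random((n_examinees, n_items)) < true_p).astype(int)
expert_p = np.clip(true_p + rng.normal(0, 0.08, (n_experts, n_items)), 0.05, 0.95)

def p_to_logit_difficulty(p):
    """Convert proportion-correct to a logit-scale difficulty.

    Higher values mean harder items, matching the direction of Rasch
    item difficulty, though this is not a full Rasch calibration.
    """
    return np.log((1 - p) / p)

empirical_b = p_to_logit_difficulty(responses.mean(axis=0))  # from student data
estimated_b = p_to_logit_difficulty(expert_p).mean(axis=0)   # from expert judgments

# Agreement between the two difficulty orderings, and per-item direction of error:
r = np.corrcoef(empirical_b, estimated_b)[0, 1]
print(f"correlation between estimated and empirical difficulty: {r:.2f}")
for i, (b_emp, b_est) in enumerate(zip(empirical_b, estimated_b), start=1):
    flag = "underestimated" if b_est < b_emp else "overestimated"
    print(f"item {i:2d}: empirical {b_emp:+.2f}  expert {b_est:+.2f}  ({flag})")
```

In the study itself, student- and expert-based difficulties were calibrated with a multi-facet Rasch model; the raw-logit conversion above is only a rough stand-in for that calibration.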
Acknowledgments
This study was presented at the 7th International Congress on Measurement and Evaluation in Education and Psychology (September 1-4, 2021, Ankara, Turkey).