Comparing Estimated and Real Item Difficulty Using Multi-Facet Rasch Analysis
Year 2023, Volume 14, Issue 4, pp. 440-454, 31.12.2023
Ayfer Sayın, Sebahat Gören
Abstract
This study aimed to compare item difficulty estimated from expert judgment with real item difficulty estimated from response data. Because some high-stakes tests are not pre-tested for security reasons, and because teachers estimate item difficulty when constructing classroom assessments, it is necessary to examine how accurately experts predict item difficulty. For this study, we developed a 12-item test modeled on the Turkish teacher certification exam. Item difficulty was estimated separately from the responses of 1165 students and from the judgments of 12 experts, and the two sets of estimates were compared. The study revealed that the experts estimated item difficulty well for items of moderate difficulty; however, they tended to underestimate the difficulty of items categorized as medium-easy.
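The core comparison can be sketched in a few lines of code. The example below is a simplified illustration, not the study's multi-facet Rasch analysis: it converts each item's proportion-correct into a logit-scale difficulty and compares it with the difficulty implied by averaged expert ratings. The response matrix and expert estimates are hypothetical placeholders generated for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 1165 examinees x 12 items of 0/1 responses,
# and 12 experts' estimated proportion-correct for each item.
n_examinees, n_items, n_experts = 1165, 12, 12
true_p = np.linspace(0.35, 0.85, n_items)  # assumed item easiness, for simulation only
responses = (rng.random((n_examinees, n_items)) < true_p).astype(int)
expert_p = np.clip(true_p + rng.normal(0, 0.08, (n_experts, n_items)), 0.05, 0.95)

def p_to_logit_difficulty(p):
    """Convert proportion-correct to a logit-scale difficulty.

    Higher values mean harder items, matching the direction of Rasch
    item difficulty, though this is not a full Rasch calibration.
    """
    return np.log((1 - p) / p)

empirical_b = p_to_logit_difficulty(responses.mean(axis=0))  # from student data
estimated_b = p_to_logit_difficulty(expert_p).mean(axis=0)   # from expert judgments

# Agreement between the two difficulty orderings, and per-item direction of error:
r = np.corrcoef(empirical_b, estimated_b)[0, 1]
print(f"correlation between estimated and empirical difficulty: {r:.2f}")
for i, (b_emp, b_est) in enumerate(zip(empirical_b, estimated_b), start=1):
    flag = "underestimated" if b_est < b_emp else "overestimated"
    print(f"item {i:2d}: empirical {b_emp:+.2f}  expert {b_est:+.2f}  ({flag})")
```

In the study itself, student- and expert-based difficulties were calibrated with a multi-facet Rasch model; the raw-logit conversion above is only a rough stand-in for that calibration.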
Acknowledgments
This study was presented at the 7th International Congress on Measurement and Evaluation in Education and Psychology (September 1-4, 2021, Ankara, Turkey).