Research Article

The Unit Testlet Dilemma: PISA Sample

Year 2021, Volume: 8 Issue: 3, 613 - 632, 05.09.2021
https://doi.org/10.21449/ijate.948734

Abstract

Testlets have advantages that are widely accepted in the literature, such as making it possible to measure higher-order thinking skills and saving testing time. For this reason, they are often preferred in many settings, from classroom assessments to large-scale assessments. With the increased use of testlets, the following questions have become controversial topics of study: “Is it enough for items to share a common stem to be treated as a testlet?” “Which estimation method should be preferred for assessments that contain this type of item?” “Is there an alternative estimation method for PISA, which consists of this type of item?” In addition, which statistical model should be used to estimate such items, given that they violate the local independence assumption, has become a popular topic of discussion. In light of these discussions, this study aimed to clarify the unit-testlet ambiguity with various item response theory models when testlets consist of mixed item types (dichotomous and polytomous), using the science and mathematics tests of PISA 2018. The findings showed that while the bifactor model fit the data best, the unidimensional model fit nearly as well for both data sets (science and mathematics). The multidimensional IRT model, on the other hand, showed the weakest model fit for both test types. In line with these findings, the methods used to identify testlet items are discussed, and estimation suggestions are made for assessments that use testlets, especially PISA.
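
The analyses in the article rely on dedicated IRT software (the reference list points to flexMIRT and IRTPRO), so the following is only an illustrative sketch, not the authors' procedure: a minimal Python/NumPy implementation of Yen's Q3 statistic, a common screen for the local item dependence that motivates treating a unit as a testlet. The 2PL parameterization and all variable names here are assumptions for illustration; in practice the item and ability estimates would come from the fitted model.

import numpy as np

def q3_matrix(responses, a, b, theta):
    """Pairwise Q3 values: correlations between item residuals after
    removing the score expected under a 2PL model.

    responses : (n_persons, n_items) array of 0/1 item scores
    a, b      : (n_items,) discrimination and difficulty estimates
    theta     : (n_persons,) ability estimates
    """
    # Expected probability of a correct response under the 2PL model
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    residuals = responses - p
    # Column-wise correlations of the residuals (items as variables)
    return np.corrcoef(residuals, rowvar=False)

# Hypothetical use: items that share a unit (stem) and show Q3 values well above
# the average off-diagonal correlation are candidates for a genuine testlet effect.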

References

  • Ackerman, T. A. (1987, April). The robustness of LOGIST and BILOG IRT estimation programs to violations of local independence. Paper presented at the annual meeting of the American Educational Research Association. Washington, DC.
  • Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37-51. https://doi.org/10.1111/j.1745-3992.2003.tb00136.x
  • Akoğlu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93. https://doi.org/10.1016/j.tjem.2018.08.001
  • Baldonado, A. A., Svetina, D., & Gorin, J. (2015). Using necessary information to identify item dependence in passage-based reading comprehension tests. Applied Measurement in Education, 28(3), 202-218. https://doi.org/10.1080/08957347.2015.1042154
  • Bao, H. (2007). Investigating differential item function amplification and cancellation in application of item response testlet models [Doctoral dissertation, University of Maryland]. ProQuest Dissertations and Theses Global.
  • Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
  • Cai, L. (2010). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307-335. https://doi.org/10.3102/1076998609353115
  • Cai, L., du Toit, S. H. C., & Thissen, D. (2015). IRTPRO: Flexible professional item response theory modeling for patient reported outcomes (Version 3.1) [Computer software]. Scientific Software International.
  • Cai, L., & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245-276. https://doi.org/10.1111/j.2044-8317.2012.02050.x
  • Cai, L., & Monroe, S. (2014). A new statistic for evaluating item response theory models for ordinal data. (CRESST Report 839). National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
  • Canivez, G. L. (2016). Bifactor modeling in construct validation of multifactored tests: Implications for understanding multidimensional constructs and test interpretation. In K. Schweizer & C. DiStefano (Eds.). Principles and methods of test construction: Standards and recent advancements (pp. 247-271). Hogrefe Publishers.
  • Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. https://doi.org/10.3102/10769986022003265
  • Chon, K. H., Lee, W., & Ansley, T. N. (2007). Assessing IRT model-data fit for mixed format tests. (CASMA Research Report 26). Center for Advanced Studies in Measurement and Assessment.
  • DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145–168. https://doi.org/10.1111/j.1745-3984.2006.00010.x
  • DeMars, C. E. (2012). Confirming testlet effects. Applied Psychological Measurement, 36, 104–121. https://doi.org/10.1177/0146621612437403
  • Embretson, S., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.
  • Fukuhara, H., & Kamata, A. (2011). A bifactor multidimensional item response theory model for differential item functioning analysis on testlet-based items. Applied Psychological Measurement, 35(8), 604–622. https://doi.org/10.1177/0146621611428447
  • Gibbons, R. D., & Hedeker, D. R. (1992). Full-information bi-factor analysis. Psychometrika, 57, 423–436.
  • Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. SAGE Publications.
  • Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41–54. https://doi.org/10.1007/BF02287965
  • Houts, C. R., & Cai, L. (2013). Flexible multilevel multidimensional item analysis and test scoring [flexMIRT user’s manual version 3.52]. Vector Psychometric Group.
  • Ip, E. H. (2010). Interpretation of the three-parameter testlet response model and information function. Applied Psychological Measurement, 34(7), 467–482. https://doi.org/10.1177/0146621610364975
  • Lee, G., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests composed of testlets. Educational and Psychological Measurement, 61, 958–975. https://doi.org/10.1177/00131640121971590
  • Li, Y., Bolt, D. M., & Fu, J. (2005). A test characteristic curve linking method for the testlet model. Applied Psychological Measurement, 29(5), 340–356. https://doi.org/10.1177/0146621605276678
  • Marais, I. D., & Andrich, D. (2008). Effects of varying magnitude and patterns of local dependence in the unidimensional Rasch model. Journal of Applied Measurement, 9, 105–124.
  • Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2ⁿ contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020. https://doi.org/10.1198/016214504000002069
  • McDonald, R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2), 99–114. https://doi.org/10.1177/01466210022031552
  • Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i–30. https://doi.org/10.1002/j.2333-8504.1992.tb01436.x
  • OECD (2019a). PISA 2018 Mathematics Framework. In PISA 2018 Assessment and Analytical Framework. OECD Publishing. https://doi.org/10.1787/13c8a22c-en
  • OECD (2019b). PISA 2018 Science Framework. In PISA 2018 Assessment and Analytical Framework. OECD Publishing. https://doi.org/10.1787/f30da688-en
  • OECD (2019c). Scaling PISA data. In PISA 2018 Technical Report. OECD Publishing. https://www.oecd.org/pisa/data/pisa2018technicalreport/Ch.09-Scaling-PISA-Data.pdf
  • Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92(6), 544-559. https://doi.org/10.1080/00223891.2010.496477
  • Revelle, W. (2015). psych: Procedures for psychological, psychometric, and personality research [R package]. The Comprehensive R Archive Network.
  • Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237–247. https://doi.org/10.1111/j.1745-3984.1991.tb00356.x
  • Stucky, B. D., & Edelen, M. O. (2014). Using hierarchical IRT models to create unidimensional measures from multidimensional data. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling (pp. 201–224). Routledge.
  • Stucky, B. D., Thissen, D., & Orlando Edelen, M. (2013). Using logistic approximations of marginal trace lines to develop short assessments. Applied Psychological Measurement, 37(1), 41-57. https://doi.org/10.1177/0146621612462759
  • Toland, M. D., Sulis, I., Giambona, F., Porcu, M., & Campbell, J. M. (2017). Introduction to bifactor polytomous item response theory analysis. Journal of School Psychology, 60, 41-63. https://doi.org/10.1016/j.jsp.2016.11.001
  • Tuerlinckx, F., & De Boeck, P. (2001). The effect of ignoring item interactions on the estimated discrimination parameters in item response theory. Psychological Methods, 6(2), 181–195. https://doi.org/10.1037/1082-989X.6.2.181
  • Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245–269). Springer.
  • Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27(1), 1–14. https://doi.org/10.1111/j.1745-3984.1990.tb00730.x
  • Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37(3), 203–220. https://doi.org/10.1111/j.1745-3984.2000.tb01083.x
  • Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126-149. https://doi.org/10.1177/0146621604271053
  • Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213. https://doi.org/10.1111/j.1745-3984.1993.tb00423.x
  • Yılmaz Kogar, E. (2016). Madde takımları içeren testlerde farklı modellerden elde edilen madde ve yetenek parametrelerinin karşılaştırılması [Comparison of item and ability parameters obtained from different models on tests composed of testlets] [Doctoral dissertation, Hacettepe University]. Hacettepe University Libraries, https://avesis.hacettepe.edu.tr/yonetilen-tez/c2ade6a0-6a2d-4147-beb0-8a3feb0642c5/madde-takimlari-iceren-testlerde-farkli-modellerden-elde-edilen-madde-ve-yetenek-parametrelerinin-karsilastirilmasi


Details

Primary Language English
Subjects Studies on Education
Journal Section Articles
Authors

Cansu Ayan 0000-0002-0773-5486

Fulya Barış Pekmezci 0000-0001-6989-512X

Publication Date September 5, 2021
Submission Date September 24, 2020
Published in Issue Year 2021 Volume: 8 Issue: 3

Cite

APA Ayan, C., & Barış Pekmezci, F. (2021). The Unit Testlet Dilemma: PISA Sample. International Journal of Assessment Tools in Education, 8(3), 613-632. https://doi.org/10.21449/ijate.948734
