Abstract
A testlet is a set of items based on a common stimulus. When testlets are used in a test, the local independence assumption may be violated, in which case traditional item response theory models are not appropriate for tests that include testlets. One of the most frequently used models for testlet-based tests is the testlet response theory (TRT) model; the bi-factor model and the traditional 2PL model are also used. This study examines the item parameters estimated by these three calibration models from data generated under different conditions and compares the performance of the models. For this purpose, data were generated by varying three factors: sample size (500, 1000, and 2000), testlet variance (.25, .50, and 1), and testlet size (4 and 10). In every simulation condition, the number of items in the test was fixed at 40, and 100 replications were run per condition. The TRT model gave less biased results than the other two models, but the results of the bi-factor model and the TRT model became more similar as the sample size increased. Among the factors examined, sample size had the greatest effect on parameter recovery.
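To make the simulation design concrete, the following is a minimal sketch of how the fully crossed conditions could be enumerated and how one data set could be generated. It is not the study's actual code. It assumes a 2PL testlet model of the form logit P = a_j(theta_i - b_j - gamma_{i,d(j)}), assumes every item belongs to exactly one testlet, and uses illustrative generating distributions for the item parameters; the function name `simulate_responses` and all distributional choices are hypothetical.

```python
import itertools
import math
import random

# Factor levels taken from the abstract.
SAMPLE_SIZES = (500, 1000, 2000)
TESTLET_VARIANCES = (0.25, 0.50, 1.0)
TESTLET_SIZES = (4, 10)
N_ITEMS = 40
N_REPLICATIONS = 100


def simulate_responses(n_persons, testlet_var, testlet_size,
                       n_items=N_ITEMS, seed=0):
    """Generate a 0/1 response matrix under a 2PL testlet model (sketch)."""
    rng = random.Random(seed)
    # Assumption: all items sit in testlets of equal size.
    n_testlets = n_items // testlet_size
    # Assumption: illustrative generating distributions for a and b.
    a = [rng.uniform(0.8, 2.0) for _ in range(n_items)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    testlet_sd = math.sqrt(testlet_var)

    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # general ability
        # One person-specific effect per testlet, variance = testlet_var.
        gamma = [rng.gauss(0.0, testlet_sd) for _ in range(n_testlets)]
        row = []
        for j in range(n_items):
            d = j // testlet_size  # index of the testlet containing item j
            logit = a[j] * (theta - b[j] - gamma[d])
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Fully crossed design: 3 sample sizes x 3 variances x 2 testlet sizes.
conditions = list(itertools.product(SAMPLE_SIZES, TESTLET_VARIANCES,
                                    TESTLET_SIZES))
print(len(conditions))  # 18
```

In a full study each of the 18 conditions would be replicated `N_REPLICATIONS` times with different seeds, and each generated data set would then be calibrated with the TRT, bi-factor, and 2PL models to compare parameter recovery.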