Item response theory

Table of contents

1 Overview
2 IRT Models
3 Information
4 Estimation
5 A Comparison of Classical and Modern Test Theory

5.1 Scoring

6 A Brief List of References
7 External Links

Overview

Item response theory designates a body of related psychometric theory that predicts outcomes of psychological testing, such as the difficulty of items or the ability of test-takers. Generally speaking, the aim of item response theory is to understand and improve the reliability of psychological tests.

Item response theory is very often referred to by its abbreviation, IRT. IRT may be regarded as roughly synonymous with latent trait theory. It is sometimes referred to using the word strong, as in strong true score theory, or modern, as in modern mental test theory, because IRT is a more recent body of theory and makes stronger assumptions than classical test theory.

IRT Models

Much of the literature on IRT revolves around item response models. These models relate a person parameter (or, in the case of multidimensional item response theory, a vector of person parameters) to one or more item parameters. For example:

$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-Da(\theta - b)}}$$

where $\theta$ is the person parameter and $a$, $b$, and $c$ are item parameters. This logistic model relates the level of the person parameter and the item parameters to the probability of responding correctly. The constant D has the value 1.702, which rescales the logistic function to closely approximate the cumulative normal ogive. (This model was originally developed using the normal ogive, but the logistic model with the rescaling provides virtually the same model while simplifying the computations greatly.)
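As a minimal sketch, assuming the model is the three-parameter logistic form P(theta) = c + (1 - c) / (1 + exp(-D a (theta - b))), the probability of a correct response can be computed as follows. The function names and the item values (a = 1.0, b = 0.0, c = 0.2) are illustrative assumptions, not part of the original text. The last lines check how closely the rescaled logistic curve tracks the normal ogive.

```python
import math

D = 1.702  # scaling constant given in the text


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the assumed 3PL form."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def normal_ogive(z):
    """Cumulative standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def scaled_logistic(z):
    """Logistic function rescaled by D."""
    return 1.0 / (1.0 + math.exp(-D * z))


# Hypothetical item: moderate discrimination, average difficulty,
# and a 0.2 chance of a correct guess.
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(icc_3pl(theta, a=1.0, b=0.0, c=0.2), 3))

# Largest gap between the two curves on a grid from -5 to 5.
max_gap = max(
    abs(normal_ogive(z / 100) - scaled_logistic(z / 100))
    for z in range(-500, 501)
)
print("largest gap:", max_gap)
```

On this grid the two curves never differ by as much as 0.01, which is the sense in which the logistic model is "virtually the same" as the normal ogive model.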

The line that traces the probability for a given item across levels of the trait is called the item characteristic curve (ICC) or, less commonly, the item response function.

The person parameter indicates the individual's standing on the latent trait. The estimate of the person parameter is the individual's test score. The latent trait is the human capacity measured by the test. It might be a cognitive ability, physical ability, skill, knowledge level, attitude, personality characteristic, etc. In a unidimensional model such as the one above, this trait is considered to be a single factor (as in factor analysis). Individual items or individuals might have secondary factors, but these are assumed to be mutually independent and collectively orthogonal.

The item parameters simply determine the shape of the ICC and in some cases may not have a direct interpretation. In this case, however, the parameters are commonly interpreted as follows. The b parameter is considered to index an item's difficulty. Note that this model scales the item's difficulty and the person's trait onto the same metric. Thus it is valid to talk about an item being about as hard as Person A's trait level or of a person's trait level being about the same as Item Y's difficulty. The a parameter controls how steeply the ICC rises and thus indicates the degree to which the item distinguishes individuals with trait levels above and below the rising slope of the ICC. This parameter is thus called the item discrimination and is correlated with the item's loading on the underlying factor, with the item-total correlation, and with the index of discrimination. The final parameter, c, is the asymptote of the ICC on the left-hand side. Thus it indicates the probability that very low ability individuals will get this item correct by chance.
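These interpretations can be read off the curve directly. The short derivation below is a sketch that assumes the three-parameter logistic form with scaling constant D; it shows the lower asymptote, the probability when the trait level equals the item difficulty, and the slope of the ICC at that point.

```latex
\begin{align*}
P(\theta) &= c + \frac{1 - c}{1 + e^{-Da(\theta - b)}} \\
\lim_{\theta \to -\infty} P(\theta) &= c \\
P(b) &= c + \frac{1 - c}{2} = \frac{1 + c}{2} \\
\left.\frac{dP}{d\theta}\right|_{\theta = b} &= \frac{Da(1 - c)}{4}
\end{align*}
```

For a hypothetical item with a = 1, b = 0, and c = 0.2, a person whose trait level equals the item difficulty answers correctly with probability 0.6, and the curve rises there at a rate of 1.702 × 0.8 / 4 = 0.3404 per unit of the trait. Doubling a doubles that slope, which is why a is read as discrimination.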

This model assumes a single trait dimension and a binary outcome; it is a dichotomous, unidimensional model. Another class of models predicts polytomous outcomes, and a further class of models exists to predict response data that arise from multiple traits.

Note to reader: Below here, this article is still very much under construction

Information

One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And traditionally, it is measured using a single index defined in various ways, such as the ratio of true score variance to observed score variance. This index is helpful in characterizing a test's average reliability, for example in order to compare two tests. But it is clear that reliability cannot be uniform across the entire range of test scores. Scores at the edges of the test's range, for example, are known to have more error than scores closer to the middle.

Item response theory advances the concept of item and test information to replace reliability. Information is a function that varies across the scale. In general, information functions tend to look "bell-shaped", although test information functions are much more variable than item information functions. Information is the reciprocal of the squared standard error of measurement at a given trait level; equivalently, the standard error is the reciprocal of the square root of the information. Thus more information implies less error of measurement. Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range. Because of local independence, item information functions are additive. Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.
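The additivity property can be sketched in a few lines. The example below assumes the three-parameter logistic model with D = 1.702 and uses the standard item information formula for that model, I(theta) = (D a)^2 (Q / P) ((P - c) / (1 - c))^2 with Q = 1 - P. The four items and all function names are hypothetical, chosen only to contrast a highly discriminating item with a weakly discriminating one.

```python
import math

D = 1.702  # scaling constant given in the text


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the assumed 3PL form."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information for the assumed 3PL model at trait level theta."""
    p = icc_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items):
    """Test information: the sum of the item information functions."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def standard_error(theta, items):
    """Standard error of measurement: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))


# Hypothetical items as (a, b, c) tuples.
items = [
    (1.8, 0.0, 0.2),   # highly discriminating
    (0.6, 0.0, 0.2),   # weakly discriminating
    (1.0, -1.0, 0.2),
    (1.0, 1.0, 0.2),
]

for theta in (-3, -1, 0, 1, 3):
    print(theta,
          round(total_information(theta, items), 3),
          round(standard_error(theta, items), 3))
```

With these values the first item carries more information than the second at theta = 0, while the second carries more at theta = 3, matching the "tall and narrow" versus "low and wide" description.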

Estimation

A Comparison of Classical and Modern Test Theory

Scoring

After the model is fit to data, each person has a theta estimate. This estimate is their score on the exam. This "IRT score" is computed and interpreted very differently from traditional scores like number or percent correct. However, for most tests, the (linear) correlation between the theta estimate and a traditional score is very high (e.g., .95). A graph of IRT scores against traditional scores shows an ogive shape, implying that the IRT score is somewhat better at separating individuals with low or high trait standing.
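One way a theta estimate can be obtained is by maximum likelihood once the item parameters are treated as known. The sketch below is only an illustration under that assumption: it uses the three-parameter logistic model with D = 1.702, five hypothetical items, and a plain grid search in place of the numerical optimizers that operational scoring programs use. All names are made up for the example.

```python
import math

D = 1.702  # scaling constant given in the text


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the assumed 3PL form."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = icc_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def estimate_theta(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of theta on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Hypothetical items as (a, b, c), ordered from easy to hard.
items = [(1.0, -2.0, 0.2), (1.0, -1.0, 0.2), (1.0, 0.0, 0.2),
         (1.0, 1.0, 0.2), (1.0, 2.0, 0.2)]

low_scorer = estimate_theta([1, 1, 0, 0, 0], items)
high_scorer = estimate_theta([1, 1, 1, 1, 0], items)
print(low_scorer, high_scorer)
```

A pattern with every item correct (or every item incorrect) has no finite maximum likelihood estimate, so this grid search simply returns the edge of the grid for such patterns; operational programs typically handle those cases with priors or other conventions.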

It is worth noting the implications of IRT for test-takers. Tests are imprecise tools and the score achieved by an individual (the observed score) is always the true score occluded by some degree of error. This error may push the observed score higher or lower.

Also, nothing about these models refutes human development or improvement. A person may learn skills, knowledge, or even so-called "test-taking skills", which may translate to a higher true score.

A Brief List of References

Many books have been written that address item response theory or contain IRT or IRT-like models. This is a partial list, focusing on texts that provide more depth.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum.

This book summarizes much of Lord's IRT work, including chapters on the relationship between IRT and classical methods, fundamentals of IRT, estimation, and several advanced topics. Its estimation chapter is now dated in that it primarily discusses the joint maximum likelihood method rather than the marginal maximum likelihood method implemented by Darrell Bock and his colleagues.