The Mismeasure of Students: Using Item Response Theory Instead of Traditional Grading to Assess Student Proficiency

Imagine for a second that you’re teaching a math remediation course full of fourth graders. You’ve just administered a test with 10 questions. Of those 10 questions, two questions are trivial, two are incredibly hard, and the rest are equally difficult. Now imagine that two of your students take this test and answer nine of the 10 questions correctly. The first student answers an easy question incorrectly, while the second answers a hard question incorrectly. How would you try to identify the student with higher ability?

Under a traditional grading approach, you would assign both students a score of 90 out of 100, grant both of them an A, and move on to the next test. This approach illustrates a key problem with measuring student ability via testing instruments: test questions do not have uniform characteristics. So how can we measure student ability while accounting for differences in questions?

Item response theory (IRT) attempts to model student ability using question-level performance instead of aggregate test-level performance. Instead of assuming all questions contribute equally to our understanding of a student’s abilities, IRT provides a more nuanced view of the information each question provides about a student. What kinds of features can a question have? Let’s consider some examples.

First, think back to an exam you have previously taken. Sometimes you breeze through the first section, work through a second section of questions, then battle with a final section until the exam ends. In the traditional grading paradigm described earlier, a correct answer on the first section would count just as much as a correct answer on the final section, despite the fact that the first section is easier than the last! Similarly, a student demonstrates greater ability as she answers harder questions correctly; the traditional grading scheme, however, completely ignores each question’s difficulty when grading students!

The one-parameter logistic (1PL) IRT model attempts to address this by allowing each question to have an independent difficulty variable. It models the probability of a correct answer using the following logistic function:

P_j(correct | θ) = 1 / (1 + e^(−(θ − β_j)))

where j represents the question of interest, θ (theta) is the current student’s ability, and β_j (beta) is item j’s difficulty. This function is also known as the item response function. We can examine its plot (with different values of beta) below to confirm a couple of things:

  1. For a given ability level, the probability of a correct answer increases as item difficulty decreases. It follows that, between two questions, the question with a lower beta value is easier.
  2. Similarly, for a given question difficulty level, the probability of a correct answer increases as student ability increases. In fact, the curves displayed above take a sigmoidal form, thus implying that the probability of a correct answer increases monotonically as student ability increases.
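Both properties are easy to verify numerically. Here’s a minimal sketch in plain Python (no IRT library assumed; the function name is ours):

```python
import math

def irf_1pl(theta, beta):
    """1PL item response function: probability that a student of
    ability theta answers an item of difficulty beta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# 1. At fixed ability, the lower-beta (easier) item is more likely correct.
assert irf_1pl(0.0, -1.0) > irf_1pl(0.0, 1.0)

# 2. At fixed difficulty, probability rises monotonically with ability.
probs = [irf_1pl(t, 0.0) for t in (-2, -1, 0, 1, 2)]
assert probs == sorted(probs)
```

Note that a student whose ability exactly matches the item’s difficulty (θ = β) has a 50% chance of answering correctly; the difficulty parameter simply shifts the sigmoid left or right along the ability axis.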

Now consider using the 1PL model to analyze test responses provided by a group of students. If one student answers a single question, we can only draw information about that student’s ability from that question. Now imagine a second student answers the same question as well as a second question, as illustrated below.

We immediately have the following additional information about both students and both test questions:

  1. We now know more about student 2’s ability relative to student 1 based on student 2’s answer to the first question. For example, if student 1 answered correctly and student 2 answered incorrectly we know that student 1’s ability is greater than student 2’s ability.
  2. We also know more about the first question’s difficulty after student 2 answered the second question. Continuing the example from above, if student 2 answers the second question correctly, we know that Q1 likely has a higher difficulty than Q2 does.
  3. Most importantly, however, we now know more about the first student! Continuing the example even further, we now know that Q1 is more difficult than initially expected. Student 1 answered the first question correctly, suggesting that student 1 has greater ability than we initially estimated!

This form of message passing via item parameters is the key distinction between IRT’s estimates of student ability and other naive approaches (like the grading scheme described earlier). Interestingly, it also suggests that one could develop an online version of IRT that updates ability estimates as more questions and answers arrive!
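To sketch what that online flavor could look like, here is a toy re-estimation step in plain Python (names and the gradient-ascent approach are ours, not a specific library): holding the item difficulties fixed, we re-fit a student’s ability by maximum likelihood each time a new response arrives.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def estimate_ability(responses, difficulties, steps=500, lr=0.1):
    """Maximum-likelihood ability estimate under the 1PL model,
    treating item difficulties as known and fixed.
    responses: 1 for a correct answer, 0 for incorrect, one per item."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the Bernoulli log-likelihood w.r.t. theta: sum(y - p).
        grad = sum(y - sigmoid(theta - b)
                   for y, b in zip(responses, difficulties))
        theta += lr * grad
    return theta

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
# A fourth correct answer (on a harder item) raises the ability estimate.
theta_before = estimate_ability([1, 1, 1, 0, 0], difficulties)
theta_after = estimate_ability([1, 1, 1, 1, 0], difficulties)
assert theta_after > theta_before
```

In a real system the item parameters would be re-estimated too (that’s the message passing), but even this one-sided update shows how each new response refines the ability estimate.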

But let’s not get ahead of ourselves. Instead, let’s continue to develop item response theory by considering the fact that students of all ability levels might have the same probability of correctly answering a poorly written question. When discussing IRT models, we say that these questions have a low discrimination value, since they do not discriminate between high- and low-ability students. Ideally, a good question (i.e. one with high discrimination) will maximally separate students into two groups: those with the ability to answer correctly, and those without.

This gets at an important point about test questions: some questions do a better job than others of distinguishing between students of similar abilities. The two-parameter logistic (2PL) IRT model incorporates this idea by attempting to model each item’s level of discrimination between high- and low-ability students. This can be expressed as a simple tweak to the 1PL:

P_j(correct | θ) = 1 / (1 + e^(−α_j(θ − β_j)))
How does the addition of alpha, the item discrimination parameter, affect our model? As above, we can take a look at the item response function while changing alpha a bit:


As previously stated, items with high discrimination values can distinguish between students of similar ability. If we’re attempting to compare students with abilities near zero, a higher discrimination sharply decreases the probability that a student with ability < 0 will answer correctly, and increases the probability that a student with ability > 0 will answer correctly.
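A quick numeric check makes this concrete (again a plain-Python sketch, not a particular IRT package):

```python
import math

def irf_2pl(theta, alpha, beta):
    """2PL item response function with discrimination alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# For an item centered at beta = 0, raising discrimination pushes a
# below-average student's success probability down and an above-average
# student's probability up, sharpening the separation around theta = 0.
low, high = 0.5, 2.0
assert irf_2pl(-1.0, high, 0.0) < irf_2pl(-1.0, low, 0.0)
assert irf_2pl(+1.0, high, 0.0) > irf_2pl(+1.0, low, 0.0)
```

Geometrically, α controls the slope of the sigmoid at its midpoint: the higher the discrimination, the steeper the curve, and the more a single response tells us about which side of β the student’s ability falls on.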

We can even go a step further here, and state that an adaptive test could use a bank of high-discrimination questions of varying difficulty to optimally identify a student’s abilities. As a student answers each of these high-discrimination questions, we could choose a harder question if the student answers correctly (and vice versa). In fact, one could even identify the student’s exact ability level via binary search, if the student is willing to work through a test bank with an infinite number of high-discrimination questions with varying difficulty!
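Under those idealized assumptions (a deterministic student and perfectly discriminating items at any difficulty we ask for), the adaptive test literally is a binary search. A sketch:

```python
def adaptive_estimate(answers_correctly, lo=-4.0, hi=4.0, rounds=20):
    """Binary-search sketch of an idealized adaptive test.
    answers_correctly(difficulty) simulates the student's response to an
    item of the given difficulty; each answer halves the ability interval."""
    for _ in range(rounds):
        mid = (lo + hi) / 2.0
        if answers_correctly(mid):
            lo = mid  # correct: ability is at least this difficulty
        else:
            hi = mid  # incorrect: ability is below this difficulty
    return (lo + hi) / 2.0

# Deterministic student with true ability 1.3: answers correctly
# exactly when the item is no harder than that ability.
estimate = adaptive_estimate(lambda difficulty: difficulty <= 1.3)
assert abs(estimate - 1.3) < 1e-3
```

Real adaptive tests can’t assume deterministic students or infinite item banks, so they instead pick the next item to maximize expected information about θ, but the halving intuition is the same.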


Of course, the above scenario is not completely true to reality. Sometimes students will identify the correct answer by simply guessing! We know that correct answers can result from concept mastery or from filling in a Scantron like a Christmas tree. Additionally, students can increase their odds of guessing a question correctly by eliminating answers that are obviously wrong. We can thus model each question’s “guessability” with the three-parameter logistic (3PL) IRT model. The 3PL’s item response function looks like this:

P_j(correct | θ) = χ_j + (1 − χ_j) / (1 + e^(−α_j(θ − β_j)))
where χ (chi) represents the item’s “pseudoguess” value. Chi is not considered a pure guessing value, since students can use some strategy or knowledge to eliminate bad guesses. Thus, while a “pure guess” would be the reciprocal of the number of options (e.g. a student has a one-in-four chance of guessing the answer to a multiple-choice question with four options), those odds may increase if the student manages to eliminate an answer (e.g. that same student increases her guessing odds to one-in-three if she knows one option isn’t correct).

As before, let’s take a look at how the pseudoguess parameter affects the item response function curve:


Note that students of low ability now have a higher probability of guessing the question’s answer. This is also clear from the 3PL’s item response function (chi is an additive term and the second term is non-negative, so the probability of answering correctly is at least as high as chi). Note that there are a few general concerns in the IRT literature regarding the 3PL, especially regarding whether an item’s “guessability” is instead a part of a student’s “testing wisdom,” which arguably represents some kind of student ability.
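A small numeric sketch (plain Python, using a hypothetical four-option item with χ = 0.25) shows that lower asymptote directly:

```python
import math

def irf_3pl(theta, alpha, beta, chi):
    """3PL item response function with pseudoguess floor chi."""
    return chi + (1.0 - chi) / (1.0 + math.exp(-alpha * (theta - beta)))

# Even a very low-ability student retains at least a chi chance of success,
# while a high-ability student still approaches (but never exceeds) 1.
p_low = irf_3pl(-6.0, 1.0, 0.0, 0.25)
p_high = irf_3pl(+6.0, 1.0, 0.0, 0.25)
assert 0.25 < p_low < 0.26
assert 0.99 < p_high < 1.0
```

In other words, χ lifts the left tail of the sigmoid from 0 up to χ, while the right tail still saturates at 1.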

Regardless, at Knewton we’ve found IRT models to be extremely helpful when trying to understand our students’ abilities by examining their test performance.


de Ayala, R. J. (2008). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press.
Kim, J.-S., & Bolt, D. M. (2007). “Estimating Item Response Theory Models Using Markov Chain Monte Carlo Methods.” Educational Measurement: Issues and Practice 38 (51).
Sheng, Y. (2008). “Markov Chain Monte Carlo Estimation of Normal Ogive IRT Models in MATLAB.” Journal of Statistical Software 25 (8).

Also, thanks to Jesse St. Charles, George Davis, and Christina Yu for their helpful feedback on this post!

What's this? You're reading N choose K, the Knewton tech blog. We're crafting the Knewton Adaptive Learning Platform that uses data from millions of students to continuously personalize the presentation of educational content according to learners' needs. Sound interesting? We're hiring.

21 thoughts on “The Mismeasure of Students: Using Item Response Theory Instead of Traditional Grading to Assess Student Proficiency”

  1. how do you account for “trivial” errors? (maybe i missed something by skimming through)

    • What do you mean by “trivial” errors? Are these errors in grading, or student’s errors on questions that might be “trivial?”

    • That’s why a good IRT assessment would have enough questions at each given level that if a test taker missed one because he/she wasn’t paying attention, there would still be an opportunity to demonstrate proficiency at that level. The idea is that if a person consistently misses something, maybe there is an issue that needs to be addressed. But having said that, no test is perfect, and that’s why basing your knowledge of a person’s ability only on a test score would not be an ideal approach.

  2. I think the author of this paper has little or no experience writing and grading exams in the real world. People are not machines, and answers to problems aren’t just wrong or right. There is no rubric, no matter how carefully defined, that can’t be defied once students actually take the exam. Even the opening example is flawed. Why would someone who wrote an exam make the easy questions worth the same as the hard ones? This type of approach to grading only works with multiple-choice exams or similar, and it’s been clearly documented how poor those already are at testing student ability.

    • You’re 100% right–as a newly-enrolled grad student, I don’t have experience grading exams in the real world! Two things I’d like to address, though:

      1. The opening example was overly simplified, but consider the case where questions are assigned points based on their “difficulty.” Should the professor determine a question’s difficulty in this scenario? More importantly, can a professor accurately assess that question’s difficulty? You question whether an answer can be “absolutely correct,” and I’d similarly argue that a professor would struggle to identify a question’s “absolute difficulty” and assign the right number of points to a correct answer. Maybe the professor thinks a particular question has an obvious answer, but the curriculum has not prepared students to answer correctly!

      By using IRT models to analyze test results, however, we can estimate question difficulty and better inform partial grading decisions. That’s the beauty of IRT models: they can yield information about any part of the testing process, and not just student ability.

      2. I agree with the assertion about IRT applying only to multiple choice exams, since it was developed with those kinds of exams (as opposed to open-ended, essay-type instruments) in mind. Still, there’s a question about the causality behind multiple choice exams and ability measurement. Do those exams struggle to measure student ability because they’re poor tests of ability, or is the traditional grading process failing us in some way?

      Thanks for your thoughts!

    • You’re right, people are not machines; however, a good assessment provides enough questions (and these questions must have been validated to be used as a measure) to provide a reasonable estimate of a person’s ability.

      And no, the opening example is not flawed. It is a simplification of the approach to grading in Classical Test Theory (CTT). Likewise, keep in mind that this explanation of Item Response Theory (IRT) is also simplified for those of us who are not psychometricians. There is an entire methodology that goes into determining what makes a question “easy” or “difficult,” and that makes it possible to create a rubric to identify correct answers. A great resource (if you’re interested) is Assessing the Complexity of Literacy Tasks by Julian Evetts.

  3. Pingback: Weekly links for June 10 « God plays dice

  4. Cranky professor – Multiple choice being a poor estimator of student ability would be news to those at ETS. I think it’s worth giving the author of this blog the benefit of the doubt since Item Response Theory is pretty well established and widely used.

  5. I remember taking the computerized GRE in 1999 when they still used the per-question adaptive test. They did something similar to what you’re proposing: as you answer the introductory/trivial questions correctly, it starts giving harder questions until full competency is calculated.

    Back then the ‘Analytics’ section was logic puzzles, instead of the current written exam. I remember towards the end of the analytics portion, there was a search path 8 to 9 levels deep necessary to find the correct combination, unlike any of the practice exam questions I had seen. I ended up working through the problem in about 5-6 minutes.

    When I was done, and thought about how hard that problem was along with the fact that I knew that it was an adaptive exam, I thought to myself “I must have done pretty good!”. It was a strange feedback loop.

  6. I have a question now. Suppose a multiple-choice exam is administered over a 7-8 day window, and each candidate chooses a single date within that window to take it.

    When should a candidate take the exam? On a date when fewer candidates are testing, or on a date when the majority of candidates are testing? (Most people select the middle or later dates of the window.)

    Also, a single combined result will be prepared at the end of the exam period, so could the choice of date cause some variation in the result?

    Eager to know.
    Thanks in advance.

    • The date of the exam should not matter. The item responses are aggregated and run through an IRT model at the end of the exam period. The only difference this would make is if the examinee prefers being around fewer or more people. The item parameters and items will remain the same.

  7. In an IRT-scored multiple-choice exam with no negative marking, what should one do to get a good score? Attempt all questions or not? Leave just 5 to 7 questions out of 300 unanswered? Or attempt only those one can answer with full confidence?

    dhruv patel

    • Does the number of incorrect responses due to guessing matter when the analysis is done through IRT?

    • Depends on what type of test is being run. If it is a CAT (computerized adaptive testing) exam, then focus on the first items and then move quickly through the rest. If it is a paper-based test or an online non-CAT test, then try to answer all of the items and guess on the ones you don’t know.

  8. How do you score someone using IRT? Are you using the most likely value of ability, i.e. the one that fits the data best? Doesn’t this mean that you don’t know what a test taker’s score should be; you are just taking the most likely value?

  9. A practical problem: in an exam of 300 multiple-choice questions, we encounter 20-30 questions that are absolute guesswork for all the participants (all 20-30 are facts based on numbers/statistical figures that none of the students know, so everyone will guess blindly without any clue). Which is better: to take a blind shot at them, or to leave all 20-30 questions unanswered?
