ABSTRACT
In any examination, it is important that a sufficient mix of items with varying degrees of difficulty be present to produce desirable psychometric properties and increase instructors' ability to make appropriate and accurate inferences about what a student knows and/or can do. The purpose of this “teaching tip” is to demonstrate how examination items can be affected by the quality of distractors, and to present a simple method for adjusting items to meet difficulty specifications.
Item difficulty is a very important statistic in testing and assessment. Item difficulty, typically called a p value, refers to the percentage of students who answered a given item correctly. Higher values (e.g., .80, .90) indicate easier items, whereas lower values (e.g., .40, .50) indicate more difficult items. In any examination, it is important that a sufficient mix of items with varying degrees of difficulty be present to produce desirable psychometric properties and increase instructors' ability to make appropriate and accurate inferences about what a student knows and/or can do. For example, if an examination consists entirely of easy items, most students will be expected to perform exceptionally well. While students' scores will indicate high levels of performance, it is difficult to know whether students would demonstrate similarly strong performance if the examination consisted of more difficult items. Thus, a very easy examination undermines our ability to make valid inferences about what a student truly knows.

Relatedly, important psychometric properties such as reliability and item discrimination may also be affected. Very easy items typically produce homogeneous scores and make it difficult to produce reliable distinctions among examinees.1 Items that are too easy may also present low discrimination values. Because discrimination values are typically computed by comparing the performance of the upper and lower 27% of examinees, the statistic has a tendency to produce misleading results to the undiscerning eye when “hard and fast” rules are applied, such as “any value below .20 [or .30] is undesirable.” Further, if item response theory (IRT) methods were used to score an examination, any positive point-biserial (or point-measure) correlation would be considered acceptable.2 For these reasons, item statistics should be investigated thoroughly and with careful attention to detail; otherwise, many quality items may be discarded because of perceived flaws.
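The two statistics discussed above can be illustrated with a short sketch. The following Python snippet is a minimal illustration only, assuming 0/1 item scores and using the upper–lower 27% grouping convention mentioned above; all function names and data are hypothetical, not drawn from any standard package.

```python
# Minimal sketch (illustrative only): computing an item's difficulty (p value)
# and an upper-lower discrimination index from 0/1 item scores, using the
# 27% extreme-group convention described in the text.

def item_difficulty(scores):
    """Proportion of examinees who answered the item correctly (the p value)."""
    return sum(scores) / len(scores)

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Difference in p value between the top and bottom 27% of examinees,
    where 'top' and 'bottom' are defined by total examination score."""
    n = len(item_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = [item_scores[i] for i in order[:k]]   # weakest k examinees
    upper = [item_scores[i] for i in order[-k:]]  # strongest k examinees
    return item_difficulty(upper) - item_difficulty(lower)

# Hypothetical data: one item's 0/1 scores and total test scores for 10 examinees.
item = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
totals = [55, 72, 40, 88, 90, 35, 60, 95, 80, 30]
print(item_difficulty(item))                # 0.7 -> a fairly easy item
print(discrimination_index(item, totals))   # 1.0 -> the strongest examinees all succeed
```

In this toy example the three examinees with the lowest totals all missed the item and the three with the highest totals all answered it correctly, so the index reaches its maximum of 1.0.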
Of course, designing an examination that consists entirely of excessively difficult items is generally also undesirable. Examinations that are too difficult may result in most students performing poorly, which presents many of the same problems noted above with regard to accurate inferences, poor reliability estimates, and poor item discrimination values. For norm-referenced examinations, a common guideline is that the average item should have a difficulty estimate of .50, with most items ranging between .30 and .70 in difficulty so as to better differentiate students' abilities.3 However, most medical education classroom assessments are criterion-referenced,4 so these guidelines generally do not apply to most veterinary educators. This is primarily because item statistics resulting from classroom instruction are significantly influenced by learning, instruction quality, instructional sensitivity effects,5 and potentially test-wiseness strategies.6 The author purposefully stops short of suggesting difficulty guidelines for criterion-referenced assessments, as so many factors are involved that any resulting guidelines might be irresponsible. Nonetheless, it is clear that items generally should vary in difficulty to such a degree that accurate and reliable distinctions can be made between score results, as this is a major hallmark of validity evidence.7
Discarding flawed or poorly functioning items and producing entirely new items can be very laborious for a veterinary educator. As a general rule of thumb, it is recommended that veterinary educators first attempt to revise an item before discarding it entirely. Revising items typically is far easier and far less time-consuming than generating brand new items. The primary method recommended for revising items so as to make them more or less difficult is the practice of nudging and shoving.8
Nudging and shoving refer to the practice of altering item distractors so as to make an item easier or more difficult. A nudge refers to a minor change to an item, such as altering a single distractor. A shove refers to a significant change to an item, such as revising two or more distractors.
To demonstrate how items can be affected by the quality of distractors, consider the following simple example9 of how a multiple-choice question could be asked in different ways to yield very different person and item performance statistics.
Question 1: Who was the seventeenth president of the United States?
- Abraham Lincoln
- Andrew Johnson
- Ulysses S. Grant
- Millard Fillmore

About 30% of middle school students would be expected to answer this item correctly.
Question 2: Who was the seventeenth president of the United States?
- George Washington
- Andrew Johnson
- Jimmy Carter
- Bill Clinton

About 60% of middle school students would be expected to answer this item correctly.
Question 3: Who was the seventeenth president of the United States?
- The War of 1812
- Andrew Johnson
- The Louisiana Purchase
- A Crazy Day for Sally

About 90% of middle school students would be expected to answer this item correctly.
Each item asks the same question but presents distractors with varying degrees of plausibility. Question 1 presents four plausible answers, as Lincoln was the 16th president, Johnson the 17th, Grant the 18th, and Fillmore the 13th. Question 2 presents only one truly plausible answer, as students will likely know Washington was the first president, while Carter and Clinton served too recently (39th and 42nd, respectively). Question 3 contains only one actual person, so the odds of success on this item are exceptionally high.
The plausibility of the distractors had a significant impact on each item's difficulty level. In each example, the correct answer remained but all three distractors were replaced. The changes made to each set of items would be indicative of a shove, as two or more distractors were altered to make the item easier or more difficult. It should be noted that best practices for item writing suggest only plausible distractors should be used.10 Having said that, it is highly unlikely that veterinary education faculty will administer items of such discernibly poor quality as Question 3 to students. Thus, it is important to acknowledge that a shove does not always result in a dramatic change to an item's performance, as this is largely contingent upon the quality of the revised distractors. Similarly, it is entirely plausible that a nudge could result in a dramatic change to an item's functioning. The terms nudge and shove simply refer to the number of distractors that were revised, and not necessarily the changes that may result with regard to item performance. The following section provides two examples of how veterinary educators may recognize when a nudge or a shove is appropriate.
Consider first an item with a difficulty value of .82 (82% of students answered the item correctly) and a discrimination value of .44 (indicating good discrimination). Despite the reasonable difficulty and discrimination values, the item potentially suffers from a poor distractor, as no students selected option E.
| Response | Count | Percentage |
|---|---|---|
| A | 7 | 7.1 |
| B* | 81 | 81.8 |
| C | 6 | 6.1 |
| D | 5 | 5.1 |
| E | 0 | 0.0 |
*Correct answer
Based on this information, the instructor should be advised to revisit this item and distractor E. If the instructor, who should also be a content expert on this topic, is committed to providing five response options and agrees the distractor could be improved, then this item would be a good candidate for a nudge.
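The flag raised here, a distractor that attracts essentially no examinees, can be checked mechanically. The following Python sketch is an assumption-laden illustration (the 5% threshold and all names are the author's-note-style conveniences, not an established standard), applied to the option counts tabled above.

```python
# Illustrative distractor check (the 5% threshold and names are assumptions):
# compute the item's p value and flag distractors chosen by under 5% of examinees.

def distractor_report(counts, key, flag_below=0.05):
    """Return (p value, list of under-selected distractors) from option counts."""
    n = sum(counts.values())
    p_value = counts[key] / n
    flagged = [opt for opt, c in counts.items()
               if opt != key and c / n < flag_below]
    return p_value, flagged

# Option counts from the table above (B is the keyed, correct answer).
counts = {"A": 7, "B": 81, "C": 6, "D": 5, "E": 0}
p, weak = distractor_report(counts, key="B")
print(round(p, 2))  # 0.82
print(weak)         # ['E'] -- only E falls below the 5% threshold
```

With these counts, options A, C, and D each attract at least 5% of examinees and so are not flagged; only option E, which no one selected, is returned as a revision candidate.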
Now consider a second item with a difficulty value of .75 (75% of students answered the item correctly) and a discrimination value of .51 (indicating good discrimination). Again, despite the reasonable difficulty and discrimination values, the item potentially suffers from poor distractors, as only one student selected option D and no students selected option E.
| Response | Count | Percentage |
|---|---|---|
| A | 19 | 19.2 |
| B* | 74 | 74.7 |
| C | 5 | 5.1 |
| D | 1 | 1.0 |
| E | 0 | 0.0 |
*Correct answer
Based on this information, the instructor should be advised to revisit this item and pay particular attention to distractors D and E. If the instructor is committed to providing five response options and feels the item could be improved, then this item would be a good candidate for a shove.
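The decision rule implicit in these two examples, where one weak distractor suggests a nudge and two or more suggest a shove, can be written out as a tiny heuristic. This is a sketch under stated assumptions (a 5% selection threshold, hypothetical names), and any real decision should also weigh instructional factors such as how well the content was taught.

```python
# Sketch of the nudge-versus-shove heuristic described in the text: count how
# many distractors fall below a selection threshold. One weak distractor
# suggests a nudge; two or more suggest a shove. The 5% cutoff is an assumption.

def suggest_revision(counts, key, flag_below=0.05):
    """Classify an item as a nudge or shove candidate from option counts."""
    n = sum(counts.values())
    weak = [opt for opt, c in counts.items()
            if opt != key and c / n < flag_below]
    if not weak:
        return "no revision indicated"
    return "nudge" if len(weak) == 1 else "shove"

# The two response distributions from the tables above (B is the keyed answer).
print(suggest_revision({"A": 7, "B": 81, "C": 6, "D": 5, "E": 0}, "B"))   # nudge
print(suggest_revision({"A": 19, "B": 74, "C": 5, "D": 1, "E": 0}, "B"))  # shove
```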
It is important to note that many factors are involved in any testing situation, so each item must be investigated with several considerations in mind. A responsible review should not result in item alterations on the basis of item statistics alone. Factors such as instructional sensitivity, which refers to the extent to which students' responses are likely influenced by their familiarity with the item and/or its content, are of critical importance. For example, it may be that no one selected a distractor because the content was taught exceptionally well. Alternatively, it may be that no one selected a particular distractor because it was implausible and students were immediately able to dismiss it as a viable option. This underscores why it is critical for the instructor responsible for content delivery to review an item's psychometric performance, as this individual will be most familiar with the item's origin and how the content was conveyed to students.
Creating examinations with varying degrees of difficulty is important for several reasons. However, the practice of altering an item to make it more or less difficult can be quite tricky. This article illustrates how item distractors can have a significant influence on an item's difficulty value, and introduces a method for potentially altering an item's difficulty value. Veterinary educators are encouraged to investigate examination items containing poor distractors, even those with desirable psychometric properties, and to consider providing a nudge or a shove to items where appropriate to help improve the overall psychometric properties (e.g., reliability, validity) of the examination.
REFERENCES
1. Royal KD. Understanding reliability in higher education student learning outcomes assessment. Qual Approach High Educ. 2011;2(2):8–15.
2. Royal KD, Gilliland KO, Kernick ET. Using Rasch measurement to score, evaluate, and improve examinations in an anatomy course. Anat Sci Educ. 2014;7(6):450–60. http://dx.doi.org/10.1002/ase.1436. Medline:24431324
3. Miller MD, Linn RL, Gronlund NE, editors. Measurement and assessment in teaching. 10th ed. Upper Saddle River, NJ: Prentice Hall; 2009.
4. Royal KD, Guskey TR. On the appropriateness of norm- and criterion-referenced assessments in medical education. Ear Nose Throat J. 2015;94(7):252, 254.
5. Haladyna T, Roid G. The role of instructional sensitivity in the empirical review of criterion-referenced test items. J Educ Meas. 1981;18(1):39–53. http://dx.doi.org/10.1111/j.1745-3984.1981.tb00841.x
6. Royal KD, Hedgpeth MW. A novel method for evaluating examination item quality. Int J Psychol Stud. 2015;7(1):17–22. http://dx.doi.org/10.5539/ijps.v7n1p17
7. Messick S. Validity. In: Linn RL, editor. Educational measurement. 3rd ed. New York, NY: Macmillan; 1989. p. 13–103.
8. Sutherland KA, Stahl JA, Woo A. Improving item bank deficits by modifying existing items: a nudge versus a shove. The American Educational Research Association Annual Meeting; 2014 April 3–7; Philadelphia, PA.
9. O'Donnell AM, Reeve J, Smith JK. Educational psychology: reflection for action. 3rd ed. Wiley; 2011.
10. Haladyna TM. Developing and validating multiple-choice test items. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates, Inc; 1999.