Presented by:
- Fifin Naili Riskiyah
- Justsinta Sindi Alivi
- Sitti Fatimah Saleng
- Uliya Nafida

Item Response Theory (IRT)
1. Differences between Classical Test Theory (CTT) and Item Response Theory (IRT)
| Area | CTT | IRT |
| --- | --- | --- |
| Model | Linear | Non-linear |
| Level | Test | Item |
| Assumptions | Weak (e.g., easy to meet with test data) | Strong (e.g., more difficult to meet with test data) |
| Item-ability relationship | Not specified | Item characteristic functions |
| Ability | Test scores or estimated true scores are reported on the test-score scale (or a transformed test-score scale) | Ability scores are reported on the scale -∞ to +∞ (or a transformed scale) |
| Invariance of item and person statistics | No: item and person parameters are sample dependent | Yes: item and person parameters are sample independent, if the model fits the test data |
| Item statistics | p, r | b, a, and c (for the three-parameter model), plus corresponding item information functions |
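For reference, the b, a, and c in the last row are the parameters of the three-parameter logistic (3PL) model. The formula below is the standard 3PL item characteristic function; the example values are made up for illustration:

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) item characteristic function:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))),
    where a is discrimination, b is difficulty, and c is the
    pseudo-guessing lower asymptote."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# An examinee of average ability (theta = 0) meeting an average item
# (b = 0) with a 0.2 guessing floor has a 0.6 chance of success:
print(icc_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))  # 0.6
```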
2. Advantages and Disadvantages of IRT
The advantages of IRT:
- Item statistics are not group dependent.
- A student's ability estimate does not depend on the particular test taken.
- IRT does not need a parallel test to determine test reliability.
- An IRT model provides a measure of precision for each level of ability (see the sketch after this list).
- Item analysis accommodates matching test items to examinee ability levels.

The disadvantages of IRT:
- The assumptions underlying IRT models are more stringent than those required by classical test theory.
- IRT models tend to be more complex, and their output is more difficult to understand, particularly for non-technically oriented audiences.
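To illustrate the precision point above (this sketch is my addition, using the simpler two-parameter logistic model, not anything from the original outline): an item's Fisher information I(theta) = a² · P(theta) · (1 - P(theta)) tells you how precisely that item measures at each ability level, peaking near theta = b.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1 / (1 + np.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """Item information I(theta) = a^2 * P * (1 - P); peaks at theta = b."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)

# Made-up item (a = 1.2, b = 0.5) evaluated across the ability scale:
theta = np.linspace(-3, 3, 7)
print(np.round(info_2pl(theta, a=1.2, b=0.5), 3))
```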
3. Item Calibration and Ability Estimation
A teacher wants to give her students a math test to assess their skill level at the beginning of the school year. She develops two hard questions, one average question, and two easy questions. After giving the test to five students, she uses Item Response Theory (IRT) to determine the characteristics of each question.

Math test:
1. 105,100 / 167 =
2. 673 x 465 =
3. 6 x 7 =
4. 5 - 3 =
5. 2 + 2 =
|          | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Person 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Person 2 | 0 | 1 | 1 | 1 | 1 | 0.8 |
| Person 3 | 0 | 0 | 1 | 1 | 1 | 0.6 |
| Person 4 | 0 | 0 | 0 | 1 | 1 | 0.4 |
| Person 5 | 0 | 0 | 0 | 0 | 1 | 0.2 |
In this example, student 1, who answered all five items correctly, is tentatively considered to possess 100% proficiency. Student 2 has 80% proficiency, student 3 has 60%, student 4 has 40%, and student 5 has 20%.

These percentage scores are considered tentative for two reasons. First, IRT uses another set of terminology, and another scale, for proficiency. Second, we cannot judge a student's ability based on the number of correct items alone; item attributes, such as difficulty level, must also be taken into account.
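For the record, the row averages above are just per-person proportions of correct answers. A minimal sketch (the matrix is transcribed from the table):

```python
import numpy as np

# Rows = persons 1-5, columns = items 1-5 (1 = correct, 0 = wrong),
# transcribed from the table above.
X = np.array([
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
])

print(X.mean(axis=1))  # [1.0, 0.8, 0.6, 0.4, 0.2], one tentative proficiency per person
```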
In the preceding example, no two students have the same score. But what would happen if there were a student, say student 6, whose score is the same as that of student 4?
|          | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Person 4 | 0 | 0 | 0 | 1 | 1 | 0.4 |
| Person 5 | 0 | 0 | 0 | 0 | 1 | 0.2 |
| Person 6 | 1 | 1 | 0 | 0 | 0 | 0.4 |
In this case we cannot draw a firm conclusion that they have the same level of proficiency, because student 4 answered two easy items correctly, whereas student 6 answered two hard questions correctly.

This example is an ideal case, in which more proficient students answer all items correctly and less proficient students answer the easier items but fail the hard ones. However, such results rarely occur in reality; they are just "too good to be true." This ideal case is known as the Guttman pattern.
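One way to state the Guttman pattern operationally (my formulation, not from the original text): sort the items from hardest to easiest; a perfect pattern means every person's row is a block of failures followed by a block of successes.

```python
import numpy as np

def is_guttman(X):
    """True if the 0/1 response matrix X forms a perfect Guttman pattern:
    with items sorted by pass rate (hardest first), each person's row
    must be non-decreasing, i.e. all failures come before all successes."""
    Xs = X[:, np.argsort(X.mean(axis=0))]  # reorder items by pass rate
    return bool(np.all(np.diff(Xs, axis=1) >= 0))

X = np.array([[1, 1, 1, 1, 1],
              [0, 1, 1, 1, 1],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1]])
print(is_guttman(X))  # True for the ideal table above
# Adding person 6's row (1, 1, 0, 0, 0) would break the pattern: False.
```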
We can also make a tentative assessment of the item attributes based on this example.
|          | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Person 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Person 2 | 0 | 1 | 1 | 1 | 1 | 0.8 |
| Person 3 | 0 | 0 | 1 | 1 | 1 | 0.6 |
| Person 4 | 0 | 0 | 0 | 1 | 1 | 0.4 |
| Person 5 | 0 | 0 | 0 | 0 | 1 | 0.2 |
| Failure rate | 0.8 | 0.6 | 0.4 | 0.2 | 0 | |
Item 1 seems to be the most difficult because only one person out of five answered it correctly. It is tentatively asserted that the difficulty level of item 1, expressed as a failure rate, is 0.8, meaning that 80% of students were unable to answer the item correctly. In other words, the item is so difficult that it can "beat" 80% of the students. Note that for person proficiency we count the number of successful answers, but for item difficulty we count the number of failures.
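Computationally, this tentative item difficulty is just one minus the pass rate. A minimal sketch using the same response matrix as before:

```python
import numpy as np

X = np.array([[1, 1, 1, 1, 1],
              [0, 1, 1, 1, 1],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1]])

tid = 1 - X.mean(axis=0)  # failure rate per item = tentative difficulty
print(tid)                # [0.8, 0.6, 0.4, 0.2, 0.0]
```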
Item 5 is the easiest because 100% of the students answered it correctly, so its difficulty level is 0. However, as you might expect, the issue becomes more complicated when some items have the same pass rate but are answered correctly by students of different skill levels.
In the following example, item 1 and item 6 have the same difficulty level (0.8).
|          | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Person 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.83 |
| Person 2 | 0 | 1 | 1 | 1 | 1 | 0 | 0.67 |
| Person 3 | 0 | 0 | 1 | 1 | 1 | 0 | 0.50 |
| Person 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0.33 |
| Person 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0.33 |
| Failure rate | 0.8 | 0.6 | 0.4 | 0.2 | 0 | 0.8 | |
However, item 1 was answered correctly by a person of high proficiency (83%), whereas item 6 was not: the person who answered item 6 correctly has only 33% proficiency. It is possible that the wording of item 6 tends to confuse good students. Therefore, the attributes of item 6 are not clear-cut.
We call the proportion of correct answers for each person the tentative student proficiency (TSP), and the failure rate for each item the tentative item difficulty (TID).
Given the tentative information we have obtained about item difficulty and student proficiency, we can predict the probability that a student at a given proficiency level will answer a particular item correctly.
|          | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | TSP |
| --- | --- | --- | --- | --- | --- | --- |
| Person 1 | 0.55 | 0.60 | 0.65 | 0.69 | 0.73 | 1 |
| Person 2 | 0.50 | 0.55 | 0.60 | 0.65 | 0.69 | 0.8 |
| Person 3 | 0.45 | 0.50 | 0.55 | 0.60 | 0.65 | 0.6 |
| Person 4 | 0.40 | 0.45 | 0.50 | 0.55 | 0.60 | 0.4 |
| Person 5 | 0.35 | 0.40 | 0.45 | 0.50 | 0.55 | 0.2 |
| TID | 0.80 | 0.60 | 0.40 | 0.20 | 0.00 | |
For example, the probability that person 1 answers item 5 correctly is 0.73. This is no surprise: person 1 has a tentative proficiency of 1, while the tentative difficulty of item 5 is 0, so person 1 is definitely "smarter" or "better" than item 5. The probability that person 2 answers item 1 correctly is 0.50: the tentative student proficiency is 0.8, and the tentative item difficulty is also 0.8. In other words, the person's ability "matches" the item difficulty. When a student has a 50% chance of answering an item correctly, the student has no advantage over the item, and vice versa.
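The numbers in the table are consistent with applying a Rasch-type logistic function directly to the difference between TSP and TID, P = 1 / (1 + e^-(TSP - TID)); note that this reading is reverse-engineered from the table rather than stated in the text. A sketch:

```python
import numpy as np

tsp = np.array([1.0, 0.8, 0.6, 0.4, 0.2])  # tentative student proficiencies
tid = np.array([0.8, 0.6, 0.4, 0.2, 0.0])  # tentative item difficulties

# Logistic of each person-item difference; a "match" (tsp == tid) gives 0.5.
P = 1 / (1 + np.exp(-(tsp[:, None] - tid[None, :])))
print(np.round(P, 2))  # reproduces the table, e.g. P[0, 4] = 0.73
```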
When you move your eyes across the diagonal of the probability table above, from upper left to lower right, you will see this "match" (0.5) between a person and an item several times.
However, if we put the probability table and the table of raw scores together, we will find something strange.
|          | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Person 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Person 2 | 0 | 1 | 1 | 1 | 1 | 0.8 |
| Person 3 | 0 | 0 | 1 | 1 | 1 | 0.6 |
| Person 4 | 0 | 0 | 0 | 1 | 1 | 0.4 |
| Person 5 | 0 | 0 | 0 | 0 | 1 | 0.2 |
| Failure rate | 0.80 | 0.60 | 0.40 | 0.20 | 0.00 | |
According to the probability table, the probability of person 5 answering items 1 to 4 correctly ranges from 0.35 to 0.50. But in fact this student failed all four items. As you can see, the data and the probability model do not necessarily fit together. In IRT, this discrepancy is used to calibrate the estimates further, until the data and the model converge.
In short, both the item attributes and the student proficiencies must be taken into consideration in order to conduct item calibration and proficiency estimation. The data give us the tentative student proficiencies and item difficulties, which are used to fit the model; the model is then used to predict the data. Needless to say, there will be some differences between the model and the data in the initial steps, and it takes many cycles to reach convergence.
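To make the idea of cyclic calibration concrete, here is a minimal sketch of joint estimation for a Rasch-type model: start from TSP and TID, then alternately adjust abilities and difficulties by the data-minus-model residuals until the two converge. The update rule and the small shrinkage term are illustrative choices, not the algorithm of any particular IRT package:

```python
import numpy as np

# Response matrix from the walkthrough: rows = persons, columns = items.
X = np.array([
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
], dtype=float)

# Start from the tentative values, nudged off 0 and 1 so the logits are finite.
tsp = X.mean(axis=1).clip(0.05, 0.95)
tid = (1 - X.mean(axis=0)).clip(0.05, 0.95)
theta = np.log(tsp / (1 - tsp))  # person abilities on the logit scale
b = np.log(tid / (1 - tid))      # item difficulties on the logit scale

lr = 0.05
for cycle in range(200):
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))  # model predicts data
    resid = X - p                                         # data minus model
    # Gradient-style updates; the small 0.1 shrinkage keeps perfect
    # scorers (person 1, item 5) from drifting off to infinity.
    theta += lr * (resid.sum(axis=1) - 0.1 * theta)
    b -= lr * (resid.sum(axis=0) + 0.1 * b)
    b -= b.mean()  # anchor the scale: mean item difficulty = 0

print(np.round(theta, 2))
print(np.round(b, 2))
```

Real IRT software uses Newton-type maximum likelihood or Bayesian updates rather than this crude gradient step, but the fit-predict-refit cycle is the same.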