
Monday, 11 December 2023

Advanced Assessment


By: Sitti Fatimah Saleng

1.      Consider the result of the tryout as follows.

a.        What do you know about the reliability of the test?
To judge whether the reliability of the test is high or low, we may use the following reference:

Interval score            Category
rh < 0.200                Not reliable
0.200 ≤ rh ≤ 0.399        Low
0.400 ≤ rh ≤ 0.699
0.700 ≤ rh ≤ 0.899
rh ≥ 0.900                Very high

The reliability (alpha) of this test is 0.186. Because this value is below 0.200, it falls in the “not reliable” category of the reference above; it does not reach the “low” interval (0.200 to 0.399). The ideal reliability coefficient is 1.00. In practice the coefficient is typically less than 1.00, which may be caused by the nature of the problem, the situation at the time of measurement, the condition of the subjects, and so on.
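The post reports alpha but not how it was computed. As a minimal sketch, assuming the coefficient is Cronbach's alpha calculated from an examinee-by-item score matrix, the calculation could look like this. The function name and the small data set are hypothetical and are not the tryout data.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per examinee, one column per item)."""
    k = len(scores[0])                      # number of items
    columns = list(zip(*scores))            # item-wise view of the matrix
    item_var_sum = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical 0/1 scores: 5 examinees, 4 items (illustration only).
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

print(round(cronbach_alpha(scores), 3))
```

Alpha rises when the items vary together, that is, when the variance of the total score is large relative to the sum of the item variances.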

b.      In general, what are the level of difficulty, the discrimination, and the item validity of the items in the test?

1)      Level of difficulty (p)

p indices          Level of difficulty
0.000 - 0.099      Very difficult (discard or revise totally)
0.100 - 0.299      Difficult (needs to be revised)
0.300 - 0.700      Moderate (good)
0.701 - 0.900      Easy (needs to be revised)
0.901 - 1.000      Very easy (discard or revise totally)

2)      Level of discrimination (D)

D indices          Level of discrimination
D ≤ 0.199          Very low (reject or needs to be revised)
0.200 - 0.299      Low/marginal (needs to be revised)
0.300 - 0.399      Reasonably good
D ≥ 0.400          High (very good)

3)      Item validity
The interpretation is the same as for the level of discrimination.
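The post does not say how the three indices were obtained. The sketch below assumes the usual classical definitions: p as the proportion of correct answers, D as the difference in proportion correct between upper and lower groups (27% of examinees each, a common convention), and item validity as the uncorrected correlation between item score and total score. All function names and the data are hypothetical.

```python
from statistics import mean, pstdev

def difficulty(item):
    """p index: proportion of examinees who answered the item correctly."""
    return mean(item)

def discrimination(item, totals, fraction=0.27):
    """D index: proportion correct in the upper group minus the lower group."""
    n = max(1, round(len(item) * fraction))
    order = sorted(range(len(item)), key=lambda i: totals[i])
    low, high = order[:n], order[-n:]
    return mean([item[i] for i in high]) - mean([item[i] for i in low])

def item_total_correlation(item, totals):
    """Pearson correlation between item score and total score (item included)."""
    mx, my = mean(item), mean(totals)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item, totals)])
    return cov / (pstdev(item) * pstdev(totals))

# Hypothetical 0/1 responses: 5 examinees, 4 items (illustration only).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
item1 = [row[0] for row in responses]

print(difficulty(item1))
print(discrimination(item1, totals))
print(round(item_total_correlation(item1, totals), 3))
```

With so few examinees each group contains a single person, so the D value here is only illustrative; a real tryout would use a much larger sample.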


·         Item no. 5
Level of difficulty = 0.216, categorized as “difficult (needs to be revised)”
Level of discrimination = 0.467, categorized as “high/very good”
Item validity = 0.332, categorized as “reasonably good”

·         Item no. 6
Level of difficulty = 0.078, categorized as “very difficult (discard or revise totally)”
Level of discrimination = 0.608, categorized as “high/very good”
Item validity = 0.331, categorized as “reasonably good”

·         Item no. 7
Level of difficulty = 0.333, categorized as “moderate/good”
Level of discrimination = 0.633, categorized as “high/very good”
Item validity = 0.488, categorized as “high/very good”

·         Item no. 8
Level of difficulty = 0.078, categorized as “very difficult (discard or revise totally)”
Level of discrimination = 0.170, categorized as “very low (reject or revise totally)”
Item validity = 0.093, categorized as “very low (reject or revise totally)”

·         Item no. 9
Level of difficulty = 0.569, categorized as “moderate/good”
Level of discrimination = 0.367, categorized as “reasonably good”
Item validity = 0.291, categorized as “low/marginal (needs to be revised)”
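The categories above can be checked mechanically against the two reference scales. The sketch below is one way to do so; the function names are hypothetical, and the boundaries are treated as half-open intervals (an assumption, since the printed scales leave small gaps such as the one between 0.099 and 0.100). The index values are the ones reported for items 5 to 9.

```python
def classify_difficulty(p):
    """Map a p index to the level-of-difficulty scale."""
    if p < 0.100:
        return "very difficult"
    if p < 0.300:
        return "difficult"
    if p <= 0.700:
        return "moderate"
    if p <= 0.900:
        return "easy"
    return "very easy"

def classify_discrimination(d):
    """Map a D index (or an item validity value) to the discrimination scale."""
    if d < 0.200:
        return "very low"
    if d < 0.300:
        return "low/marginal"
    if d < 0.400:
        return "reasonably good"
    return "high"

# (difficulty, discrimination, item validity) as reported for each item.
items = {
    5: (0.216, 0.467, 0.332),
    6: (0.078, 0.608, 0.331),
    7: (0.333, 0.633, 0.488),
    8: (0.078, 0.170, 0.093),
    9: (0.569, 0.367, 0.291),
}

for number, (p, d, validity) in items.items():
    print(number,
          classify_difficulty(p),
          classify_discrimination(d),
          classify_discrimination(validity))
```

Item validity is passed through the discrimination scale because the two are interpreted in the same way.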

c.       What follow-up actions will you take on each item, as shown in the result of the tryout?
·         Item no. 5: The level of difficulty should be moderate, so the item needs to be revised.
·         Item no. 6: The level of difficulty should be moderate; because the item is categorized as very difficult, it needs to be revised.
·         Item no. 7: This item is categorized as a good item and should be stored in the item bank.
·         Item no. 8: This item should be discarded or revised totally, because its level of difficulty, discrimination, and item validity are all unsatisfactory.
·         Item no. 9: The item should be revised to improve its item validity.

A good item difficulty index is “moderate”, meaning the item is neither too hard nor too easy. How wide the “good” range is depends on the purpose of the test.
Items classified as “very good” can show the difference in ability between the two groups (high and low ability). Item no. 8 is considered poor (very low). It needs to be revised because it does not discriminate between the high and low groups (its index is close to 0). A good item has a level of discrimination of at least 0.30 (Ebel, 1979).

2.      Why should an assessment tool be valid and reliable?
When an assessment tool is reliable, it produces stable and consistent results; when it is valid, its results correctly represent the actual level of the skill being assessed. For example, suppose speaking skill is assessed with a paper-and-pencil test that requires the examinees to show their speaking skill by writing, and the speaking skill is estimated from that writing. The resulting speaking scores will have low validity (weak construct-validity evidence), which means that they represent knowledge of speaking better than the skill of speaking. If the assessment is instead administered as an interview test, which requires the examinees to show the skill by actually talking, and the speaking skill is estimated from that talking activity, then the resulting scores will have higher validity (stronger construct-validity evidence).
Ebel & Frisbie (1985: 73) said that reliability depends on the nature of the group tested, the test content, and the conditions of testing. Language skills assessment should be planned and conducted in the best possible way, so that the judgement correctly represents the skills being assessed and does not wrongly represent another skill (high construct and content validity), and so that it precisely represents the level of those skills, neither overestimating nor underestimating the true skill by much (high reliability). Low reliability of assessment results indicates that there have been big errors in the process of assessment. The errors may happen because the examinees cannot show their best performance during the assessment process, the raters cannot give their most objective judgement, the assessment process is not conducive enough for the examinees to show their best performance, the content of the assessment instrument is too heterogeneous, the questions are too difficult or too easy, or the distractors are not plausible enough.

3.      How are discrete-point, integrative, integrated, and performance-based testing different in the areas shown below?

Discrete-point testing
Belief about language: Language can be broken down into its component parts, and those parts can be tested.
Tests associated with the approach: Error recognition tests, phoneme recognition, yes/no and true/false answers, word completion, grammar items, and most multiple-choice tests. The units of language tested (discrete points) are phonology, graphology, morphology, lexicon, syntax, and discourse.
Strengths: It provides easily quantifiable data, makes wide coverage of items possible, and is efficient, with highly reliable and valid testing.
Weaknesses: It ignores the systematic relationship between language elements and the most important purpose of language, it lacks meaningful or natural context, knowledge of discrete elements on its own is worthless, etc.

Integrative testing
Belief about language: Language competence is a unified set of interacting abilities that cannot be tested separately.
Tests associated with the approach: Cloze tests, dictation, translation, essays and other coherent writing tasks, oral interviews and conversation, reading comprehension tests or other extended samples of real text, and listening comprehension tests.
Strengths: It has the potential to reveal several abilities: not just language abilities but also knowledge of the world and of text structure.
Weaknesses: It has low face validity. The context is not real or purposeful. It tests knowledge of how the language works rather than an ability. It does not allow for spontaneous production by the candidate and relies on the examiner for the language input.

Communicative (integrated) testing
Belief about language: Communicative assessment was adopted as a means of assessing language acquisition.
Tests associated with the approach: True-false questions, multiple choice, gap-filling summaries, cloze tests, dictation, and open-ended questions.
Strengths: Authenticity plays a major role in communicative testing. A listening text, for example, is taken from a real-life target-language communication situation and has the characteristics of that situation.
Weaknesses: It only provides some descriptions of communicative ability and its components. It does not specify how the concept of communicative competence should be operationalized, or the relative importance of these components in language proficiency.

Performance-based testing
Belief about language: Performance assessment requires students to demonstrate knowledge and skills, including the process by which they solve problems. This approach assumes that all skills must be tested. The classroom is alive.
Tests associated with the approach: Group projects, essays, experiments, demonstrations, portfolios, spoken and written production, integrated performance, group performance, and other interactive tasks.
Strengths: It provides teachers with more information about the learning needs of their students and enables them to modify their methods to meet these needs. It also allows students to assess their own progress, and it provides educators with information about what students have learned, not just how well they can learn.
Weaknesses: Performance assessments usually include fewer questions and call for a greater degree of subjective judgement than traditional testing methods. Teachers might teach only to the test, thereby narrowing the curriculum and reducing the test's value.

4.      Give an example of how ‘dictation’ may be conducted under integrative, communicative, and performance-based assessment/testing, so that the differences among them are obvious.

Dictation: Essentially, learners listen to a passage of 100 to 150 words read aloud by an administrator (or played from an audiotape) and write what they hear, using correct spelling. Success on a dictation requires careful listening, reproduction in writing of what is heard, and efficient short-term memory.
Here is the example:



Sport helps us to become strong and healthy. There are many kinds of sport: walking, running, hunting, cycling, swimming, and so on. It is not important what kind of sport we are going to do as long as we are strong enough to do it.
Healthy people should exercise regularly, no matter how old they are.
The simplest and the best sport is walking. It is also the cheapest one, because we do not need money to do it. A long walk in the evening may help us sleep more than any medicine.
But people today do not like walking. They prefer to drive a car, even though they are not in a hurry or travelling a long distance. This kind of "disease" comes from our laziness.


Integrative: Dictation requires reading and listening comprehension of the text.
Communicative: Dictation uses a listening text taken from real-life target-language communication. It requires the examinees to show not only their language understanding and language use but also their language performance.
Performance-Based: Dictation is a form of written production and integrated performance. It requires students to demonstrate knowledge and skills.

5.      What are the advantages and disadvantages of Classical Test Theory and Item Test Theory (Item Response Theory)?

Classical Test Theory (CTT)
The advantages of CTT:
·         A large sample size is not necessary for the CTT model;
·         The mathematical computation in model parameter estimation is much simpler than in item response theory;
·         The related concepts and analyses are straightforward; and
·         The CTT model is much easier to fit to the test data (Hambleton & Jones, 1993).

The disadvantages of CTT:
·         Item indices such as the level of difficulty and discrimination obtained using classical test theory depend on the group of examinees tested.
·         Students' assessment results depend on the particular items chosen for the instrument.
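For orientation, the classical model behind these points is usually written as observed score equals true score plus error. The notation below is the conventional textbook one, not taken from the post, and it assumes that true score and error are uncorrelated.

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Here X is the observed score, T the true score, E the error, and \rho_{XX'} the reliability. Because \sigma_T^2 is the true-score variance in the particular group tested, the reliability and the item indices change when the group changes, which is the group dependence listed above.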

Item Test Theory or Item Response Theory (IRT)
The advantages of IRT:
·         Item parameters in IRT are not group dependent.
·         The scores that describe the students are not test dependent.
·         IRT does not need a parallel test to determine test reliability.
·         An IRT model provides a measure of precision for each level of ability.
·         Item analysis accommodates matching test items to examinee knowledge levels.

The disadvantages of IRT:
·         The assumptions underlying the use of IRT models are more stringent than those required by classical test theory.
·         IRT models tend to be more complex, and their output is more difficult to understand, particularly for non-technically oriented audiences.
·         IRT models require large samples to obtain accurate and stable parameter estimates.
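The post does not name a specific IRT model. As a minimal sketch, assuming the two-parameter logistic (2PL) model, one common choice, the probability of a correct answer depends on the examinee's ability (theta) and on two item parameters: discrimination (a) and difficulty (b). The function names and the item values are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Item information at ability theta; higher means more precise measurement."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item: discrimination a = 1.2, difficulty b = 0.5.
a, b = 1.2, 0.5

for theta in (-1.0, 0.5, 2.0):
    print(theta,
          round(p_correct(theta, a, b), 3),
          round(item_information(theta, a, b), 3))
```

The parameters a and b belong to the item and theta belongs to the examinee, which is the sense in which item parameters are not tied to one group. The information function peaks where theta equals b, which illustrates both the precision available at each ability level and the idea of matching items to examinee levels.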