-
Different types of reliability
Three general types of estimates:
(1) Test-retest
(2) Parallel forms
(3) Internal consistency (split-half, Cronbach's alpha)
-
What is classical test theory? What are the underlying assumptions?
- “True score theory” based on following principles:
- (1) Person’s score (a.k.a. “observed” or “obtained” score) comprised of two components – a true score and error.
- Thus, X = T + e, where: X = observed score, T = true score, e = error in measurement
- (2) A true score exists for every measurable attribute of every person – a hypothetical score based on an infinite number of administrations of the test; it reflects the score that would be obtained if there were no measurement error
- (3) Obtained test scores contain error – error in measurement is typically indicated by the Standard Error of Measurement (SEM), which provides an estimate of how much an individual’s score would be expected to change on re-testing with the same/equivalent form of the test
- ***SEM – can be used to estimate a band or interval within which a person’s true score would be expected to fall
- *Importance of SEM – alerts the evaluator to the fact that scores are not exact and should be considered an estimate of level, i.e., the obtained score can appear higher or lower than the true score
- *Note – inverse relationship between the value of the SEM and the reliability coefficient
- Assumptions-
- *an individual’s true score is uniform across repeated administrations of the same test
- **i.e., the score is fixed, “set in stone” – in reality, is the score set in stone?
- *errors occur randomly and are normally distributed
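- A minimal Python sketch of the SEM and the resulting true-score band, assuming the standard formula SEM = SD * sqrt(1 - r_xx); the scale values (SD = 15, reliability = .90) and function names are illustrative:

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def true_score_band(observed, sd, reliability, z=1.96):
    """Band within which the true score is expected to fall (z = 1.96 gives ~95%)."""
    margin = z * sem(sd, reliability)
    return observed - margin, observed + margin

# Example: an IQ-style scale with SD = 15 and reliability = .90
print(round(sem(15, 0.90), 2))          # ~4.74
print(true_score_band(100, 15, 0.90))   # roughly (90.7, 109.3)
```

- Note the inverse relationship mentioned above: the higher the reliability coefficient, the smaller the SEM and the narrower the band.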
-
What types of information does the reliability coefficient yield? How do we interpret the coefficient in terms of test scores and the true variability accounted for?
- Range from 0.0 to 1.0
- *0.0 = complete lack of reliability (all error) / 1.0 = perfect reliability / a coefficient < 1.0 reflects the extent to which measurement error is present
- can be interpreted as a percentage
- *.90 = 10% of variation in scores attributable to measurement error (the remaining 90% reflects true differences)
- General reliability guidelines:
- .90 and above = test highly reliable
- .70 - .89 = moderate reliability
- <.70 = low reliability (below .60 = unacceptable)
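- These guidelines can be turned into a small sketch (cutoffs taken from the notes above; the function name and output wording are illustrative):

```python
def interpret_reliability(r):
    """Interpret a reliability coefficient using the guidelines above."""
    error_pct = (1 - r) * 100
    if r >= 0.90:
        label = "highly reliable"
    elif r >= 0.70:
        label = "moderate reliability"
    elif r >= 0.60:
        label = "low reliability"
    else:
        label = "unacceptable"
    return f"r = {r:.2f}: {label}; ~{error_pct:.0f}% of score variation is measurement error"

print(interpret_reliability(0.90))  # highly reliable; ~10% error
print(interpret_reliability(0.75))  # moderate reliability; ~25% error
```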
-
What possible implications do carryover effects have on reliability?
- Carryover occurs when the first administration influences performance on the second (e.g., remembering items, practice); this inflates the correlation between administrations, so test-retest reliability is overestimated
-
What are other sources of error that can affect reliability?
- (1) questionable measurement precision
- (2) item sampling – longer tests = increased reliability
- (3) construction of test items
- *should be objective
- *test takers should have to do little to interpret the questions
- *the way items are worded
- (4) Test administration-
- *environmental factors
- **examiner can influence test taker
- **fluctuations in room temperature / mood of the test taker
- (5) Scoring of the test
- *Objectivity – extent to which scores are free of the evaluator’s bias
- - objective scores reflect true individual differences, not the judgment/opinion of the evaluator; essay tests are typically less reliable than multiple choice – they introduce more subjectivity in scoring
- (6) Difficulty of the test – tests that are too easy or too difficult = lower reliability
- *e.g., range restriction – reliability is higher when scores are spread out over the entire scale, so the test shows real differences; i.e., need variability of scores (see the sketch after this list)
- (7) Factors related to the test taker – e.g., fatigue, illness, anxiety, inattention, hyperactivity, behavioral outbursts
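- A quick simulation of the range-restriction point in (6), using numpy/scipy with made-up numbers: the same test correlated across two administrations looks less reliable when only the upper half of the score distribution is retained:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 2000

# Simulated true scores plus random error at two administrations
true_score = rng.normal(100, 15, size=n)
time1 = true_score + rng.normal(0, 5, size=n)
time2 = true_score + rng.normal(0, 5, size=n)

r_full, _ = pearsonr(time1, time2)

# Range restriction: keep only examinees scoring above the mean at time 1
mask = time1 > 100
r_restricted, _ = pearsonr(time1[mask], time2[mask])

print(f"full range:       r = {r_full:.2f}")        # close to .90
print(f"restricted range: r = {r_restricted:.2f}")  # noticeably lower
```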
-
Define Test-Retest reliability
- correlation between scores on a particular measure at two points in time
- *i.e., stability of examinees’ scores between testing and re-testing when same questions and apparatus are used
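- A minimal sketch of the computation, using hypothetical scores and scipy’s pearsonr; the same Pearson r is used later for parallel forms, with Form A and Form B scores in place of the two time points:

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same eight examinees at two points in time
time1 = [12, 15, 9, 20, 17, 14, 11, 18]
time2 = [13, 14, 10, 19, 18, 13, 12, 17]

r, _ = pearsonr(time1, time2)
print(f"test-retest reliability: r = {r:.2f}")
```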
-
Appropriate applications of test-retest:
- *Evaluation of the temporal stability of trait-like, dispositional characteristics
- e.g., stability of perfectionism, intelligence
- *Directly measured characteristics
- e.g., baseline levels of hyperactivity w/o intervention
-
Test Retest not typically useful for:
*Measures consisting of a limited, fixed sample of items – test-retest tends to overestimate reliability
*Many psychological variables that vary depending on the day of administration, e.g., aggression, motivation, depression, anxiety – state-like characteristics
-
Cautions of Test Retest
(1) Carryover effects
- (2) Practice effects – a specific type of carryover effect; improvement in score can result simply from having been exposed to the test the first time – scores on the second administration tend to be higher than on the first
- (3) Time interval between test administrations is crucial – select and evaluate carefully; in general, the shorter the interval between administrations, the greater the likelihood of carryover effects
- **Well-evaluated tests will report test-retest estimates at different time intervals
-
Parallel forms reliability
*a.k.a. “equivalent forms” – typically preferable to test-retest
- *Defined: two forms of the same test are developed – both forms measure a common domain, should be of equal difficulty, and the same rules are applied for item selection
- *reliability coefficient = correlation between scores on both forms
- *Both forms must be administered to same set of examinees (also true for test-retest)
- *Preferable to administer tests on same day
- *very rigorous assessment of reliability
- *Not commonly used in practice
- *highly challenging to develop one form of a test, let alone two
- *can be impractical to administer both forms of the measure to the same examinees
- *Pearson r used to estimate reliability for parallel forms (and test-retest)
- *Due to difficulties assoc. with creating two forms, test developers tend to base reliability estimates on one form
- ****i.e., evaluate “internal consistency” by dividing one test into sub-components
-
Parallel forms reliability assumptions
*examinee’s true score is equivalent for both forms
*SEMs for each form are equal
*level of difficulty of the items on each form is the same
-
Split half reliability
- Defined: one test is split into two parts or halves/two halves are scored separately (odd-even system is common)
- *Scores from two halves of test are correlated
- *process tends to underestimate reliability of test overall
- *Underestimates because each “subtest” is half as long as entire test
- **i.e., the correlation between the halves estimates the reliability of a test only half as long as the full test
- *Spearman-Brown formula corrects for this – estimates what the reliability would be if each half had been the length of the full test
- **typically referred to as “corrected” split-half reliability – formula: corrected r = 2r / (1 + r), where r = Pearson r between the halves
- **Results obtained by Spearman-Brown usually accurate only when assumptions are met
- **typically raises estimate of reliability for total test
- Important assumption: variances of both halves of the test are equal – if this condition is not satisfied, the formula should not be used
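- A sketch of the odd-even split with the Spearman-Brown correction, applied to a hypothetical examinees x items matrix of 0/1 scores; function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

def corrected_split_half(items):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    items: examinees x items array of item scores.
    """
    odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_half, _ = pearsonr(odd_half, even_half)
    return (2 * r_half) / (1 + r_half)      # corrected r = 2r / (1 + r)

# Hypothetical 0/1 scores for 6 examinees on an 8-item test
scores = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 0, 1],
])
print(f"corrected split-half reliability: {corrected_split_half(scores):.2f}")
```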
-
Internal Consistency
- Decisions about how to split test into two halves can cause problems
- *unequal variances
- *separate scoring of halves
- *ensuring halves are of = difficulty level
- Kuder and Richardson (1937) – developed a procedure for estimating reliability without splitting the test into halves – KR20
-
KR20
- *Avoids problems of split-half methods
- *Simultaneously considers all possible ways of splitting the items
- *Mathematical proofs show estimates yielded by KR20 are similar to split-half reliabilities obtained by dividing tests in all possible ways
- *Only appropriate for tests in which items are scored either correct (i.e., 1) or incorrect (i.e., 0).
- **e.g., typical classroom tests (multiple choice, true-false, fill-in-the-blank)
- KR21 – a simpler method of estimating reliability
- *rests on the assumption that all items are of equal difficulty
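- A minimal sketch of KR-20 for dichotomously scored items, using the standard formula KR20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores); the data matrix and the use of the sample variance (ddof=1) are illustrative choices:

```python
import numpy as np

def kr20(items):
    """KR-20 for items scored 1 (correct) or 0 (incorrect).

    items: examinees x items array of 0/1 scores.
    """
    k = items.shape[1]
    p = items.mean(axis=0)                     # proportion passing each item
    q = 1 - p                                  # proportion failing each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical 0/1 scores for 6 examinees on an 8-item classroom test
answers = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 0, 1],
])
print(f"KR-20: {kr20(answers):.2f}")
```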
-
Cronbach alpha
- a.k.a. “coefficient alpha”
- *Many times, responses to test items can’t be classified as “right” or “wrong”
- **e.g., Likert scales with responses ranging from 1-5.
- *Cronbach alpha used in this case
- **considered to be most general formula for determining reliability estimate through internal consistency
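- A minimal sketch of coefficient alpha for items on a non-dichotomous scale (e.g., 1-5 Likert), using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); the response matrix and names are hypothetical:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an examinees x items array of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 1-5 Likert responses: 5 respondents x 4 items
likert = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])
print(f"Cronbach's alpha: {cronbach_alpha(likert):.2f}")
```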
-
Other considerations for internal consistency
- *All measures of internal consistency evaluate extent to which items measure same trait or ability
- *If a test has subscales measuring different abilities/traits, the test as a whole will not be internally consistent
- *Split-half and internal consistency estimates are only appropriate for power tests, not speeded tests
-
Overview of test reliability
- *Reliability defined: extent to which a test or measure yields consistent results across administrations – quantifying examinee consistency/inconsistency
- *same or similar score obtained across administrations = high reliability
- *Reliability – 1st characteristic of psychometric soundness
- *Lack of reliability = inconsistent measurement of performance -scores do not accurately reflect variable being measured
- **e.g., bathroom scale with loose spring
- **e.g., outdoor thermometer
- *Test scores – highly susceptible to measurement error
- **Scores cannot be trusted unless we know they are obtained consistently
- **Example – using an assessment tool to make employment decisions when only 40% of the information yielded by the tool is attributable to real individual differences. Is this measure useful for making important employment decisions?
-
Reliability con't
- *Reliability = measure of the extent to which obtained scores are free of measurement error
- *Variations across administrations are the result of random (chance) errors
- In finding a test’s reliability, want to determine:
- (1) amount of variability (differences in scores) related to the purpose of the measure
- (2) variability due to measurement error – multiple factors that extend beyond the parameters of the test introduce error in measurement