
Different types of reliability
Three general types of estimates:
(1) Test-retest
(2) Parallel forms
(3) Split-half / internal consistency (Cronbach alpha)

What is classical test theory? What are the underlying assumptions?
 “True score theory” based on following principles:
 (1) Person’s score (a.k.a. “observed” or “obtained” score) comprised of two components – a true score and error.
 Thus, X = T + e, where: X = observed score, T = true score, e = error in measurement
 (2) A true score exists for every measurable attribute of every person: a hypothetical score based on an infinite number of
 administrations of the test; reflects the score if there were no measurement error
 (3) Obtained test scores contain error. Error in measurement is typically indicated by the Standard Error
 of Measurement (SEM), which provides an estimate of how much an individual’s score would be expected to change on
 retesting with the same/equivalent form of the test
 ***SEM – can be used to estimate a band or interval within which person’s true score would be expected to fall
 *Importance of SEM – alerts evaluator to the fact that scores are not exact and should be considered an estimate of level,
 i.e., an obtained score can appear higher or lower than the true score
 Note – inverse relationship between value of SEM and reliability coefficient
 Assumptions
 *individual’s true score is uniform across repeated administrations of same test.
 **i.e., score is treated as fixed (“set in stone”); in reality, is a score set in stone?
 *errors occur randomly and are normally distributed
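
The inverse relationship between SEM and the reliability coefficient can be made concrete with a short sketch. It assumes the standard classical test theory formula SEM = SD * sqrt(1 - r), which is not written out in these notes, and uses hypothetical values (SD = 15, reliability = .91, observed score = 100). The band relies on the normal-error assumption listed above.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r). Standard CTT formula, assumed here (not stated in the notes)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def true_score_band(observed: float, sem: float, z: float = 1.96):
    """Interval around the observed score; z = 1.96 is roughly a 95% band under normal errors."""
    return observed - z * sem, observed + z * sem


# Hypothetical test: SD = 15, reliability = .91, one examinee scoring 100.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = true_score_band(observed=100.0, sem=sem)
print(f"SEM = {sem:.2f}; band = {low:.1f} to {high:.1f}")

# Lower reliability gives a larger SEM, hence a wider band.
print(standard_error_of_measurement(15.0, 0.70) > sem)
```

With these made-up numbers the SEM works out to about 4.5 points, so the obtained score of 100 is better read as a range than as an exact value.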

What types of info does the reliability coefficient yield? How do we interpret the correlation coefficient in terms of test scores and true variability accounted for?
 Range from 0.0 to 1.0
 *0.0 = complete lack of reliability (all error); 1.0 = perfect reliability; a coefficient < 1.0 indicates the extent to which measurement error is present
 can be interpreted as a percentage
 *e.g., .90 = 90% of variation in scores reflects true differences; 10% attributable to measurement error
 General reliability guidelines:
 .90 and above = test highly reliable
 .70 - .89 = moderate reliability
 <.70 = low reliability (below .60 = unacceptable)
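
A minimal sketch of the interpretation above: it converts a reliability coefficient into percentages of true and error variance and applies the guideline cutoffs from these notes. The function name and return format are illustrative only.

```python
def interpret_reliability(r: float) -> dict:
    """Split a reliability coefficient into % true vs. % error variance and label it."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability coefficient must be between 0.0 and 1.0")
    if r >= 0.90:
        label = "high"
    elif r >= 0.70:
        label = "moderate"
    elif r >= 0.60:
        label = "low"
    else:
        label = "unacceptable"
    return {
        "true_variance_pct": round(r * 100, 1),
        "error_variance_pct": round((1.0 - r) * 100, 1),
        "label": label,
    }


for coefficient in (0.90, 0.75, 0.40):
    print(coefficient, interpret_reliability(coefficient))
```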

What possible implications do carryover effects have on reliability?

What are other sources of error that can affect reliability?
 (1) questionable measurement precision
 (2) item sampling: longer test = increased reliability
 (3) construction of test items
 *should be objective
 *test-takers should have to do little to interpret questions
 *way items are worded
 (4) Test administration
 *environmental factors
 **examiner can influence test taker
 **fluctuations in temp of room/ mood of test taker
 (5) Scoring of the test
 *Objectivity: extent to which scores are free of evaluator’s bias
 objective scores reflect true individual differences, not judgment/opinion of evaluator; essay tests
 typically less reliable than multiple choice – introduce more subjectivity in scoring
 (6) Difficulty of the test: tests that are too easy or too difficult = lower reliability
 *e.g., range restriction; reliability higher when scores spread out over entire scale – test shows real differences
 i.e., need variability of scores
 (7) Factors related to test-taker, e.g., fatigue, illness, anxiety, inattention, hyperactivity, behavioral outbursts

Define test-retest reliability
 correlation between scores on a particular measure at two points in time
 *i.e., stability of examinees’ scores between testing and retesting when same questions and apparatus are used
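
As a rough illustration of the definition above, the test-retest coefficient is the Pearson correlation between the same examinees' scores at two time points. The scores below are hypothetical, and the helper is a plain implementation of Pearson r rather than any particular package's routine.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2 examinees)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary; correlation is undefined otherwise")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for six examinees tested twice with the same measure.
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 12]

r = pearson_r(time1, time2)
print(f"test-retest reliability estimate: {r:.2f}")
```

The same calculation applies to parallel forms, with the two lists holding scores on form A and form B instead of two administrations.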

Appropriate applications of test-retest:
 *Evaluation of temporal stability of trait-like, dispositional characteristics
 e.g., stability of perfectionism, intelligence
 *Directly measured characteristics
 e.g., baseline levels of hyperactivity w/o intervention

Test-retest not typically useful for:
*Measures consisting of limited, fixed sample of items: test-retest tends to overestimate reliability
*Many psychological variables that vary depending on day of administration, e.g., aggression, motivation, depression, anxiety: state-like characteristics

Cautions of test-retest
(1) Carryover effects
 (2) Practice effects: specific type of carryover effect; improvement in score can result simply from being exposed to the test the
 first time; scores on second administration tend to be higher than on first administration
 (3) Time interval between test administrations is crucial: select and evaluate carefully; in general, the shorter the interval between administrations, the greater the likelihood of carryover effects
 **Well-evaluated tests will report test-retest estimates at different time intervals

Parallel forms reliability
*a.k.a. “equivalent forms”; typically preferable to test-retest
 *Defined: two forms of same test developed; both forms measure common domain; should be of equal difficulty; same rules applied for item selection
 *reliability coefficient = correlation between scores on both forms
 *Both forms must be administered to same set of examinees (also true for test-retest)
 *Preferable to administer tests on same day
 *very rigorous assessment of reliability
 *Not commonly used in practice
 *highly challenging to develop one form of test, let alone 2
 *can be impractical to administer both forms of measure to same examinees
 *Pearson r used to estimate reliability for parallel forms (and test-retest)
 *Due to difficulties associated with creating two forms, test developers tend to base reliability estimates on one form
 ****i.e., evaluate “internal consistency” by dividing one test into subcomponents

Parallel forms reliability assumptions
*examinee’s true score is equivalent for both forms
*SEMs for each form are equal; level of difficulty for items on each form is the same

Split-half reliability
 Defined: one test is split into two parts or halves; the two halves are scored separately (odd-even system is common)
 *Scores from two halves of test are correlated
 *process tends to underestimate reliability of test overall
 *Underestimates because each “subtest” is half as long as entire test
 **i.e., correlation of halves is typically viewed as a reliability estimate for a test half the length of the full test
 *Spearman-Brown formula corrects for this: estimates test reliability if each half had been the length of the full test
 **typically referred to as “corrected” split-half reliability; formula: r_corrected = 2r / (1 + r), where r = Pearson r between the two halves
 **Results obtained by SpearmanBrown usually accurate only when assumptions are met
 **typically raises estimate of reliability for total test
 Important assumption: variances of both halves of test are equal; if this condition is not satisfied, formula should not be used
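
The split-half procedure can be sketched as follows: score the odd and even items separately, correlate the two half scores, then apply the Spearman-Brown correction. The item responses are hypothetical (6 examinees, 6 items scored 0/1), and in practice the equal-variance assumption would be checked before the correction is trusted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("half scores must vary; correlation is undefined otherwise")
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Corrected split-half reliability: 2r / (1 + r)."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half(item_scores):
    """Odd-even split. item_scores: one row per examinee, one column per item."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, spearman_brown(r_half)


# Hypothetical responses (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

r_half, r_full = split_half(responses)
print(f"half-test r = {r_half:.2f}; corrected = {r_full:.2f}")
```

For positive half-test correlations the corrected value is always the larger of the two, which matches the note that the correction typically raises the estimate.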

Internal Consistency
 Decisions about how to split test into two halves can cause problems
 *unequal variances
 *separate scoring of halves
 *ensuring halves are of = difficulty level
 Kuder and Richardson (1937) – developed procedure for estimating reliability without splitting test into halves – KR-20

KR-20
 *Avoids problems of split-half methods
 *Simultaneously considers all possible ways of splitting the items
 *Mathematical proofs show estimates yielded by KR-20 are similar to split-half reliabilities obtained by dividing tests in all possible ways
 *Only appropriate for tests in which items are scored either correct (i.e., 1) or incorrect (i.e., 0).
 **e.g., typical classroom tests (multiple choice, true-false, fill-in-the-blank).
 KR-21 – simpler method of estimating reliability
 *rests on assumption that items are all of equal difficulty
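
A sketch of both coefficients for 0/1 item data, using the usual textbook formulas (not written out in these notes): KR-20 = (k/(k-1)) * (1 - sum(p*q) / total-score variance) and KR-21 = (k/(k-1)) * (1 - M(k - M) / (k * total-score variance)), where k = number of items, p = proportion passing an item, q = 1 - p, M = mean total score. Population variance is assumed throughout, and the response matrix is hypothetical.

```python
import statistics


def kr20(responses):
    """KR-20 for dichotomous (0/1) items; rows = examinees, columns = items."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = statistics.pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores must vary")
    sum_pq = 0.0
    for item in zip(*responses):        # iterate over item columns
        p = sum(item) / len(item)       # proportion answering the item correctly
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


def kr21(responses):
    """KR-21: needs only the mean and variance of total scores; assumes equal item difficulty."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = statistics.mean(totals)
    var_total = statistics.pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores must vary")
    return (k / (k - 1)) * (1.0 - mean_total * (k - mean_total) / (k * var_total))


# Hypothetical responses (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

print(f"KR-20 = {kr20(responses):.2f}; KR-21 = {kr21(responses):.2f}")
```

Because the items in this made-up data differ in difficulty, KR-21 comes out lower than KR-20, which is the expected direction when the equal-difficulty assumption does not hold.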

Cronbach alpha
 a.k.a. “coefficient alpha”
 *Many times, responses to test items can’t be classified as “right” or “wrong”
 **e.g., Likert scales with responses ranging from 1-5.
 *Cronbach alpha used in this case
 **considered to be most general formula for determining reliability estimate through internal consistency
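
A minimal sketch of coefficient alpha, assuming the usual formula alpha = (k/(k-1)) * (1 - sum of item variances / variance of total scores), which is not spelled out in these notes. The Likert responses (5 respondents, 4 items, scored 1-5) are hypothetical.

```python
import statistics


def cronbach_alpha(responses):
    """Coefficient alpha; rows = respondents, columns = items (any numeric scoring)."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(row) for row in responses]
    var_total = statistics.pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores must vary")
    # Sum of the individual item variances (one variance per column).
    sum_item_var = sum(statistics.pvariance(item) for item in zip(*responses))
    return (k / (k - 1)) * (1.0 - sum_item_var / var_total)


# Hypothetical 1-5 Likert responses.
likert = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(likert)
print(f"Cronbach alpha = {alpha:.2f}")
```

When every item is scored 0/1, each item variance reduces to p*q, so this calculation gives the same value as KR-20.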

Other considerations for internal consistency
 *All measures of internal consistency evaluate extent to which items measure same trait or ability
 *If a test has subscales measuring different abilities/traits, the test as a whole will not be internally consistent
 *Split-half and internal consistency estimates are only appropriate for power tests, not speeded tests

Overview of test reliability
 *Reliability defined: extent to which a test or measure yields consistent results across administrations; quantifying examinee consistency/inconsistency
 *same or similar score obtained across administrations = high reliability
 *Reliability – 1st characteristic of psychometric soundness
 *Lack of reliability = inconsistent measurement of performance; scores do not accurately reflect variable being measured
 **e.g., bathroom scale with loose spring
 **e.g., outdoor thermometer
 *Test scores – highly susceptible to measurement error
 **Scores cannot be trusted unless we know they are obtained consistently
 **example: using an assessment tool to make employment decisions; 40% of information yielded from tool is attributable to real individual differences. Is this measure useful for making important employment decisions?

Reliability cont'd
 *Reliability = measure of extent to which obtained scores are free of measurement error
 *Variations across administrations are result of random (chance) errors
 In finding test’s reliability, want to determine:
 (1) amount of variability (differences in scores) related to purpose of the measure
 (2) variability due to measurement error: multiple factors that extend beyond parameters of test introduce error in measurement
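
The two sources of variability can be illustrated with a small simulation under classical test theory assumptions: observed = true + random error, with reliability taken as true-score variance divided by observed-score variance. All values are made up (true-score SD = 10, error SD = 5, so the theoretical reliability is 100 / 125 = .80), and the seed is fixed only so the run is repeatable.

```python
import random
import statistics


def simulate_scores(n, true_sd, error_sd, seed=0):
    """Generate true scores and observed scores (true + normally distributed random error)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


true_scores, observed = simulate_scores(n=20000, true_sd=10.0, error_sd=5.0)

var_true = statistics.pvariance(true_scores)   # variability tied to the attribute measured
var_observed = statistics.pvariance(observed)  # true variability plus measurement error
reliability = var_true / var_observed

print(f"estimated reliability = {reliability:.2f} (theoretical value .80)")
```

In real testing the true scores are never observed directly, which is why reliability has to be estimated indirectly through the test-retest, parallel forms, and internal consistency methods described earlier.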

