# Reliability

## Definition

If a test is unreliable, the results from one use may be valid while those from another are not. Reliability is thus a measure of how far you can trust a test to give consistent results.

Tests can have high reliability at the expense of validity. In other words, you can get the same result time after time, yet it does not tell you what you really want to know.

## Stability

Stability is a measure of the repeatability of a test over time: whether it gives the same results whenever it is used (within defined constraints, of course).

Test-retest reliability is the extent to which a test gives the same results when the same person takes it again at a later time. It needs to be checked to assure the stability of a test. Stability, in this case, is assessed from the variation between the scores obtained on each occasion. Problems with this include:

• Carry-over effect: people remembering answers from last time.
• Practice effect: repeatedly taking the test improves the score (typical with classic IQ tests).
• Attrition: people not being present for re-tests.

There is an assumption with stability that what is being measured does not change. Variation should be due to the test, not to any other factor. Sadly, this is not always true.
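
Test-retest stability is commonly summarised as the correlation between scores from the two sittings. The following is a minimal sketch of that idea, assuming a Pearson correlation is an acceptable summary; the function name `pearson_r` and the scores are hypothetical and purely illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same five people on two occasions.
first_sitting = [12, 15, 9, 20, 17]
second_sitting = [13, 14, 10, 19, 18]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest r = {test_retest_r:.2f}")
```

A value close to 1 suggests stable scores. Note that people lost to attrition simply drop out of both lists, which is one reason the figure can mislead.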

## Consistency

Consistency is a measure of reliability through similarity within the test, with individual questions giving predictable answers every time.

Consistency can be measured with split-half testing and the Kuder-Richardson test.

### Split-half testing

Split-half testing measures consistency by:

• Dividing the test into two (usually at the mid-point, by odd/even item numbers, at random or by some other method).
• Administering the halves as separate tests.
• Comparing the results from each half.

A problem with this is that the resultant tests are shorter and can hence lose reliability. Split-half is thus better with tests that are rather long in the first place.

Use the Spearman-Brown formula to correct for this shortness, estimating the correlation as if each part were full length:

$$r = \frac{2 r_{hh}}{1 + r_{hh}}$$

(where $r_{hh}$ is the correlation between the two halves)
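
The split-half procedure and the Spearman-Brown correction can be sketched in a few lines. This is an illustrative sketch only: it assumes an odd/even split, one row of item scores per person, and made-up right/wrong responses; the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_hh):
    """Estimate full-length reliability from the half-test correlation."""
    return (2 * r_hh) / (1 + r_hh)


def split_half_reliability(item_scores):
    """Return (r_hh, corrected r) using an odd/even split of the items."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_hh = pearson_r(odd_totals, even_totals)
    return r_hh, spearman_brown(r_hh)


# Hypothetical right/wrong (1/0) responses: five people, six items.
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

r_hh, corrected = split_half_reliability(responses)
print(f"half-test r = {r_hh:.2f}, corrected r = {corrected:.2f}")
```

For any positive half-test correlation below 1, the corrected value is higher than the raw one, reflecting the longer full-length test.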

### Kuder-Richardson reliability or coefficient alpha

The Kuder-Richardson reliability or coefficient alpha is relatively simple to calculate, being based on a single administration of the test. It assesses the inter-item consistency of a test by looking at two error measures:

• Content sampling
• Heterogeneity of the domain being sampled

It assumes that reliable tests contain more variance and are thus more discriminating. Higher heterogeneity leads to lower inter-item consistency. For items that are not simply scored right/wrong (non-dichotomous items), coefficient alpha is:

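$$R_{kk} = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_t^2}\right)$$

where $R_{kk}$ is the alpha coefficient of the test, $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$ and $\sigma_t^2$ is the variance of the total test score.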
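
As a rough illustration of how coefficient alpha is computed from item and total-score variances, here is a minimal sketch. It assumes one row of item scores per person and uses population variances throughout; the function name `coefficient_alpha` and the rating data are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a table with one row of item scores per person."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across people.
    item_variances = [
        pvariance([row[i] for row in item_scores]) for i in range(k)
    ]
    # Variance of each person's total test score.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores must vary for alpha to be defined")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 1-5 ratings: four people, three items.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
]

alpha = coefficient_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

When every item is scored 0 or 1, each item variance reduces to the proportion passing multiplied by the proportion failing, so the same calculation gives the Kuder-Richardson (KR-20) value.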

### Equivalence of results (parallel form)

This approach seeks reliability through equivalence between two versions of the same test, comparing the results from each version (much as split-half testing compares the two halves). It is better than test-retest in that both versions can be taken on the same day (reducing variation).

There is a danger of tests with high internal consistency having limited coverage (and hence lower final validity).

Bloated specifics occur where very similar questions lead to apparent significance. This can be bad when unintended, but can be used to create deliberate variations.

Parallel versions are useful in situations such as graduate testing, where candidates may take the same test several times.

An adverse effect occurs where different groups score differently (indicating potential racial or other bias). This may require different versions of the same test, e.g. the MBTI for different countries.

## Discussion

There are a number of procedural aspects that affect test reliability, including:

• Test conditions
• Variation in test marking
• Application of an inappropriate norm group
• Internal state of test-taker (tired, etc.)
• Experience level of test-taker (e.g. if they have taken the test before).

