Consumer’s Guide to Social and Emotional Assessment, Part II: Reliability and Its Consequences

This is the second part in our blog series about how to evaluate the merits of an SEL assessment. The last post provided an overview of judging whether an assessment is up to the task. Upcoming blog posts will demystify the concepts of factor structure and validity.

Reliability: The “Steady Eddie” of Psychometrics

Today, let’s talk about reliability. Here’s my goal: you should be able to spend ten minutes reading this post and have everything it takes to evaluate whether the reliability of an assessment you are using or considering is good enough to accomplish your assessment goal.

Nitpicky but important point #1: Reliability does not refer to the assessment; it refers to the scores. Thus, an assessment can’t be reliable, but the scores it produces can.

So what is reliability? Reliability refers to the reproducibility of the scores produced by an assessment. A reliable score is one that is consistent, where consistency refers to the reproducibility of the score from one item to the next, from one testing occasion to the next, and from one rater to another. In that sense, it is a kind of “Steady Eddie” quotient characterizing the consistency of measurement.

A specific example might help. Let’s consider the reliability of a score that is easy to understand—your height. Typically, we measure with some form of measuring tape or yardstick. Consider the consequences of the material used to make the yardstick. If it’s made of a very hard metal and is used to measure height in the same manner every time, you should get a similar result each time. I’m about 6′3″. A reliable measuring system will yield a very similar score each time.

It is easy to imagine a less reliable method of measuring height—using a rubber measuring stick or a highly elastic string, for example. What is the consequence of the material choice? If I measure myself ten times with these less reliable methods, each time, I will get a different answer. It won’t be a total loss though, because the scores will tend to cluster around my true height. The imperfect method just introduces a level of uncertainty that permits an estimated range more than a precise number. I might learn, for example, that there’s a 95% chance that my height is somewhere between 5′11″ and 6′7″. The higher the reliability of the score, the smaller that range will be.
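To make the rubber yardstick concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than real data: the true height of 75 inches (6′3″), the made-up error sizes for the metal and rubber yardsticks, and the helper name `measure`. I use 1,000 simulated measurements instead of ten so that the clustering is easy to see.

```python
import random
from statistics import mean, stdev

TRUE_HEIGHT = 75.0  # inches, i.e. 6'3" (assumed for illustration)


def measure(true_height, error_sd, times, rng):
    """Simulate repeated measurements with normally distributed error."""
    return [rng.gauss(true_height, error_sd) for _ in range(times)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical error sizes: a rigid metal yardstick vs. a stretchy rubber one.
metal = measure(TRUE_HEIGHT, 0.1, 1000, rng)
rubber = measure(TRUE_HEIGHT, 2.0, 1000, rng)

# Share of rubber-yardstick measurements within 4 inches of the true height,
# which is the 5'11" to 6'7" range from the example.
share_in_range = sum(abs(m - TRUE_HEIGHT) <= 4.0 for m in rubber) / len(rubber)

print(f"metal:  mean={mean(metal):.2f}, spread={stdev(metal):.2f}")
print(f"rubber: mean={mean(rubber):.2f}, spread={stdev(rubber):.2f}")
print(f"rubber measurements within 4 inches: {share_in_range:.0%}")
```

Both sets of measurements center on the true height, but the rubber yardstick scatters much more widely around it. That wider scatter is what lower reliability looks like.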

Key point (not nitpicky!): The lower a score’s reliability, the more likely a child’s score on an assessment is to over- or under-estimate their level of competence. With lower reliability, interpret individual student scores with caution.

As this example shows, the consequence of unreliability is this: the lower a score’s reliability, the further any single measurement might be from the thing that the assessment is designed to measure. We can flip this formulation: The higher the reliability, the closer any individual score will tend to be to the characteristic being measured. This is true of our height example. It is also true of SEL assessments. The higher the reliability of the scores produced by an SEL assessment, the closer any individual student’s score will tend to be to the competence the assessment is designed to measure. The lower the reliability of SEL assessment scores, the more an individual measure may under- or over-estimate a student’s actual competence.

Because social and emotional competencies are invisible processes that cannot be directly measured in the way that height can, they require us to infer a child’s level of competence from behaviors or performance on a task. As a result, all SEL assessments include some amount of error, by which I mean they don’t measure with perfect reliability. The good news is that we can quantify reliability to judge whether an assessment is good enough for the task we would have it do.

Many factors can affect reliability. For example, the way an assessment and its items are designed can affect consistency of measurement.

The Kinds of Reliability

Reliability is usually judged in terms of consistency of scores across items on the assessment, assessment occasions, and raters. Let’s consider each so you know what they are and how to judge them.

Consistency across items is also called internal consistency reliability. This refers to the extent to which respondents with a particular level of competence tend to perform consistently across items with similar difficulty levels that make up the overall score. Let’s say an assessment is designed to assess children’s ability to recognize others’ emotions from their facial expressions. For the assessment, children look at pictures and indicate what each person is feeling. When children correctly label the face as happy, sad, angry, or whatever, they get a point. Incorrect responses get zero points.

If the assessment is internally consistent, children should perform similarly across comparably difficult items that make up the total score. High internal consistency means the child’s true skill level is well and precisely estimated by the score. Low internal consistency means that the child’s true skill level is not precisely estimated by the score. It can also mean the assessment is measuring more than one thing, making it difficult to interpret the meaning of the score.
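If you are curious how internal consistency gets turned into a number, here is a minimal sketch using Cronbach’s alpha. That choice is my assumption for illustration: alpha is one widely used internal consistency coefficient, but an assessment’s technical manual may report a different one. The six children and four emotion recognition items below are invented data, and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of scores.

    item_scores: one row per child, one column per item
    (here 1 = correct label, 0 = incorrect label).
    """
    n_items = len(item_scores[0])
    # Variance of each item across children.
    item_variances = [
        variance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    # Variance of the children's total scores.
    total_variance = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Invented responses: six children, four facial-expression items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")  # about 0.83 for this invented data
```

Alpha rises when children who do well on one item tend to do well on the others, which is the pattern of consistent performance described above.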

Consistency across time is also called test-retest reliability. This refers to the extent to which respondents tend to achieve a similar score when they are assessed more than once. Back to the ruler example: good test-retest reliability means that repeated measures of my height yield a very similar score (6′3″ish). Similarly, SEL assessments should yield similar scores on repeated measurement.

Test-retest reliability is usually determined by correlating scores from one administration of an assessment with scores from the same assessment administered later. Often, the interval between administrations is two weeks. The longer the interval, the lower the correlation tends to be.
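The correlation behind a test-retest coefficient can be sketched in a few lines. I am assuming a Pearson correlation here, which is a common choice but not the only one, and the two sets of scores are invented for five hypothetical students. The helper name `pearson_r` is mine.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    # How much the two sets of scores vary together.
    covariation = sum(
        (a - mean_first) * (b - mean_second) for a, b in zip(first, second)
    )
    # How much each set of scores varies on its own.
    spread_first = sqrt(sum((a - mean_first) ** 2 for a in first))
    spread_second = sqrt(sum((b - mean_second) ** 2 for b in second))
    return covariation / (spread_first * spread_second)


# Invented scores for the same five students, two weeks apart.
week_0 = [10, 12, 15, 18, 20]
week_2 = [11, 13, 14, 19, 18]

test_retest = pearson_r(week_0, week_2)
print(f"Test-retest reliability: {test_retest:.2f}")  # about 0.95 here
```

In this invented example the students keep nearly the same rank order from one occasion to the next, so the coefficient is high.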

It is important to note, however, that unlike height, we expect behavior, skills, and knowledge to change over relatively short periods of time. As a result, it is not always necessary for an assessment to have extremely high test-retest reliability, particularly over long time intervals during which we expect children’s skills to develop at different rates. Still, it is an important criterion for you to consider as you evaluate your assessment options.

Taking the Measure of Reliability

There are other forms of reliability, but internal consistency and test-retest reliability are most commonly reported and are arguably the most important. Each score that an assessment gives you will have an associated internal consistency reliability and test-retest reliability that you can research and use to evaluate the suitability of candidate assessments. Reliability scores range from 0 to 1, where 0 means not at all reliable and 1 means perfectly reliable.

So what reliability score is good enough? There is not complete agreement on that matter. However, here are a few general guidelines to consider. They describe internal consistency reliability. Test-retest reliabilities can be a little lower and be adequate.

An internal consistency reliability of .70 may be good enough for reporting group-level performance, aggregated at the classroom level or above, but is not good enough to assess individual student strengths and needs, particularly for high-stakes decisions such as placement in special education, diagnosing a child with a condition, or evaluating teacher performance. That is because with a reliability of .70, a child’s score on an assessment may be quite different from one administration to another, increasing the chances that you will come to an incorrect, and potentially harmful, conclusion about a child from the assessment’s scores.

An internal consistency reliability of .80 may be good enough to evaluate individual student strengths and needs, but not to make high-stakes decisions, including diagnosis or placement in special education services. If interpreted well, scores with this level of reliability may help provide formative information educators can use to tune instruction in a way that builds on the student’s strengths and addresses her needs.

An internal consistency reliability of at least .90 is important for high-stakes decision making, although frankly, I don’t recommend using SEL assessments of any kind for high-stakes decision-making.
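If it helps to see the three guidelines side by side, here is a small sketch that encodes them as a lookup. The cutoffs come straight from the guidelines above; the function name `suggested_use` and the wording of the labels are my own, and, as noted, there is not complete agreement on these thresholds.

```python
def suggested_use(internal_consistency):
    """Map an internal consistency coefficient to the rough guidelines above."""
    if not 0.0 <= internal_consistency <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if internal_consistency >= 0.90:
        return "may support high-stakes decisions"
    if internal_consistency >= 0.80:
        return "individual strengths and needs, not high stakes"
    if internal_consistency >= 0.70:
        return "group-level reporting only"
    return "below the guideline thresholds"


for value in (0.65, 0.75, 0.85, 0.92):
    print(f"{value:.2f}: {suggested_use(value)}")
```

Treat the output as a starting point for judgment, not a verdict. Reliability is only one of the things that determine whether a score is fit for a given purpose.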

Test-retest reliabilities can be a little lower, particularly if the interval between testing occasions is longer than two weeks. Interrater reliabilities are a bit of a different animal, and because they are not relevant to many assessment types, I won’t go into detail here. Regardless of the kind or level of reliability, however, it is important to note that the appropriateness of using assessment scores to achieve a specific goal depends on other factors in addition to reliability.

When it comes to reliability, for most educational applications, shoot for internal consistencies of .80 or higher and test-retest reliabilities of .70 or higher.

If Reliability Is Too Low, What Should I Do?

Let’s say you are using an assessment that has a reliability of .75. Should you not use it? Not necessarily. If the assessment measures what you are interested in, it can be helpful. But you should consider a couple of courses of action to decrease the odds you will inadvertently mis-estimate a student’s competence level from the score.

First, consider reviewing aggregated scores. We have found that classroom level reports that show the percentage of students scoring at different levels on an SEL assessment are specific enough to make instructional decisions without risking forming an inaccurate expectation of any individual student.

Second, if you must look at individual student scores, consider the confidence interval around the score. That is a range of scores within which the student’s true level of competence is likely to fall. If the assessment maker provides confidence intervals, you can easily see how much you should take any individual student’s scores with a grain of salt. With an “average” score that has low reliability, for example, there may be an 80% likelihood that the student’s true level falls somewhere between well below average and well above average, so interpret with caution.
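If the assessment maker does not provide confidence intervals, you can get a rough sense of their width yourself. The sketch below uses the classical standard error of measurement, which is the score’s standard deviation times the square root of one minus the reliability. That formula is standard in classical test theory, but it is my assumption that it suits your assessment; publishers may compute intervals differently. The score scale (mean 100, standard deviation 15) and the helper name `score_interval` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def score_interval(score, score_sd, reliability, confidence=0.95):
    """Approximate confidence interval around an observed score.

    Uses the classical standard error of measurement:
    SEM = SD * sqrt(1 - reliability).
    """
    sem = score_sd * sqrt(1 - reliability)
    # z value that leaves (1 - confidence) / 2 in each tail of a normal curve.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return score - z * sem, score + z * sem


# Hypothetical scale: average score 100, standard deviation 15.
for reliability in (0.75, 0.90):
    low, high = score_interval(100, 15, reliability)
    print(f"reliability {reliability:.2f}: 95% interval {low:.1f} to {high:.1f}")
```

On this made-up scale, a reliability of .75 gives an interval roughly 29 points wide around an average score, while a reliability of .90 shrinks it to roughly 19 points. The same score tells you noticeably less about the student when reliability is lower.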

Now, reliability is important, but it is not the only consideration when judging the quality of an assessment. Stay tuned for the next installments of the Consumer’s Guide to Social and Emotional Assessment, where we will consider how to judge the ability of an assessment to achieve the goals you set for it.