Usable Knowledge

Daniel Koretz

Measure for measures: What do standardized tests really tell us about students and schools?

HGSE Professor Daniel Koretz

A test is like a poll

Daniel Koretz: The one that really matters the most, the misconception that matters the most, is the notion that somehow a good test measures all of what’s important. A good test is like a political poll. It’s a very small sample of something much larger. For example, we give in Massachusetts, and many states give, tests in high school that are supposed to summarize what kids have learned in mathematics over the course of, in most cases, ten years.

Well, hopefully they’ve learned a great deal, but the number of actual questions on that test is very small. So just as you predict a presidential election by polling 500 or 700 or 1,000 out of 120 million voters, you sample from this big domain of achievement a modest number of things that allow you to predict the whole. That’s all a test is, and its value is only as a tool for estimating what kids really know about the whole.

Failing to understand that underlies, I think, a lot of the unfortunate consequences of high stakes testing today because people think that if they teach what’s on the test, they must be doing the right thing. It’s analogous to saying if we persuade the people in the next poll to vote one way or the other, we’ve won the election.
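The poll analogy can be made concrete with a small simulation. The sketch below is illustrative only: the domain size, test length, mastery rate, and the function name `simulate` are all assumptions made for the example, not figures from the interview. It treats achievement as a large set of skills, draws a small test from that set, and then shows what happens when only the tested skills are drilled.

```python
import random


def simulate(domain_size=10_000, test_size=50, mastery_rate=0.6, seed=0):
    """Hypothetical model: a test is a small sample of a large skill domain."""
    rng = random.Random(seed)

    # True/False for each skill in the domain: has the student mastered it?
    domain = [rng.random() < mastery_rate for _ in range(domain_size)]
    true_mastery = sum(domain) / domain_size

    # The test samples a few skills, the way a poll samples a few voters.
    test_items = rng.sample(range(domain_size), test_size)
    honest_score = sum(domain[i] for i in test_items) / test_size

    # "Teaching to the test": drill only the skills that appear on the test.
    coached = list(domain)
    for i in test_items:
        coached[i] = True
    coached_score = sum(coached[i] for i in test_items) / test_size
    coached_mastery = sum(coached) / domain_size

    return true_mastery, honest_score, coached_score, coached_mastery


true_mastery, honest_score, coached_score, coached_mastery = simulate()
print(f"true mastery of the domain:   {true_mastery:.3f}")
print(f"test score without coaching:  {honest_score:.3f}")
print(f"test score after coaching:    {coached_score:.3f}")
print(f"domain mastery after coaching: {coached_mastery:.3f}")
```

Under these assumptions the uncoached score is a rough estimate of mastery of the whole domain, while the coached score reaches 100 percent even though mastery of the domain rises by at most the share of skills that were on the test.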

Improvement vs. accountability

Daniel Koretz: Testing for instructional improvement and testing for accountability are, in some sense, opposite ends of the continuum. They’re not necessarily in direct conflict. If you go back to the days when testing was lower pressure, let’s say you gave a test to fourth graders in your district and they found that they don’t know how to do long division very well compared to other fourth graders. Well, what would be a logical response? You’d want teachers to put more emphasis on long division.

That may seem like a trivial example, but that’s part of the purpose of testing, to allow you to know what you need to do differently or better. That’s the goal of accountability as well, to tell people that we’ve now figured out what they do well and what they don’t do well, and to push them to do some things better.

It’s really a matter of degree, that if the pressure becomes too severe, then people game the system. And this is not a problem limited to education; it’s just everywhere you look. So for example, some years ago, the British National Health Service imposed time limits on the amount of time that patients could be waiting in emergency rooms, which for people who’ve waited in emergency rooms would seem like a very good thing to do. And it was a good thing to do, but unfortunately people gamed the system in a number of ways, one of which is that some hospitals kept patients in queues of ambulances out on the street until they had enough room that they were confident that they could get them through in four hours.

Well, the answer to that problem isn’t, stop worrying about wait times. It’s, find a better way to hold hospitals accountable for keeping wait times short. And the same is true in education. The answer to the current problems we’re seeing is not, in my view, stop holding schools accountable for teaching kids. It’s, find a better way to do it, one that has fewer side effects.

The problem with “bins”

Daniel Koretz: Breaking people into these bins — below basic, basic, proficient, advanced — has, in my view, been one of the worst decisions we made in testing in decades. And there are many reasons for that. One of them is that it’s a very insensitive way to report performance. So imagine that you’re teaching in a really poor school and your state has high standards. So a lot of your kids are way below the basic standard, and you get them up a really large distance in the first couple of years you’re teaching there. But most of them don’t quite get to the basic threshold.

That will show up in today’s reporting systems as no progress whatever. Whereas some other person who has figured out what’s now called the bubble kids, who the bubble kids are, the kids right on the cusp, and moves a bunch of them from just below that cut score to just above will look like they’re making huge progress. So the first problem is that. It just is a very coarse and potentially misleading indicator.
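A toy calculation shows how coarse the bins can be. Everything here is hypothetical: the cut score of 50, the two five-student classrooms, and their scores are invented for illustration and do not come from any real assessment.

```python
BASIC_CUT = 50  # hypothetical cut score for the "basic" bin


def percent_at_or_above(scores, cut=BASIC_CUT):
    """Share of students whose score is at or above the cut."""
    return sum(s >= cut for s in scores) / len(scores)


def mean_gain(before, after):
    """Change in the average score between two years."""
    return sum(after) / len(after) - sum(before) / len(before)


# Classroom A: far below the cut, large gains, nobody crosses the line.
class_a_before = [10, 15, 20, 25, 30]
class_a_after = [35, 40, 45, 48, 49]

# Classroom B: "bubble kids" nudged from just below the cut to just above.
class_b_before = [48, 49, 49, 52, 55]
class_b_after = [50, 51, 50, 52, 55]

for name, before, after in [
    ("A", class_a_before, class_a_after),
    ("B", class_b_before, class_b_after),
]:
    print(
        f"classroom {name}: mean gain {mean_gain(before, after):.1f} points, "
        f"percent basic {percent_at_or_above(before):.0%} -> "
        f"{percent_at_or_above(after):.0%}"
    )
```

In this made-up case classroom A gains more than 23 points on average and still reports 0 percent at basic in both years, while classroom B gains 1 point on average and goes from 40 percent to 100 percent.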

The second is, it creates really bad incentives. And we’re now beginning to slowly accumulate research confirming that when you reward teachers for getting kids across one line, some teachers focus on the kids who are near the line to the detriment of others.

The third reason is much more complicated to explain, but I think very important, which is that if you report as we now do by law, report performance in terms of statistics like percent of kids who are proficient, and you ask really fundamental questions, like, “Are African-American kids catching up with whites or falling further behind?” (it’s hard to come up with a more important question than that), you will always get the wrong answer if you report in terms of the percent above a cut score.

To explain the mathematics now would take more time than you want, I think, but the basic bottom line is that a change in that percentage depends on two things. One is, how much progress is each group really making? And the second is, how many of the kids in that particular group are bunched up near that cut score? And this has been shown. This is not just an abstract possibility. It’s been shown, for instance, very clearly with the national assessment. If you look at reasonable measures, you’ll see that African-Americans have resumed gaining on whites. They gained for many years. That progress stopped in the ’90s, and it just resumed.

If you look at percent above cut scores, the percent who are in the above basic, above proficient categories, what you get is a hodgepodge. For some cut scores, it looks like African-Americans are gaining. For some cut scores, it looks like they’re falling further behind. And that’s completely an artifact of where the cut scores are set.
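The cut-score artifact can be sketched with two bell curves. This is a simplified illustration under stated assumptions, not the analysis of the national assessment: both groups are assumed to be normally distributed with the same spread, the means, the gain, and the two cut scores are invented, and the function names are hypothetical. Both groups are given exactly the same gain, so the true gap between them does not change at all.

```python
from statistics import NormalDist


def pct_above(mean, cut, sd=1.0):
    """Share of a normally distributed group scoring above the cut."""
    return 1.0 - NormalDist(mean, sd).cdf(cut)


def gap_change(cut, mean_a, mean_b, gain):
    """Change in the percent-above-cut gap when both groups gain equally."""
    before = pct_above(mean_a, cut) - pct_above(mean_b, cut)
    after = pct_above(mean_a + gain, cut) - pct_above(mean_b + gain, cut)
    return after - before


# Hypothetical values, in standard deviation units.
MEAN_A, MEAN_B = 0.0, -0.8   # group B starts 0.8 SD behind group A
GAIN = 0.2                   # both groups improve by the same amount

for label, cut in [("low cut", -1.0), ("high cut", 1.0)]:
    change = gap_change(cut, MEAN_A, MEAN_B, GAIN)
    direction = "narrows" if change < 0 else "widens"
    print(f"{label} ({cut:+.1f}): gap in percent above cut {direction} "
          f"by {abs(change):.1%}")
```

With these numbers the gap in percent above the low cut narrows and the gap in percent above the high cut widens, even though the difference in average scores is identical before and after. The direction depends only on which group has more students bunched near the cut.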

