STATS ARTICLES 2010
Adding it up: How good is Value Added Modeling at measuring teacher performance?
Rebecca Goldin Ph.D, September 29, 2010
After a widely respected teacher, Rigoberto Ruelas, committed suicide, his family blamed the Los Angeles Times, which recently gave him an “average” and “slightly less effective” rating in a new system to measure teacher performance in the city. The newspaper said it published the database “because it bears directly on the performance of public employees who provide an important service, and in the belief that parents and the public have a right to judge the data for themselves.” But a detailed examination by STATS asks whether people were really able to “judge the data for themselves” when it was the product of such a complex and problematic model.
We all want our children to succeed at learning. And we all agree that schools and teachers play an important part in making that happen. It only stands to reason, then, that we would want the system – the schools, the teachers, the administration, and the kids – to be accountable for learning. This simple yet fundamentally noble idea spurred the development of No Child Left Behind (NCLB) legislation. And now No Child Left Behind is quickly becoming No Teacher Left Unaccountable, as states seize on techniques to evaluate the teachers (and sometimes, the schools).
The basic premise of NCLB is that any child can and should succeed at meeting certain “benchmarks” that individual states determine. But who gets the credit when kids meet that minimal level, and who is to blame when they do not? The teachers, the schools, the parents, the administrators, the “cultural ambiance”, social constraints, or the kids themselves? While we might, cavalierly, respond, “All of the above,” from a practical point of view we have to attribute value to each piece of the puzzle; and the first question is: how much is attributable to the teacher?
When we ask ourselves, “What makes a good teacher?” we might come up with a variety of answers: someone inspirational, someone who makes kids love to learn, someone with whom the kids are learning (as measured by an assortment of metrics), someone who can manage a classroom and create an environment for creativity and exploration; or, we might look for something concrete, someone who can raise test scores.
The elephant (or, perhaps, the donkey) in the room is how to evaluate teachers. If using the percentage of kids who meet a benchmark is a bad measurement (due to reasons discussed below) how can we improve on it? Is there any way to use the large amount of data from NCLB tests to figure out how effective a teacher is?
Value-added modeling is all about trying to answer this question. While it brings new metrics to the table, it also has important limitations, especially if used as a single measurement. In the wake of a teacher suicide that has been in part attributed to a negative assessment by a value added model, we pose the key question: how accurately can we judge a teacher by the data?
The Problem with Absolute Measures
If we evaluate teachers based on absolute standards, we immediately run into problems. Imagine that two equally skilled teachers teach very different classes – one teacher gets well cared-for kids with involved, educated parents in a school with many resources, and the other teacher gets kids with a variety of home-based problems such as poor nutrition, violence in their neighborhoods, and a school that does not always provide books. There is little doubt about which group of kids would be more successful at basic mathematics and reading. It would be unreasonable to attribute the differences to the teachers.
When NCLB first rolled out, standardized tests were used to evaluate the kids and determine whether they had obtained basic knowledge dictated by state standards. It quickly became clear that some kids were far more prepared for school (and for the tests) than others. Poorer children with less educated parents consistently do worse on these tests than their wealthier counterparts; black and Hispanic kids do worse than white and Asian kids; immigrants and new English speakers do worse than native English speakers. Teachers consistently note that scores on standardized tests are heavily dependent on factors out of their control.
But in highly affluent communities and in those plagued by the challenges of poverty alike, there are teachers who deserve the apple, and those who deserve the boot. The question is how to distinguish them, when students can be so different in the classroom. Clearly, an absolute measure such as the percentage of students scoring above a benchmark is problematic, if only because kids from different demographics are so different in the classroom. The idea of doing a relative assessment of teachers is ripe for statistical interpretation.
Using the Data to Evaluate Teachers
We all know that there are effective and ineffective teachers, but how can we pare away all the factors that go into what a kid learns each year, and isolate the part that the teacher is responsible for?
Enter Value Added Modeling [VAM]. VAM is a statistical technique that takes into account an enormous amount of data on kids’ backgrounds and their performance on standardized tests, and attempts to isolate the part of their progress that can be attributed to teachers. VAM is increasingly being used to hold teachers accountable, sometimes with consequences, including bonuses for “good” teaching and firings for “bad” teaching.
The New York Times recently described VAM fairly accurately, without the technical details. Educational jurisdictions that use VAM include the states of California, Tennessee, North Carolina, Colorado, and Texas, as well as the cities of New York and Chicago. The Los Angeles Times shook up Los Angeles by publishing a VAM analysis of the city’s teachers’ talent designed by the economist Richard Buddin of the Rand Corporation. (Buddin also wrote an accompanying technical article.)
The basic premise is that we should be able to evaluate the performance of the teachers by looking at how well their students do, using data from standardized tests. VAMs avoid using absolute measures of test scores, since as we pointed out, some kids are far more prepared for school than others before they even enter the classroom. Instead, they apply statistical magic to compare test results for very different types of kids. The problem is difficult and persists despite the millions of dollars of funding that has gone into evaluating schools, teachers, and students. One proposed solution, adopted by the VAM community, is that we should not evaluate students’ absolute progress, but rather their relative improvement. As The New York Times put it, “A student whose third-grade scores were higher than 60 percent of peers statewide is predicted to score higher than 60 percent of fourth graders a year later. If, when actually taking the state tests at the end of fourth grade, the student scores higher than 70 percent of fourth graders, the leap in achievement represents the value the fourth-grade teacher added.” The question is, does it really?
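The logic of the quoted example can be sketched in a few lines of Python. This is only a toy illustration: the function name, the classroom data, and the idea of averaging the gaps across a class are our own assumptions, and real value-added models are far more elaborate.

```python
# Hypothetical illustration of the percentile-based logic in the quoted example.
# All names and numbers are invented for this sketch.

def value_added(prior_percentile, actual_percentile):
    """Predict that a student keeps last year's percentile; the 'value added'
    is the gap between the actual and the predicted percentile."""
    predicted = prior_percentile
    return actual_percentile - predicted

# (third-grade percentile, fourth-grade percentile) for one teacher's class
classroom = [(60, 70), (45, 40), (80, 85), (30, 35)]

per_student = [value_added(prior, actual) for prior, actual in classroom]
teacher_score = sum(per_student) / len(per_student)

print(per_student)    # [10, -5, 5, 5]
print(teacher_score)  # 3.75
```

The first student is the one from the quoted example: predicted to stay at the 60th percentile, actually at the 70th, for a "value added" of 10 percentile points.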
According to one of the most widely cited scholarly articles on the topic (McCaffrey, et al., 2004), the models conclude that teachers’ contribution to the variability of outcome (test scores) is in the range of 4 to 20 percent, depending on the model. If we can attribute increases in test scores to good teaching, then maybe VAM can be used to identify, incentivize, and implement good teaching practices. On the other hand, we have to be very careful with what exactly we are measuring.
A Class of Statistical Models
Value-added modeling is not one particular model, but rather a class of statistical models. Some models might be better than others at getting data to fit the models, or predicting future educational success based on teachers or based on schools. Some VAMs take into account demographic aspects of kids (poverty, race, gender, etc.) and others do not. Some make assumptions that others do not. Many scholarly articles (and many value-added models) attempt to get around very real and persistent difficulties with holding teachers accountable for student performance. However, skeptics of VAM note that none of these models successfully addresses these difficulties.
Each VAM has its own features, but there are some common features to all. The first is the reliance on test data as a proxy for learning. In other words, we assume that test data on kids reflects their knowledge. This assumption has some consequences, which we elaborate on below – but as a first pass, it should be clear that VAM does not take into consideration other proxies for educational success, such as graduation rates, grades, interviews, attitudes toward specific topics, or any educational goal that goes unmeasured by standardized tests.
The Pros and Cons of Using Test Data to Evaluate Teachers
The idea of evaluating progress in a relative sense is deceptively simple. Even in a perfect world, in which students are assigned to random teachers at random schools and all the students stay in one place for the whole year, some immediate problems arise. First and most ignored is how well the test data on the students actually reflect their learning.
Arguably, state tests amply recognize accomplishment close to the state level, either above or below. Such tests may well be accurate in schools with students operating near the margin of competency. However, for kids who are far from the average – either above or below – the test scores may misrepresent the knowledge base. This is a problem with the testing range.
For example, let us suppose Jenny has deep mathematical knowledge and talent, and she is also good at taking standardized tests. While she already knows fractions, her third-grade curriculum is all about multiplication and division. She scores in the 95th percentile on a state exam. The following year, in fourth grade, she hasn’t learned a thing. She still knows fractions, but now the curriculum involves fractions. She scores again in the 95th percentile, having learned nothing new. In contrast, her twin brother Jonny who has similar skills in third grade, has an astute teacher who notices his talent. Jonny is given some additional guidance, and has progressed fairly well into learning basic algebra in third grade and some advanced geometry in fourth grade. He also scores in the 95th percentile both years. In other words, both kids had advanced knowledge and the test could not differentiate between how much they knew, or the fact that Jonny had an inspired teacher who brought him to greater mathematical maturity.
A similar problem happens when students are far below the average. Imagine that Molly has trouble doing basic sums in third grade. Her teacher works with her independently, gives her additional exercises at her level, and starts to introduce her techniques to add two-digit numbers and to subtract numbers under 20. She makes tremendous progress in third grade, but the test will not show any progress at all since she is still performing far below grade level. In contrast, her brother Michael of similar ability may have had a teacher who essentially gave up on him – and he would score similarly to Molly on the third and fourth grade standardized tests.
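The testing-range problem in these two examples can be sketched as a test that can only report scores within a limited band. The scale, the floor and ceiling, and the "knowledge" numbers below are invented for illustration; they are not taken from any actual state test.

```python
# Hypothetical scale: the test can only report scores between FLOOR and CEILING.
FLOOR, CEILING = 30, 100

def observed_score(true_knowledge):
    """Clamp a student's (unobservable) knowledge to the range the test can report."""
    return max(FLOOR, min(CEILING, true_knowledge))

# (knowledge at start of year, knowledge at end of year), in invented units
students = {
    "Jenny":   (120, 120),  # far above the range, learned nothing new
    "Jonny":   (120, 160),  # far above the range, learned a great deal
    "Molly":   (10, 25),    # far below the range, made real progress
    "Michael": (10, 10),    # far below the range, made no progress
}

measured_gains = {}
for name, (start, end) in students.items():
    true_gain = end - start
    measured_gains[name] = observed_score(end) - observed_score(start)
    print(f"{name}: true gain {true_gain}, measured gain {measured_gains[name]}")
```

In this sketch all four students show a measured gain of zero, even though Jonny and Molly learned a good deal and Jenny and Michael did not.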
In addition, there are many learning objectives that these tests do not (and cannot) reflect, such as love of reading, artistic expression, mathematical exploration, and, more generally, enthusiasm for learning. There are also many measures of success, such as grade-completion, retention, and attendance, which are not included in the models.
Another concern about high-stakes standardized testing is that the more seriously we take the data from tests, the more we encourage “teaching to the test” and “test-taking skills.” This can be a major problem for how accurately a test measures knowledge. For example, the SAT tests general vocabulary and the ability to reason verbally. Suppose two kids have similar vocabulary levels, but one is given some test-taking techniques, such as knowledge about how test writers invent the wrong choices and how to use process of elimination to increase her chances of a correct answer. The other is simply told to fill in the bubble properly and erase answers completely.
The difference in test scores between these two students may be quite appreciable, though it does not reflect a difference in knowledge about verbal reasoning. Similarly, if one student is given a list of vocabulary words to study that have commonly appeared on the test, the test would no longer be an accurate reflection of her verbal knowledge as a whole. In mathematics, teaching the material that will be tested may well be at the expense of material that is not tested – and that would change test scores without indicating significantly different mathematical knowledge.
But there are good arguments for using data as well. For one, it is objective in a way that graduation rates or grades are not. It forces teachers to actually teach the material that they are required to teach, rather than avoid their least favorite topics. It focuses attention on the kids with the greatest difficulty in attaining minimal proficiency by making them a “high priority” for schools. It makes clear that some core material is considered more important than other material (mathematics, for example, is considered a higher priority than art). At least in part, real data tells us about real kids and where they are in their skills. And while certain ranges of progress cannot be measured by the tests, new tests are often being developed to deal with that issue. For example, some states are introducing dynamic tests that ask different questions based on whether the test-taker got a previous question correct or not.
The entire VAM idea is based on data collected on students. Its ability to judge the teachers can at best only be as good as the data itself. For this reason, many VAM proponents suggest that they only be interpreted in the context of reading and math skills, which are the subject areas for which there is a lot of test data. Standardized tests, for better or worse, serve as a proxy for better measurements of learning. For example, we may want to test the ability of students to reason logically about a mathematical problem and solve it. A multiple-choice test will invariably test something slightly different, likely emphasizing the computational aspect of a problem, not the reasoning skills. Similarly, if we want to test how well students can write an essay, our ideal test would have students write an essay. But the cost constraints of standardized tests may mean we have to test related skills using multiple-choice.
Using only test scores to evaluate teachers has some of the same pitfalls as using only testing to evaluate students: if a teacher, for example, spends time teaching his students about mathematical reasoning at cost to time reviewing basic skills, he may be evaluated more poorly than a colleague who emphasizes computational proficiency alone. From the point of view of minimal competency, we may want teachers to emphasize skills over reasoning, but VAM is based on a scale, and does not have “cut-offs” after which reasoning or other advanced mathematical skills may be valued.
How Good Are The Models?
Even if we assume that the data does reflect everything we wish to measure about students’ progress in school, or at least enough to move forward, there are still issues that make VAM challenging.
The general formula for VAM is one that assumes that test scores depend linearly on several observed and unobserved inputs. In the case of the Los Angeles Times analysis, these included test scores from the previous year, classroom characteristics, gender, race, parents’ education, special needs, teacher characteristics, and “persistence of prior-year learning”, which takes into account the proportion of the year that the student was with the teacher. The model also makes some assumptions, including that students’ assignments to teachers are independent of unobserved characteristics. As we discuss below, several such fundamental assumptions may not be empirically justified.
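As a rough sketch, a linear value-added specification of the general kind described above might be written as follows. The notation is our own, chosen for illustration; it is not the exact specification used in the Los Angeles Times analysis.

```latex
y_{it} = \lambda\, y_{i,t-1}
       + \mathbf{x}_{it}^{\top}\boldsymbol{\beta}
       + \mathbf{c}_{it}^{\top}\boldsymbol{\gamma}
       + \theta_{j(i,t)}
       + \varepsilon_{it}
```

Here $y_{it}$ is the test score of student $i$ in year $t$, $\lambda$ is the weight given to the prior-year score, $\mathbf{x}_{it}$ collects observed student characteristics, $\mathbf{c}_{it}$ collects observed classroom characteristics, $\theta_{j(i,t)}$ is the effect attributed to the teacher $j$ who taught student $i$ in year $t$, and $\varepsilon_{it}$ gathers everything unobserved. In this notation, the assumption that assignments are independent of unobserved characteristics says that which teacher a student gets is unrelated to $\varepsilon_{it}$.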
The issue of how the model takes into account increments in test scores relies on a value judgment. Should increases in test scores be viewed as identical for a student going from the 50th to the 55th percentile and for a student progressing from the 90th to the 95th percentile? One could argue that making progress so near the top is harder than making progress while in the middle. One way or another, the models need to decide how to value the progress. And if someone whose score was as good as guessing suddenly looks slightly better than guessing, we have no idea if that represents tremendous progress or very minor progress. The models must put relative value on all of these changes, in order to attribute them to the teachers in a comparable way.
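One way to see why equal percentile gains need not be equal is to translate them into score gains. The sketch below assumes, purely for illustration, that test scores follow a normal (bell-curve) distribution; actual test score distributions may differ.

```python
from statistics import NormalDist

# Assumption for illustration: test scores follow a standard normal distribution.
scores = NormalDist(mu=0.0, sigma=1.0)

def gain_in_sd(from_percentile, to_percentile):
    """Score gain, in standard deviations, needed to move between two percentiles."""
    return scores.inv_cdf(to_percentile / 100) - scores.inv_cdf(from_percentile / 100)

middle_gain = gain_in_sd(50, 55)
top_gain = gain_in_sd(90, 95)

print(f"50th -> 55th percentile: {middle_gain:.2f} SD")
print(f"90th -> 95th percentile: {top_gain:.2f} SD")
```

Under this assumption, moving from the 90th to the 95th percentile takes a score gain of roughly 0.36 standard deviations, nearly three times the roughly 0.13 needed to move from the 50th to the 55th.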
There is also a subtler issue at play in how these models work. Statisticians are concerned with the built-in bias of statistical models. These are aspects that, by the nature of the model itself, will favor some kinds of teachers over others. In many cases, including with VAM, bias is possible or unknown because it depends on many unobserved factors.
Built-in bias. For example, suppose that, based on seniority, teachers may choose certain classes to teach (or even certain schools to teach in). Perhaps they tend to choose classes with students who are more enthusiastic or advanced learners, or classes with more resources. Due to seniority alone, some teachers may end up with a class filled with students who will have more progress during the year compared to their peers for a variety of reasons having nothing to do with their teachers. Yet, compared to a rookie teacher who did not have the choice of classes/schools, the senior teacher will seem to have encouraged her students to make big strides. This possible bias in the model is difficult to tease out because elements such as “enthusiasm” or even “advanced learner” are often not included – or cannot be included – in the model; and yet, such elements are observed either explicitly or implicitly by teachers, and have an impact on which class the teachers may teach.
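A small simulation can illustrate this kind of bias. In the sketch below, two teachers are equally effective by construction, but the senior teacher's class has higher average "enthusiasm", a trait that drives progress and that the evaluator cannot see. Every number here (the effect sizes, the class size, the noise levels) is invented for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

TEACHER_EFFECT = 5.0    # both teachers add exactly the same amount (by construction)
CLASS_SIZE = 2000       # unrealistically large, to keep sampling noise small

def simulate_gain(mean_enthusiasm):
    """Average score gain for a class; 'enthusiasm' is unobserved by the model."""
    gains = []
    for _ in range(CLASS_SIZE):
        enthusiasm = rng.gauss(mean_enthusiasm, 1.0)
        noise = rng.gauss(0.0, 2.0)
        gains.append(TEACHER_EFFECT + 3.0 * enthusiasm + noise)
    return sum(gains) / len(gains)

senior_estimate = simulate_gain(mean_enthusiasm=1.0)  # senior teacher picks keen students
rookie_estimate = simulate_gain(mean_enthusiasm=0.0)  # rookie gets whoever is left

print(f"senior teacher's average gain: {senior_estimate:.1f}")
print(f"rookie teacher's average gain: {rookie_estimate:.1f}")
```

An evaluation based on average gains alone would credit the senior teacher with the extra progress, although in this simulation it comes entirely from who was in the class.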
One way to fix this problem would be not with the model, but with the way that students are assigned to classes. In an ideal world, we could randomize the process by assigning students randomly to classrooms (within a school, or even better, within a school system) and then randomly assign the teachers to these schools. This would cancel out the bias that comes from several sources: students are regularly “tracked” in school, classes of students are put together based on factors that teachers and administrators observe (such as behavioral issues), parents push for particular opportunities such as a better school, and teachers tend to gravitate toward classes in which they feel they will be successful.
Yet the reality on the ground is that parents and teachers have a great influence on students’ placement. This is most obvious in the context of school choice, where parents who are concerned with educational quality may decide to live in an area where the schools have a better reputation, or if given a choice within a neighborhood, choose one school over another based on a variety of factors. This bias can in part be addressed by introducing a value-added school model, but this may introduce other biases since better teachers may cluster at the same school.
Demographics. Another major concern is how the model takes into account various factors. VAM does not always take into account student characteristics (such as race, gender or income). While these factors are highly correlated with performance, proponents of not including them in VAM claim that they are not correlated with progress. However, some research suggests that in some cases these factors do play a role in progress. The issue then is how to (if at all) adjust for these factors in order to remove their influence on the performance evaluation of a teacher.
In many models, unobserved characteristics can have an unexpected effect. This can be tricky, because we do not know how these characteristics change the effect attributed to teachers. For example, family wealth may be correlated with “English learner status”, where the former is an unobserved variable and the latter is an observed variable [Buddin].
Missing Data. The models have to find a way of contending with missing data: not all kids take the relevant tests, and some test data does not link the students’ records to the appropriate teachers. While there are techniques to address missing data, bias will be introduced if the missing data is correlated to something unobserved or unaccounted for in the model. Suppose, for example, that kids who do poorly on a test in fifth grade are more likely to be absent on test day for sixth grade. If we fill in a teacher’s score-sheet with grades from other kids who did take the test, we might suggest that the teacher did a better job than she did. The challenge of filling in missing data is that we don’t know how well the kids who missed the tests might have done had they taken them – and we cannot assume that on average they would have done the same as those who did take them. If enough data is missing, in statistical terms, the models have increased error. This means we are more likely to have a spurious assessment of a teacher if we only judge her by a few students. While this seems like a small problem, in some areas where populations are more transient and skipping school is more common, it can have a large impact on the model. The less data, the less certainty we have about an individual teacher’s contribution.
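The absent-on-test-day example can be made concrete with a handful of invented scores. The sketch below uses one naive way of filling gaps (substituting the average of the students who were tested); this is an assumption for illustration, not a description of how any particular value-added model handles missing data.

```python
from statistics import mean

# Invented sixth-grade scores for one teacher's class. true_scores is what every
# student would have scored; None in recorded marks a student who was absent on
# test day (here, the two who were weakest in fifth grade).
true_scores = [40, 45, 50, 70, 75, 80, 85, 90]
recorded = [None, None, 50, 70, 75, 80, 85, 90]

observed = [s for s in recorded if s is not None]
observed_mean = mean(observed)

# Naive imputation: fill each gap with the average of the students who were tested.
filled = [observed_mean if s is None else s for s in recorded]

print(f"class average if everyone had been tested: {mean(true_scores):.1f}")
print(f"class average after filling in the gaps:   {mean(filled):.1f}")
```

Because the absent students were the weakest ones, the filled-in average (75.0) overstates what the class would have scored had everyone been tested (about 66.9).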
Validity of data. As discussed above, value-added models are entirely based on student test scores. These tests at best measure accomplishment and progress in a subset of standards established by the state for each grade level. Even if these tests accurately measure this progress, the fact that they do not measure the whole of what we consider “good teaching” is one reason that VAM without other, independent evaluation is a poor measure of a teacher’s success or failure. By construction, at best, VAM measures a teacher’s ability to improve test scores, which means the teacher’s ability to teach that subset of learning goals that are tested.
Independent variables. One of the consistent assumptions is that a teacher’s effect is essentially the same for all students, and that what happens with one student does not affect the result of another student. Many teachers have observed something different: there is a dynamic to a classroom, and students have a lot of influence over one another (this dynamic may be attributable to the teacher, but this is something to be determined empirically). Also, for some students the teacher plays a larger role in progress than for other students. Yet the models do not account for these differences. Similarly, the models tend to assume that school effects are independent of teacher effects. In other words, a teacher who moved to another school would thrive or falter (compared to colleagues, measured in terms of students’ resulting test scores) to the same degree there. This assumption also may not be empirically established.
Persistence of effect. Some models assume that the teacher’s effect persists indefinitely into the future. Others assume that the impact of previous teachers tapers off over time. Again, the question of whether these assumptions are true is an important one, and is not addressed in the model itself. Each model simply makes an assumption – which may be extremely important for how we evaluate a kid’s current teacher. For example, should a child’s second grade teacher get the credit for how well the child does in fourth grade? If so, how much of the credit?
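The difference between the two assumptions can be sketched with a single "persistence" parameter. The geometric decay used below is one simple way to represent an effect that tapers off; it is an assumption for illustration, and the function name and the numbers are invented.

```python
def credited_effect(teacher_effects, persistence):
    """Total contribution of past and current teachers to this year's score.

    teacher_effects: effects ordered from the earliest grade to the current grade.
    persistence: fraction of a teacher's effect carried into each later year
                 (1.0 = persists indefinitely, below 1.0 = tapers off).
    """
    current = len(teacher_effects) - 1
    return sum(effect * persistence ** (current - grade)
               for grade, effect in enumerate(teacher_effects))

# Invented effects for a child's second-, third- and fourth-grade teachers
effects = [4.0, 0.0, 2.0]

full_persistence = credited_effect(effects, persistence=1.0)  # 4 + 0 + 2 = 6.0
half_persistence = credited_effect(effects, persistence=0.5)  # 4 * 0.25 + 0 + 2 = 3.0

print(full_persistence, half_persistence)
```

With full persistence, the second-grade teacher accounts for two thirds of the total in fourth grade; with the effect halving each year, the same teacher accounts for only one third. The choice of assumption changes how much is left to attribute to the current teacher.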
Multiple subjects per grade, multiple school systems and multiple classes of students. Models may assume that within a school system, teachers, students and school effects are exchangeable. This assumption falls apart when data is pooled over several school systems (or perhaps over several schools within one school system). This is precisely because the models assume that some variables are fixed, and in fact they may vary among different schools.
Test timing and test scoring. VAM must deal with tests done in several different grades and compare scores across grades, as well as the fact that different schools may test at different times. For example, how should we compare the resulting test scores of fourth graders for Mrs. Johnson with the sixth grade test scores for Mr. Jackson? One might answer that the percentile improvement would be the best measurement – but if the scores were more widespread for fourth than for sixth grade, it would be “easier” to have a percentage gain in fourth than in sixth. A categorical judgment must be made in the model to make these comparisons, and to some it may have an element that is arbitrary.
Can VAM evaluation be gamed? Suppose that all pay raises and bonuses were based on a VAM in a particular district. Would there be a “best type” of student to pick, in order to have the most favorable outcome?
Unfortunately, there would be. As with using absolute measures to judge teachers (such as how many kids passed the state NCLB tests), it would not be in a teacher’s interest to teach the weakest students, as their progress may be great and yet go unmeasured by a standardized test. The same holds for the strongest students. Someone interested in gaming the system would pick the most middle-of-the-road kids he could.
What Are the Redeeming Features of VAM?
Despite its limitations, VAM has some advantages over other simple and statistical methods that might be used to evaluate teachers. It has the immediate advantage of not being arbitrary or immediately political in the way that, for example, an assessment by a principal coming to visit a class might be. It is resistant to bias due to self-interest, which can affect peer-to-peer evaluations and self-regulation.
It is also a better judge of teachers’ contributions to student progress than, for example, the percentage of students that meet a minimal benchmark (e.g. passing an NCLB test), as it does not have immediate bias against teachers who teach underprepared children. For these reasons, VAM could be a welcome contribution as part of an evaluation of teachers.
Testing is here to stay, and states are scrambling to create better tests that more accurately reflect knowledge. National standards are being developed, spearheaded by the National Governors Association and the Council of Chief State School Officers. We need to find constructive ways of using the data that comes in – and despite the problems with NCLB data, it does reflect something about what teachers are doing. A teacher will not score highly on a value-added model without doing a great job with her students. And someone whose kids are consistently doing worse than they had previously likely needs additional support (or to be shown the door). But for most teachers, VAM is likely only a small part of the story of their classrooms, their successes and their failures.
Just as only a poor teacher would grade his students based solely on their performance on NCLB exams, only a newspaper would judge teachers based solely on their score on a value-added model. (The good news is that some news sources, such as Time, explained the problem of a partial evaluation without too much technical detail.) The problem with The LA Times is that they presented a value-added model as if it told the story of the best and the worst teachers; and because it brings along a ranking, it may well take on a life of its own. Unfortunately, many factors that are an important part of good teaching are not measured, and the VAM method of evaluation inherently contains a fair amount of bias and error. Observations, interviews and traditional professional evaluation must be part of any comprehensive and high-stakes evaluation of teacher performance.
Special thanks to Patrick McKnight, Assistant Professor of Psychology at George Mason University for contributing his expertise to this article.
Rebecca Goldin is an Associate Professor of Mathematics at George Mason University, and the Director of Research at STATS. She is also Chair of the Science Policy Committee of the American Mathematical Society and a member of the American Statistical Association, the Mathematical Association of America, and the Association for Women in Mathematics.
Buddin, Richard. How Effective Are Los Angeles Elementary Teachers and Schools? Published on line by the Los Angeles Times, August, 2010.
Braun, Henry. Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models. Policy Information Perspective, ETS. September, 2005.
McCaffrey, Daniel, Lockwood, J.R., Koretz, Daniel, Louis, Thomas, and Hamilton, Laura. Models for Value-Added Modeling of Teacher Effects. Journal of Educational and Behavioral Statistics, Vol. 29, No. 1, Value Added Assessment Special Issue, 67-101, Spring, 2004.
McCaffrey, D., Sass, T., Lockwood, J., and Mihaly, K. The Intertemporal Variability of Teacher Effect Estimates. Education Finance and Policy, Vol. 4, No. 4, 572-606, Fall, 2009.