The correlation between random temperatures and Value Added Scores

Posted on November 2, 2012


I think most people would agree that it makes sense to evaluate people on the job they are doing. I think it is safe to say most people appreciate valuable feedback and constructive criticism. (Well probably not my wife, or myself so much, but stick with me here.) Ideally, if we are going to be on the receiving end of a critique, we would all prefer that the evaluation be both impartial and accurate, and that it be delivered in a way that helps us improve our work product. Louisiana has adopted a method of evaluating teachers called Value Added Modeling, or VAM for short. This is a very complex system of evaluating teacher performance through the performance of their students – on a few specific subjects and tests. If you’re really bored and really statistically savvy you can review a self-analysis of this system by its creators at the end of this post. If you’d like to defer to my analysis just keep reading.

In layman’s terms, the idea behind this system is trying to figure out what a student would have scored with a shitty teacher, average teacher and awesome teacher, based on their previous performance. Once we figure out what those expected scores should be for each student, we check their actual scores and link those results to their actual teacher(s). If a student scores in the awesome range, their teacher must be awesome – by association.

If they score in the less than awesome range, well, then their teacher, or at least their teacher’s abilities, look something like this.

So in short. . . If this evaluation system were based only on raw student scores, teachers with the good students would be awesome and teachers with the bad students would be crap. To level the playing field, a Child Psychologist named George Noell developed a mathematical model that is supposed to factor in the quality of a teacher’s raw material. Using historical data, trends and complicated averaging methods, he extrapolated what a given student’s average score should be if they were given an average teacher. If a student does better than this “expected average score increase” you are a better than average teacher. If the student does worse, you negatively impacted your student with your crapitude. Congratulations.

Scores and tests have all sorts of different ranges, so to make the scores look comparable through the ages Noell employs some mathematical tricks to normalize the scores with a mean in the 300 range.

For example: add the test scores, 100 + 150 + 80 = 330. Divide the sum by the number of scores, 330 / 3 = 110, to get the mean. Now divide 300 by 110 to get a multiplying factor of about 2.73. Multiply all scores by that factor and our scores become roughly 273, 409 and 218, where 300 is considered the average for all test scores (for a given subgroup).

“Additional work was conducted to complete the datasets. Student achievement scores were re-standardized to mean of 300 and standard deviation of 50 across grade and promotional paths. These values were selected because they closely approximate the typical mean and standard deviation of Louisiana’s assessments across grades and years.” (from Feb Final Value added report)
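The rescaling can be sketched in a few lines of Python. This is only an illustration, not the code used for the report: the function names `scale_mean_only` and `restandardize` are made up for this post, and the use of a population standard deviation is my assumption. The first function reproduces the simple multiply-by-a-factor example above; the second does what the quoted report describes, fixing both the mean (300) and the standard deviation (50).

```python
from statistics import mean, pstdev


def scale_mean_only(scores, target_mean=300.0):
    """Multiply every score by one factor so the mean lands on target_mean."""
    factor = target_mean / mean(scores)
    return [s * factor for s in scores]


def restandardize(scores, target_mean=300.0, target_sd=50.0):
    """Shift and stretch scores so they have the target mean and SD.

    Assumption: population standard deviation (pstdev). The report does
    not say which variant was used.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("all scores are identical; cannot standardize")
    return [target_mean + target_sd * (s - mu) / sigma for s in scores]


raw_scores = [100, 150, 80]          # the three scores from the example above
mean_only = scale_mean_only(raw_scores)
standardized = restandardize(raw_scores)

print([round(s) for s in mean_only])
print([round(s, 1) for s in standardized])
```

Both versions keep the students in the same order. Only the scale changes, which is why this step is harmless on its own.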

No biggie there. That was just done to make the numbers look similar and make them easier to graph and to maybe confuse a few people. Now, to account for special circumstances known to have a relationship to test scores, Noell ran some comparisons using groups of students with, say, severe mental disabilities to see if his three-year projection model was as good at predicting outcomes as other demographic factors.

Indicator codes were used to identify students who were identified as members of the following special education disability groups: emotionally disturbed, specific learning disabled, mildly mentally disabled, speech/language disabled, other health impaired, or other special education disability. Additionally, indicator codes were used for limited English proficiency, Section 504 status, gender, receive free lunch, receive reduced lunch, and ethnicity classification (each ethnic category received its own indicator code).

He found that in some cases and for some demographics they seemed pretty comparable. He also noticed some fairly significant differences. For instance:

The implication of removing special education disabilities information is more substantial. For some teachers, the change in estimate would be large. The proportion of teachers for whom the change will have an impact (small or large) is much greater than for any other variable considered. Finally and most importantly, the impact of excluding this variable will be highly systematic in that it will primarily impact teachers with a high proportion of students with disabilities.

What Noell also theorized based on his data sample would seem pretty obvious to most people, but I think it bears repeating. Quality of data matters. Bad data means inaccurate, perhaps even opposite results.

It is important to note that the first full statewide deployment of the CVR occurred in spring 2010. The comparative analyses between years described below are based on unverified rosters for 2007-2008 and 2008-2009. It is the authors’ hypothesis that when two years of verified rosters are available, the relationship between consecutive years may be strengthened as error variance associated with inaccurate student-teacher links is removed.

I think it’s worth noting that we should expect the quality of this data to be getting worse, not better. This is in no small part due to the failure of the Louisiana Department of Education to maintain data validation and collection staff at the levels they had in the years of this pilot. In 2010 Data Management had a staff of approximately 12 people. This staff has been reduced to approximately 3 people (a 75% cut) who handle even more systems and data collections, and the defections include all senior staff for these systems. The Accountability and testing area has been similarly impacted, and as a result much of the raw testing data used to assign these scores is highly suspect. (Feel free to ask any school district’s testing and accountability liaison to verify a dramatic decrease in quality and timeliness of testing data.)

Now this brings us to the actual accuracy (or stability) of the model employed by Noell. According to an evaluation performed by Wayne Free, Assistant Executive Director, Louisiana Association of Educators, this model has an error rate (as defined by variable classifications of teachers in different categories based on identical teaching methods but different students) of close to 75%.

VAM as it is being used currently has approximately a 75% error rate (73.2% in Math, 77.7% in ELA) at the bottom 10% level and approximately 57% error rate (54.2 in Math, 62.5% in ELA) at the top 10% level based on the Validity numbers in the department’s report.
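To get a feel for what numbers like these mean, here is a small simulation sketch. It is not Noell’s model and not Wayne Free’s calculation. It assumes each teacher’s score is a standard normal draw, that two consecutive years are linked by a single made-up `correlation` value, and that the “error rate” above can be read as one minus the share of bottom-10% teachers who land in the bottom 10% again the next year. The function name `persistence_rate` is hypothetical.

```python
import random


def persistence_rate(correlation, n_teachers=20000, tail=0.10, seed=0):
    """Share of year-1 bottom-`tail` teachers who are bottom-`tail` again in year 2.

    Assumption: scores in both years are standard normal and related only
    through `correlation`; nothing about real VAM data is modeled here.
    """
    rng = random.Random(seed)
    noise_weight = (1.0 - correlation ** 2) ** 0.5
    year1, year2 = [], []
    for _ in range(n_teachers):
        first = rng.gauss(0.0, 1.0)
        second = correlation * first + noise_weight * rng.gauss(0.0, 1.0)
        year1.append(first)
        year2.append(second)

    k = int(n_teachers * tail)  # number of teachers in the bottom tail
    bottom1 = set(sorted(range(n_teachers), key=lambda i: year1[i])[:k])
    bottom2 = set(sorted(range(n_teachers), key=lambda i: year2[i])[:k])
    return len(bottom1 & bottom2) / k


if __name__ == "__main__":
    for r in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"correlation {r:.2f}: persistence {persistence_rate(r):.3f}")
```

With a correlation of 0 the scores are pure noise and persistence sits near the 10% you would get by drawing names from a hat; with a correlation of 1 it is 100%. Under the reading assumed here, an error rate of roughly 75% corresponds to a persistence of roughly 25%.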

From: George Noell
Sent: Thu 5/10/2012 10:20 AM
To: Free, Wayne [LA]
Subject: RE: Report to legislative education committees

Answers are below, based on the mathematics data (exact numbers vary slightly between content areas).

1. If a teacher scores in the lowest 10% of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the lowest 10% the second year and remain “ineffective”.


2. If a teacher scores in the lowest 10 – 20% range of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the lowest 10% the second year and become “ineffective”.


4. If a teacher scores in the highest 10% of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the highest 10% the second year and remain “highly effective”.


5. If a teacher scores in the highest 10 – 20% of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the highest 10% the second year and become “highly effective”.


7. I guess what I’m actually asking is what is the stability range across years based on a 10% differential each year and not the top to bottom analysis given in the report

Numbers are above.

Hope that helps.
George Noell, PhD, BCBA
Department of Psychology
Louisiana State University

Teachers were not given these scores and were not counseled on them, so we can conclude they did not alter their teaching methods based on the results. A teacher can be in the bottom 10% one year and be ranked in the top 10% the next year while doing nothing different! Is it reasonable to assume this is merely based on teaching skills, or is it more likely that this model does not account for enough variables to be reliable for evaluating teachers fairly? To me, the previous chart provided by Wayne Free shows the tragedy and absurdity that is VAM. Despite all the pretty numbers and fancy modeling, the results are not much better than random.

Value Added scores are about as accurate as guessing what the temperature will be next year in Louisiana based only on these statistics.

Louisiana has a relatively constant semitropical climate. Rainfall and humidity decrease, and daily temperature variations increase, with distance from the Gulf of Mexico. The normal daily temperature in New Orleans is 68°F (20°C), ranging from 52°F (11°C) in January to 82°F (28°C) in July. The all-time high temperature is 114°F (46°C), recorded at Plain Dealing on 10 August 1936; the all-time low, –16°F (–27°C), was set at Minden on 13 February 1899. New Orleans has sunshine 60% of the time, and the average annual rainfall (1971–2000) was 64.2 in (163 cm). Snow falls occasionally in the north, but rarely in the south.

Now you know the temperature range. So guess. Of course your guess would be much better if I gave you a specific day, time and region, and told you whether it was raining or sunny. However, like Value Added scores, these are just averages. I can give you all of those information points and something can still happen that you hadn’t anticipated. Just recently Superstorm Sandy barreled into the Northeast, bringing deluges and blizzard-like conditions in October. If I had given you this type of information six months ago and asked you to estimate a temperature in North Carolina (where my father told me it was snowing), Sandy would have thrown off almost everyone’s estimate. You would have been misled by history and averages. This is the same flaw that VAM suffers from: an overreliance on history and averaging.

No prediction model ever invented can account for every variable. There will always be Sandys. There will always be students who are injured in car accidents in the middle of a school year, throwing off their three-year expectation. There will be students with parents going through divorces, bankruptcies and homelessness, and students enduring abuse from their peers, strangers or family members. There will be students who simply decide to go through an anti-social Goth stage when they get to high school for no other reason than they saw something cool on YouTube, and there will always be students who transfer from out of state with no historical data to build an accurate projection, even by Noell’s standards.

Please remember that we are not just talking about numbers. We are talking about the tens of thousands of educators who make up those averages. Most of them try their very best to do a good job, just like you and me, and many of them are being libeled and labeled as “bad” teachers by an absurdly flawed measurement system that may only accurately identify ineffective teachers 25% of the time. We are talking about hundreds of thousands of students who are more than just test scores. They are real children with real problems, and they need their teachers engaged not just in their math and reading scores, but in the whole child. (Wouldn’t you think a Child Psychologist would understand that?) Many of these children spend more time with their teachers than with their own parents, and many teachers are like another parent or mentor to the children in their class and in their care. Value Added dehumanizes our children and our teachers. We are not the sum of our demographics. We are not a projection on a sterile chart, or lines on a graph. We are not our math and reading scores, and you dishonor every good teacher you ever had if you believe VAM for one minute captures everything they were to you and everything they are to your children.

There’s a reason private schools would never allow this in their schools. Why do we allow it in ours?
