I think most people would agree that it makes sense to evaluate people on the job they are doing. I think it is safe to say most people appreciate valuable feedback and constructive criticism. (Well, probably not my wife, or myself so much, but stick with me here.) Ideally, if you are going to be on the receiving end of a critique, you would prefer that the evaluation is both impartial and accurate, and that it is delivered in such a way that it helps you improve your work product. Louisiana has adopted a method of evaluating teachers called Value Added Modeling, or VAM for short. This is a very complex system of evaluating teacher performance through the performance of their students – on a few specific subjects and tests. If you’re really bored and really statistically savvy, you can review a self-analysis of this system by its creators at the end of this post. If you’d like to defer to my analysis, just keep reading.

In layman’s terms, the idea behind this system is to figure out what a student would have scored with a shitty teacher, an average teacher, and an awesome teacher, based on their previous performance. Once we figure out what those expected scores should be for each student, we check their actual scores and link those results to their actual teacher(s). If a student scores in the awesome range, their teacher must be awesome – by association.

If they score in the less-than-awesome range, well, then their teacher, or at least their teacher’s abilities, must look something like this.

So, in short . . . if this evaluation system were based just on raw student scores, teachers with the good students would be awesome and teachers with the bad students would be crap. To level the playing field, a Child Psychologist named George Noell developed a mathematical model that is supposed to factor in the quality of a teacher’s raw material. Using historical data, trends, and complicated averaging methods, he extrapolated what a given student’s score should be if they were given an average teacher. If a student does better than this “expected average score increase,” you are a better-than-average teacher. If the student does worse, you negatively impacted your student with your crapitude. Congratulations.
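To make the mechanics concrete, here is a deliberately oversimplified sketch of the idea in Python. To be clear, this is my own toy illustration, not Noell’s model: his version uses up to three years of history, far more elaborate statistical machinery, and the demographic factors discussed below. The numbers and teacher names are made up.

```python
# A toy illustration of the value-added idea -- NOT Noell's actual model.
# Assumption: each student's "expected" score is predicted from last year's
# score with a simple linear fit; a teacher's "value added" is how far, on
# average, their students beat (or miss) that expectation.
import numpy as np

# Hypothetical students: last year's score, this year's score, and teacher
prior   = np.array([310, 280, 350, 260, 330, 300])
actual  = np.array([320, 270, 355, 250, 345, 310])
teacher = np.array(["Smith", "Smith", "Smith", "Jones", "Jones", "Jones"])

# "Expected" score: a simple linear fit of this year's score on last year's
slope, intercept = np.polyfit(prior, actual, 1)
expected = slope * prior + intercept

# Average each teacher's students' residuals (actual minus expected)
for name in np.unique(teacher):
    mask = teacher == name
    print(name, round(float(np.mean(actual[mask] - expected[mask])), 1))
```

A positive number means your students beat their projections, so you must be better than average; a negative number means crapitude, at least as far as the model is concerned.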

Scores and tests have all sorts of different ranges, so to make the scores look comparable across grades and years, Noell employs some mathematical tricks to normalize them to a scale where the mean is around 300.

For example: add the test scores 100 + 150 + 80 = 330. Divide the sum by the number of scores, 330 / 3 = 110, to get the mean. Now divide 300 by 110 to get a multiplying factor of roughly 2.72. Multiply every score by that factor and our scores become 272, 408, and 218 on a scale where 300 is considered average for all test scores (for a given subgroup).

“Additional work was conducted to complete the datasets. Student achievement scores were re-standardized to mean of 300 and standard deviation of 50 across grade and promotional paths. These values were selected because they closely approximate the typical mean and standard deviation of Louisiana’s assessments across grades and years.” (from Feb Final Value added report)
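The report’s version is a standard z-score transform rather than the single multiplying factor in my simplified example: shift and stretch the scores so the mean lands at 300 and the standard deviation at 50. A minimal sketch of that transformation (my own code, not the department’s):

```python
import numpy as np

def restandardize(scores, new_mean=300.0, new_sd=50.0):
    """Rescale raw scores to the report's mean-300 / SD-50 scale."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std()  # z-scores: mean 0, SD 1
    return new_mean + new_sd * z

# The three raw scores from the simplified example above
print(restandardize([100, 150, 80]).round(1))  # values now centered on 300
```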

No biggie there. That was just done to make the numbers look similar, make them easier to graph, and maybe confuse a few people. Next, to account for special circumstances known to have a relationship to test scores, Noell ran some comparisons using groups of students with, say, severe mental disabilities, to see if his three-year projection model was as good at predicting outcomes as other demographic factors.

Indicator codes were used to identify students who were identified as members of the following special education disability groups: emotionally disturbed, specific learning disabled, mildly mentally disabled, speech/language disabled, other health impaired, or other special education disability. Additionally, indicator codes were used for limited English proficiency, Section 504 status, gender, receive free lunch, receive reduced lunch, and ethnicity classification (each ethnic category received its own indicator code).
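For readers who don’t speak statistics, “indicator codes” are just 0/1 dummy columns, one per category. Here is roughly what that coding looks like in Python; the column names and categories are my own shorthand, not the LDE’s actual fields:

```python
import pandas as pd

# Hypothetical roster rows; field names and categories are illustrative only
students = pd.DataFrame({
    "student_id": [101, 102, 103],
    "disability": ["none", "specific_learning_disabled", "speech_language"],
    "lunch":      ["free", "none", "reduced"],
    "ethnicity":  ["A", "B", "A"],
})

# Each category becomes its own 0/1 indicator column for the model
indicators = pd.get_dummies(students, columns=["disability", "lunch", "ethnicity"])
print(indicators)
```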

He found that in some cases and for some demographics they seemed pretty comparable. He also noticed some fairly significant differences. For instance:

The implication of removing special education disabilities information is more substantial. For some teachers, the change in estimate would be large. The proportion of teachers for whom the change will have an impact (small or large) is much greater than for any other variable considered. Finally and most importantly, the impact of excluding this variable will be highly systematic in that it will primarily impact teachers with a high proportion of students with disabilities.

What Noell also theorized based on his data sample would seem pretty obvious to most people, but I think it bears repeating: the quality of the data matters. Bad data means inaccurate, perhaps even opposite, results.

It is important to note that the first full statewide deployment of the CVR occurred in spring 2010. The comparative analyses between years described below are based on unverified rosters for 2007-2008 and 2008-2009. It is the authors’ hypothesis that when two years of verified rosters are available, the relationship between consecutive years may be strengthened as error variance associated with inaccurate student-teacher links is removed.

I think it’s worth noting that we should expect the quality of this data to be getting worse, not better. This is in no small part due to the failure of the Louisiana Department of Education to maintain data validation and collection staff at the same levels they had in the years of this pilot. In 2010, Data Management had a staff of approximately 12 people. That staff has since been reduced to approximately 3 people (a 75% reduction) to handle even more systems and data collections, and the defections include all senior staff for these systems. The Accountability and Testing area has been similarly impacted, and as a result much of the raw testing data used to assign these scores is highly suspect. (Feel free to ask any school district’s testing and accountability liaison to verify a dramatic decrease in quality and timeliness of testing data.)

Now this brings us to the actual accuracy (or stability) of the model employed by Noell. According to an evaluation performed by Wayne Free, Assistant Executive Director of the Louisiana Association of Educators, this model has an error rate (defined as teachers landing in different categories based on identical teaching methods but different students) of close to 75%:

VAM as it is being used currently has approximately a 75% error rate (73.2% in Math, 77.7% in ELA) at the bottom 10% level and approximately 57% error rate (54.2% in Math, 62.5% in ELA) at the top 10% level based on the Validity numbers in the department’s report.

From: George Noell
Sent: Thu 5/10/2012 10:20 AM
To: Free, Wayne [LA]
Subject: RE: Report to legislative education committees

Wayne,
Answers are below, based on the mathematics data (exact numbers vary slightly between content areas).

1. If a teacher scores in the lowest 10% of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the lowest 10% the second year and remain “ineffective”.

26.8%

2. If a teacher scores in the lowest 10 – 20% range of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the lowest 10% the second year and become “ineffective”.

14.8%

4. If a teacher scores in the highest 10% of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the highest 10% the second year and remain “highly effective”.

45.8%

5. If a teacher scores in the highest 10 – 20% of the VAM score the first year and does nothing different the next year what is the likelihood they will fall in the highest 10% the second year and become “highly effective”.

22.1%

7. I guess what I’m actually asking is what is the stability range across years based on a 10% differential each year and not the top to bottom analysis given in the report

Numbers are above.

Hope that helps.
George
_______________________
George Noell, PhD, BCBA
Professor
Department of Psychology
Louisiana State University

Teachers were not given these scores and were not counseled on these scores, so we can conclude they did not alter their teaching methods based on these results. A teacher can be in the bottom 10% one year and be ranked in the top 10% the next year doing nothing different!  Is it reasonable to assume this is merely based on teaching skill, or is it more likely that this model does not account for enough variables to be reliable for evaluating teachers fairly? To me, the exchange above between Wayne Free and George Noell shows the tragedy and absurdity that is VAM. Despite all the pretty numbers and fancy modeling, the results are not much better than random.
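For anyone who wants to check the arithmetic, the error rates Wayne Free cites are just one minus the stability figures in Noell’s reply. A quick check of my own:

```python
# Stability figures from Noell's email above (mathematics data)
bottom_decile_stays = 0.268  # 26.8% remain "ineffective" the next year
top_decile_stays    = 0.458  # 45.8% remain "highly effective" the next year

# With nothing about the teaching changed, everyone else gets re-labeled
print(f"Bottom-decile label changes the next year: {1 - bottom_decile_stays:.1%}")  # 73.2%
print(f"Top-decile label changes the next year:    {1 - top_decile_stays:.1%}")     # 54.2%
```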

Value Added scores are about as accurate as guessing what the temperature will be next year in Louisiana based only on these statistics.

Louisiana has a relatively constant semitropical climate. Rainfall and humidity decrease, and daily temperature variations increase, with distance from the Gulf of Mexico. The normal daily temperature in New Orleans is 68°F (20°C), ranging from 52°F (11°C) in January to 82°F (28°C) in July. The all-time high temperature is 114°F (46°C), recorded at Plain Dealing on 10 August 1936; the all-time low, –16°F (–27°C), was set at Minden on 13 February 1899. New Orleans has sunshine 60% of the time, and the average annual rainfall (1971–2000) was 64.2 in (163 cm). Snow falls occasionally in the north, but rarely in the south.

Now you know the temperature range, so guess. Of course, your guess would be more accurate if I gave you a specific day, time, and region, and told you whether it was raining or sunny. However, like Value Added scores, these are just averages. I can give you all of those information points and something can still happen that you hadn’t anticipated. Just recently, Superstorm Sandy barreled into the Northeast, bringing deluges and blizzard-like conditions in October. If I had given you this type of information 6 months ago and asked you to estimate a temperature in North Carolina (where my father told me it was snowing), Sandy would have thrown off almost everyone’s estimate. You would have been misled by history and averages. These are the same flaws that VAM suffers from – an overreliance on history and averaging.
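To put the analogy in numbers, using only the figures from the climate summary quoted above: if your “model” is to always guess the long-run mean of 68°F, a perfectly ordinary January or July day already throws you off by 14 to 16 degrees, and a record-setting day misses by 46 to 84. A throwaway sketch:

```python
# Guessing strategy: always predict the long-run New Orleans mean of 68 F
GUESS = 68

# Figures pulled straight from the climate summary quoted above
days = {
    "typical January day": 52,
    "typical July day": 82,
    "record low (Minden, 1899)": -16,
    "record high (Plain Dealing, 1936)": 114,
}

for label, actual_temp in days.items():
    print(f"{label}: off by {abs(actual_temp - GUESS)} degrees")
```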

No prediction model ever invented can account for every variable. There will always be Sandys. There will always be students who are injured in car accidents in the middle of a school year, throwing off their 3-year expectation. There will be students with parents going through divorces, bankruptcies, and homelessness, and students enduring abuse from their peers, strangers, or family members. There will be students who simply decide to go through an anti-social Goth stage when they get to high school for no other reason than they saw something cool on YouTube, and there will always be students who transfer in from out of state with no historical data to build an accurate projection – even by Noell’s standards.

Please remember that we are not just talking about numbers. We are talking about the tens of thousands of educators who make up those averages, most of whom try their very best to do a good job, just like you and me, and many of whom are being libeled and labeled as “bad” teachers by an absurdly flawed measurement system that may accurately identify ineffective teachers only 25% of the time. We are talking about hundreds of thousands of students who are more than just test scores. They are real children with real problems, and they need their teachers engaged not just in their math and reading scores but in the whole child. (Wouldn’t you think a Child Psychologist would understand that?) Many of these children spend more time with their teachers than with their own parents, and many teachers are like another parent or mentor to the children in their class and in their care. Value Added dehumanizes our children and our teachers. We are not the sum of our demographics. We are not a projection on a sterile chart, or lines on a graph. We are not our math and reading scores, and you dishonor every good teacher you ever had if you believe for one minute that VAM captures everything they were to you and everything they are to your children.

There’s a reason private schools would never allow this in their schools. Why do we allow it in ours?

5 thoughts on “The correlation between random temperatures and Value Added Scores”

  1. I don’t get the math and don’t want to waste the time it would take for me to get my PhD in math or statistical research in order to do so. But my experience teaching gifted 7th graders is enough for me to know that VAM smells like a cow patty.

    Now that I have your insightful and graphic analysis, I can approach our King for a Day Superintendent White, BESE members, and legislators at a level even they can understand with an analogy they are familiar with (cow patties).

    One more request though. Can you transform this blog into a PowerPoint or YouTube presentation? I would like to share with friends and foes nationwide! If not, can I have your permission to do so myself? Thanks

  2. You may do with it what you will. I just ask that you fix my typos and have a math/stat person check my math. 🙂 I tried to simplify a few of the key parts but would hate it if a simple term misuse invalidated a presentation.

    This is just one angle of critique. There are several other directions this can be debunked from that I may have only touched on briefly.

    All else equal except quality of data and student composition – scores changed quite a bit. You want to look at top and bottom changes in particular. This strongly suggests those factors are being identified, not a teacher’s ability.

    No independent verification of whether teachers identified as good or bad were good or bad as ranked by another independent metric. (Were teachers identified as good or bad also identified as such by principals, peers, and students?)

    The system was developed to identify program effectiveness based on trends and averages. The data was known to be bad on a per-teacher basis, but the model relied on larger sample sizes to obscure erroneous results. Those erroneous results are mislabeled people.

    High-achieving students have limited opportunities for “growth” (i.e., teachers can’t add beyond 100%). This can label teachers of high-performing students as bad, because they can’t be good by definition. This is noted in Noell’s work and anecdotally verified by numerous recent articles.

    Really low-performing or disabled students can bring down an entire class’s score. (Noted by Noell)

    Teachers have no control over data quality. When I brought up this issue at DOE, I was told districts needed to send good data or they deserved what they got, and this would teach them. The ones who get burned are the teachers, not the ones submitting the data.

    Shadow schools are being used to hide students, unbeknownst to teachers.

    Teachers can verify rosters during the CVR correction period, but not all the demographics included in the calculation – only students and class codes.

    Many districts submit incomplete attendance and discipline data (used as factors), and the LDE does not address it.

    Poverty and ethnicity appear to be excluded as factors for political reasons, but they may be the strongest indicators of future performance.

    The wealth and quality of school districts are not considered. Some districts are terrible and some are awesome due to funding and facilities. Students transferring to different districts may experience significant growth or decline due to environment alone.

    Etc. There are so many issues and flaws it’s hard to capture and describe them all in a single post.

  3. I’m a statistician, and I looked at this in just enough detail to see that it seems to be a case of giving way too much importance to an imprecise measure. It sounds like the people who came up with the VAM thought hard about it, but couldn’t account for all the complexities. Those complexities can be very important to individual teachers, and to the system as a whole. One of the biggest errors made with statistics is to “look for your keys under the streetlamp, because that’s where there’s light.” We measure what is easy to measure and then assume that it measures what we wished we could measure.

    I’m not against teacher evaluation, but I am against bad teacher evaluation…

    1. Thanks for stopping by and commenting Alan.

      I think that is what I, and millions of teachers, are trying to say. There are some things that are easy to measure and some things that are immeasurable or unknowable, and many in between. This system might be suitable for making some broad generalizations (that’s actually what it was originally designed to do).

      Not all things can be accounted for in complex systems – like human beings. Rather than admit that they developed the best system they could come up with, which still fell far short of the lofty goals they wanted to achieve, they claimed this was the holy grail of evaluation systems and that all teachers can be sorted out as easily as a few different-colored M&Ms in a small box.

      And now millions of teachers and students will suffer and be the worse for it.
