Steven Weinberg, a retired Oakland middle school teacher, recommends a book that exposes grading practices on the writing portion of standardized tests — written by an insider.
One of the delights of retirement is that I finally have enough time to read. This week I discovered a new book that ought to be read by everyone involved with standardized tests — and in today’s environment, that means practically everybody.
“Making the Grades: My Misadventures in the Standardized Testing Industry” by Todd Farley (2009) gives a detailed description of what actually happens to the writing sample portions of standardized tests when they are sent to testing companies for scoring. Although the book is written in an amusing style (the first 150 pages kept me in stitches, and when I read passages aloud to my wife, we both laughed until we could barely talk), the message is serious: Testing companies care only about making a profit and will cut any corner and ignore their own guidelines to do so.
Farley began his career in the testing industry in 1994, earning $10 an hour as a grader. He ended it in 2008, when he was pulling in nearly $100,000 for six months’ work as a “consultant.” He worked on many state exams (including California’s) and the National Assessment of Educational Progress. Along the way he created writing samples, put together the rubrics used to score them, selected the sample papers to use for training scorers, hired scorers, trained them, and supervised them during the scoring.
The one area in which he had no experience was firing scorers, because, as he vividly shows in the book, no scorer was ever so incompetent that the testing industry could afford to fire them and slow down the scoring process. Farley explains how he learned from more experienced colleagues how to falsify scoring data to make it appear that all scorers were doing an acceptable job.
Farley tells of test scorers who arrived each morning hung over and didn’t make much progress on their daily quotas until they had taken several “fortified” coffee breaks; test scorers who were far more interested in studying for the bar exam than in carefully reading the papers they were scoring; and foreign-born scorers who could not understand simple English.
These stories are funny until one realizes that students’ fates hung in the balance. The results of some of these tests, like the California High School Exit Exam, may determine a young person’s future. Others, like the fourth and seventh grade writing samples of the California Standards Tests, may help decide which schools are closed and which stay open.
Even sadder than the lack of competence of many $10/hour scorers is the complicity of state departments of education in these farces. Instead of demanding that testing companies score accurately, the state departments themselves intervene to reshape student scores to meet their preconceived ideas of what the scores should look like. Farley tells of one state official who does not like the distribution of scores and orders a change in the rubric in the middle of the scoring (half the papers were judged by one set of rules, the second half by another). For everyone in the process, deadlines are far more important than quality.
My own experiences examining the results of standardized tests confirm the main points of Farley’s criticism. In the early years of the California Standards Tests, the fourth and seventh grade writing samples were scored by two different scorers. Both scorers were using the same 4-point rubric, and the scores from the two were added together. If they both scored the paper a 3, it would get 6 points. If they both thought it was a 2, it would get 4 points. If one thought it was a 3, and the other thought it was a 2, the paper received 5 points. If the scorers were trained well on a clear rubric, you would expect that most of the test scores would be even, scores that show that both scorers agreed. In fact, there were a huge number of 3’s, 5’s, and 7’s, which showed that the scoring was not consistent.
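The arithmetic behind that observation can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and I am assuming the 4-point rubric runs from 1 to 4. It lists every pair of scores two scorers could give and shows that an odd total can only come from a disagreement. An even total, by contrast, may reflect agreement, though it can also hide a 2-point gap (a 1 and a 3 also add up to 4).

```python
from itertools import product

# Assumption: the 4-point rubric uses whole scores 1 through 4.
RUBRIC = range(1, 5)

def combined_score(first, second):
    """Total reported for a paper read by two scorers."""
    return first + second

def scorers_certainly_disagreed(total):
    """An odd total is impossible when both scorers give the same score."""
    return total % 2 == 1

# Group every possible pair of scores by the total it produces.
pairs_by_total = {}
for first, second in product(RUBRIC, repeat=2):
    total = combined_score(first, second)
    pairs_by_total.setdefault(total, []).append((first, second))

for total in sorted(pairs_by_total):
    label = "disagreement" if scorers_certainly_disagreed(total) else "agreement possible"
    print(total, label, pairs_by_total[total])
```

Run as written, the loop prints one line per possible total from 2 to 8, and every pair listed under 3, 5, or 7 contains two different scores.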
Several years ago the state decided to save money and eliminate the second scoring of each paper. Now only one scorer grades a paper, and that score is multiplied by 2. The inconsistency of scorers probably remains, but there is no way to see it anymore. Now some students are either rewarded by 2 extra points if they have an easy scorer or penalized by 2 points if they have a strict one. It is easy to say that a few points here and there really don’t matter, but there are students denied diplomas based on a single point. Several years ago one of the largest high schools in Oakland was moved to a more punitive level of Program Improvement, which would have been avoided if only two students had received higher scores.
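To make the effect of doubling concrete, here is a small Python sketch. The function names and the example scores (one scorer who would give a 3, another who would give a 2) are my own illustration, not figures from the state. Under the old system the two scores were added; under the new one, whichever scorer happens to read the paper has the score doubled.

```python
def two_scorer_total(first, second):
    """Old system: two scorers read the paper and their scores are added."""
    return first + second

def single_scorer_total(score):
    """New system: one scorer reads the paper and the score is doubled."""
    return 2 * score

# Hypothetical paper: an easy scorer would give it a 3, a strict one a 2.
easy, strict = 3, 2

old_total = two_scorer_total(easy, strict)
new_total_easy = single_scorer_total(easy)
new_total_strict = single_scorer_total(strict)

# The same paper now lands 2 points apart depending on who reads it.
swing = new_total_easy - new_total_strict
print(old_total, new_total_easy, new_total_strict, swing)
```

In this example the old system would have reported a 5, while the new one reports either a 6 or a 4, and nothing in the reported number reveals which kind of scorer the student drew.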
In California it is illegal for anyone at schools to read or score student work on writing samples that are part of the CST or Exit Exam. It is illegal to copy the student writing, and the student writing is never returned to the student, the parents, or the school. This is justified as a security measure, but since the topics students write on are not reused, that cannot be the real reason. This regulation exists solely to prevent anyone from being able to question the scores given by the testing company.
When I taught, I was accountable for the accuracy of the grade I gave every assignment or test. Papers were always returned, and any student or parent could challenge the grade and demand an explanation. Why should testing companies be protected from this kind of accountability?
For those of us who have worked in schools and have been charged with improving student scores, the most horrifying part of the book is Chapter 6, in which Farley tells of being part of a group of experts creating the scoring guide for the 1998 National Assessment of Educational Progress (NAEP). These experts developed a scoring guide and used it to score student papers for a week. At that point they were told that the psychometricians (those with degrees in statistics) reviewing their work decided they were not giving enough low scores.
These psychometricians did not read a single paper. They reached their conclusion by comparing the distribution of scores to the distribution of scores they expected to see when the tests were written. Was there an outcry from the English experts who had actually read and scored the exams? No. As Farley described it, they quickly adjourned to re-score the papers to match the distribution the stat people wanted. In other words, our education system is being judged by the results of tests on which the distribution of scores is created by statisticians before a single test answer is read.
Using that same logic, the officials at the Super Bowl would have ignored the score of the game and awarded the trophy to the Colts because that is what everyone predicted.
“Making the Grades” reveals some shocking truths about how we are judging schools and students; Farley’s book deserves a wide readership.