Book review: Making the Grades

Steven Weinberg, a retired Oakland middle school teacher, recommends a book that exposes grading practices on the writing portion of standardized tests — written by an insider.

Steven Weinberg

One of the delights of retirement is that I finally have enough time to read. This week I discovered a new book that ought to be read by everyone involved with standardized tests — and in today’s environment, that means practically everybody.

“Making the Grades: My Misadventures in the Standardized Testing Industry” by Todd Farley (2009) gives a detailed description of what actually happens to the writing sample portions of standardized tests when they are sent to testing companies for scoring. Although the book is written in an amusing style (the first 150 pages kept me in stitches, and when I read passages aloud to my wife, we both laughed until we could barely talk), the message is serious: Testing companies care only about making a profit and will cut any corner and ignore their own guidelines to do so.

Farley began his career in the testing industry in 1994, earning $10 an hour as a grader. He ended it in 2008, when he was pulling in nearly $100,000 for six months’ work as a “consultant.” He worked on many state exams (including California’s) and the National Assessment of Educational Progress. Along the way he created writing samples, put together the rubrics used to score them, selected the sample papers to use for training scorers, hired scorers, trained them, and supervised them during the scoring.

The one area in which he had no experience was firing scorers, because, as he vividly shows in the book, no scorer was ever so incompetent that the testing industry could afford to fire them and slow down the scoring process. Farley explains how he learned from more experienced colleagues how to falsify scoring data to make it appear that all scorers were doing an acceptable job.

Farley tells of test scorers who arrived each morning hung over and didn’t make much progress on their daily quotas until they had taken several “fortified” coffee breaks; test scorers who were far more interested in studying for the bar exam than in carefully reading the papers they were scoring; and foreign-born scorers who could not understand simple English.

These stories are funny until one realizes that students’ fates hung in the balance. The results of some of these tests, like the California High School Exit Exam, may determine a young person’s future. Others, like the fourth and seventh grade writing samples of the California Standards Tests, may help decide which schools are closed and which stay open.

Even sadder than the lack of competence of many $10/hour scorers is the complicity of state departments of education in these farces. Instead of demanding that testing companies score accurately, the state departments themselves intervene to reshape student scores to meet their preconceived ideas of what the scores should look like. Farley tells of one state official who does not like the distribution of scores and orders a change in the rubric in the middle of the scoring (half the papers were judged by one set of rules, the second half by another). For everyone in the process, deadlines are far more important than quality.

My own experiences examining the results of standardized tests confirm the main points of Farley’s criticism. In the early years of the California Standards Tests, the fourth and seventh grade writing samples were scored by two different scorers. Both scorers used the same 4-point rubric, and the scores from the two were added together. If they both scored the paper a 3, it would get 6 points. If they both thought it was a 2, it would get 4 points. If one thought it was a 3 and the other thought it was a 2, the paper received 5 points. If the scorers were trained well on a clear rubric, you would expect most of the combined scores to be even, since an even total is what results when both scorers agree. In fact, there were a huge number of 3’s, 5’s, and 7’s, totals that can occur only when the two scorers disagree, which showed that the scoring was not consistent.
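The arithmetic behind that observation can be checked with a short sketch. The Python below is only an illustration, not anything the state or the testing companies used; the names `RUBRIC` and `pairs_by_total` are made up for this example, and it assumes the rubric awards whole-number scores from 1 to 4.

```python
from itertools import product

# Assumption: the 4-point rubric awards whole-number scores 1 through 4.
RUBRIC = range(1, 5)

# Group every possible pair of scores (first scorer, second scorer)
# by the combined total the paper would receive.
pairs_by_total = {}
for first, second in product(RUBRIC, repeat=2):
    pairs_by_total.setdefault(first + second, []).append((first, second))

# For each total, report whether it could have come from two matching scores.
for total in sorted(pairs_by_total):
    pairs = pairs_by_total[total]
    agreed = any(first == second for first, second in pairs)
    label = "agreement possible" if agreed else "scorers must have disagreed"
    print(total, pairs, label)
```

Every odd total (3, 5, or 7) comes only from pairs of scores that differ, so each one is direct evidence of a disagreement. An even total does not prove agreement, since a 4 can also come from a 1 and a 3, so counting odd totals gives only a lower bound on how often the scorers disagreed.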

Several years ago the state decided to save money and eliminate the second scoring of each paper. Now only one scorer grades a paper, and that score is multiplied by 2. The inconsistency of scorers probably remains, but there is no way to see it anymore. Now some students are rewarded with 2 extra points if they have an easy scorer or penalized 2 points if they have a strict one. It is easy to say that a few points here and there really don’t matter, but there are students denied diplomas based on a single point. Several years ago one of the largest high schools in Oakland was moved to a more punitive level of Program Improvement, which would have been avoided if only two students had received higher scores.
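To see why doubling a single score both hides the problem and raises the stakes, here is a small Python sketch. It is an illustration only: the function names are invented for this example, and the paper that deserves a 3 and the scorer who is one point too generous are hypothetical.

```python
def double_scored_total(first, second):
    """Old method: two scorers, scores added together."""
    return first + second


def single_scored_total(score):
    """Current method: one scorer, score multiplied by 2."""
    return 2 * score


# Hypothetical paper that a well-trained scorer would give a 3,
# read by one scorer who is a point too generous and gives it a 4.
deserved, generous = 3, 4

fair_total = single_scored_total(deserved)            # both methods give 6 when scoring is accurate
old_total = double_scored_total(deserved, generous)   # one accurate scorer, one generous scorer
new_total = single_scored_total(generous)             # the generous scorer is the only scorer

print("old method error:", old_total - fair_total)
print("new method error:", new_total - fair_total)

# Under the new method every total is even, so a disagreement can never show up.
possible_new_totals = sorted({single_scored_total(s) for s in range(1, 5)})
print("possible totals now:", possible_new_totals)
```

In this example the same one-point misjudgment moves the student's total by 1 point under the old method and by 2 points under the new one, and the odd totals that used to flag a disagreement can no longer occur.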

In California it is illegal for anyone at schools to read or score student work on writing samples that are part of the CST or Exit Exam. It is illegal to copy the student writing, and the student writing is never returned to the student, the parents, or the school. This is justified as a security measure, but since the topics students write on are not reused, that cannot be the real reason. This regulation exists solely to prevent anyone from being able to question the scores given by the testing company.

When I taught, I was accountable for the accuracy of the grade I gave every assignment or test. Papers were always returned, and any student or parent could challenge the grade and demand an explanation. Why should testing companies be protected from this kind of accountability?

For those of us who have worked in schools and have been charged with improving student scores, the most horrifying part of the book is Chapter 6, in which Farley tells of being part of a group of experts creating the scoring guide for the 1998 National Assessment of Educational Progress (NAEP). These experts developed a scoring guide and used it to score student papers for a week. At that point they were told that the psychometricians (those with degrees in statistics) reviewing their work decided they were not giving enough low scores.

These psychometricians did not read a single paper. They reached their conclusion by comparing the distribution of scores to the distribution of scores they expected to see when the tests were written. Was there an outcry from the English experts who had actually read and scored the exams? No. As Farley described it, they quickly adjourned to re-score the papers to match the distribution the stat people wanted. In other words, our education system is being judged by the results of tests on which the distribution of scores is created by statisticians before a single test answer is read.

Using that same logic, the officials at the Super Bowl would have ignored the score of the game and awarded the trophy to the Colts because that is what everyone predicted.

Making the Grades reveals some shocking truths about how we are judging schools and students; Farley’s book deserves a wide readership.

Katy Murphy

Education reporter for the Oakland Tribune. Contact me at kmurphy@bayareanewsgroup.com.

  • http://www.peterrichardson.blogspot.com Peter Richardson

    I acquired and edited this book for PoliPointPress and am pleased that you recognized its contribution–and even extended its analysis with your own firsthand experiences.

    By coincidence, the agent who brought it to us, Andy Ross, is an Oakland resident. (He used to own and operate Cody’s Books.)

  • Steven Weinberg

    Thank you, Peter, and Andy Ross, for helping get this important book published and distributed.

  • Alice Spearman

    Mr. Weinberg is so right. Now that we know what goes on when the tests are to be read, maybe we can coach our students a little differently. I am going to recommend this book to the superintendent and to all the principals in my district. It might help.

  • del

    Ms. Spearman, people in education know what’s going on. It is the public and their representatives that need to know this and demand change. We in education are the ones working hard, yet we are reviled and our schools are called “failing” based on psychometricians manipulating the data.

  • http://perimeterprimate.blogspot.com/ Sharon Higgins

    Georgia Schools Inquiry Finds Signs of Cheating, NY Times, 2/12/10

    ATLANTA — Georgia education officials ordered investigations on Thursday at 191 schools across the state where they had found evidence of tampering on answer sheets for the state’s standardized achievement test…


  • Nextset

    Very interesting article on a very big problem. CA runs the State Bar exam twice a year in concert with the other states’ Bar Exams. When I came through, half of the test was multiple choice and half was one-hour essays. Now there is a practical exam which is a small part of the test. You can’t get sworn in until you also pass a background review and an ethics test, which are managed in different settings. There are a number of protocols used to avoid cheating and protect the test’s reputation: fingerprinting of the test taker, test numbers and no names on the test books and sheets, second readings on some essays depending on the proposed score, and all materials returnable to anyone who fails. If you pass, you are entitled to nothing, don’t ask. I forget whether wrong answers on the multiple choice test get penalty points. That was done in school exams to discourage guessing without a clue.

    Of course each applicant pays hundreds of dollars for a test. So that funds all the protocols.

    Despite the sue-happy personalities involved, I haven’t heard much of a hue and cry that the test isn’t fair or fairly administered. I believe its operation uses state of the art testing science, including experimental questions embedded in the multiple choice exam that are not actually scored and the ability to remove a question after administration of the exam if the “correct” answer is deemed too much in dispute.

    One thing I am sure of is that if you allow the schools themselves to in any way administer high stakes testing of their students you are in no position to complain about cheating – you’ve set it up to have cheating. High Stakes testing must be independent of the teachers and administrators of the students involved to have any hope of a fair test.

    Europe does high-stakes testing of high school students. Can anybody here tell us the difference between their testing protocols and ours in CA? Or in the other US states?

  • Jim Mordecai


    Your statement, I believe, is an important truth with regard to how High Stakes testing is conducted: “One thing I am sure of is that if you allow the schools themselves to in any way administer high stakes testing of their students you are in no position to complain about cheating – you’ve set it up to have cheating. High Stakes testing must be independent of the teachers and administrators of the students involved to have any hope of a fair test.”

    I say amen to what you say would be required to have a fair test. Because High Stakes testing is not independently administered, you have a state such as Georgia finding that there is a whole lot of cheating going on. Currently, States do not provide much oversight on testing but are making a whole lot of High Stakes decisions. The reason laxity toward High Stakes cheating continues is economics: it will cost a lot of money to administer High Stakes tests independently.

    It used to be that cheating was mostly done by students, but it is a new day, and the adults are most likely leaving students behind in the number of cheaters, as their stake is often higher than that of the students.

    Jim Mordecai

  • Nextset

    Jim Mordecai: The state is designing the system for cheating, wants cheating, expects cheating. Not just in high stakes testing but in many other high-stakes number games. I have friends – lawyers – who tell me they believe San Francisco has been rigging municipal voting for some time. I don’t discount their beliefs; they were involved with candidates/elections there. We’ve all heard the stories about the Mafia voting the Chicago Graveyards for the Kennedy Presidential Election. CA & the US have a laughable motor-voter system and mail-in ballot system combined with failure to require proof of identity at the polls or proof of nationality at the registrar’s office. There is no intention of making voting honest. We pretend to vote and they pretend not to take bribes, I guess.

    A true voting system would start fresh re-doing the registrations under the old method and require re-registering in person with a US Passport, subsequent voting to be in person with the Passport or Driver’s License/State ID. Ballot to be a paper ballot readable by eye although the ballots may be written with the assist of a computer and counted optically by machine. In the event of a dispute ballots could be counted by hand in any particular precinct. This will never happen under the current politicians.

    If anyone wants to have a counting or tally system with high fidelity, we all know how to do it. When you hear people with straight faces proposing high stakes vote counting (or anything else) with electronic means that eliminate the paper and a hard-to-fake audit trail, they are planning to cheat.

    As far as the schools go, multiple choice exams are easier to grade, and it is easier to return the exams or copies of them. Essays are far more difficult to grade and standardize and take far longer. It is possible that the schools just can’t afford to use them on a mass exam. Not if they are going to use the expert and labor intensive methods the CA State Bar uses to read and grade its essay exam twice a year. And I do note they return all the essays (photocopied) on demand to a failing candidate, along with the model answer for each essay question.

    High Stakes testing fraud is here to stay because the government wants it so. The consumers (employers) of the “products,” the students, are not fooled, and they can easily sort candidates. They are hiring for IQ. They use various things as proxies for IQ testing: the school you attended, the program you graduated in, your scores perhaps on standardized tests independently administered (SAT, etc). Once they have eliminated the bulk of the wanna-be’s, they have a reasonable number to interview or to sort further by internship performance. Colleges and grad programs know how to pick an entering class. They can do that by actuarial science no matter how grade inflation is running. Identify factors common to your flunking students and your star students, whatever those factors may be (Proxies for race? Gender? Eagle Scout status?), select and deselect accordingly, and you have a stable class.

    So the consumers aren’t complaining, the test subjects aren’t complaining, the government is happy when they have desired numbers to report.

    So these tests we are discussing here are all about self delusion. The market isn’t fooled. So maybe no real harm is being done, except to those who don’t want to keep their eyes open. No reason to change anything here I suppose.

    And I don’t trust the government’s unemployment numbers either.

  • Nextset

    One more thing on the Bar essays that might be significant. The Bar doesn’t read every essay. When the person has a high enough score on the first several essays read, and they have done well on the multiple choice scores, it is assumed he or she is a “passer” and they are passed without bothering to read the rest of the essays. By this means the total number of essays that must be graded is sharply reduced.

    When a candidate has borderline scores as the first few essays are read, they remain in process and the rest of their essays are worked on. I don’t know if essay reading is stopped when the person is scoring so badly on the multiple choice and/or the first 3 essays that further grading is pointless. It would make sense to do so if only a pass/fail is relevant, but I believe they still score the loser’s writing till the end. This policy was begun after the experts documented the correlation between profiles of the winners and the losers and showed that scoring every essay made no difference once the winning (or losing?) profile was certain. And it did save money and time.

    Don’t even get me started on the racial gap here. Go to the State Bar’s website and look at the Bar Exam stats which are posted by race and by school. A friend was part of an old study at a large UC school that reviewed the bar passage stats and discovered that when the minority students were removed from the analysis the school’s pass rate went up to nearly all passing. The fail to pass gap the school was experiencing was directly tied to the minority admissions. What to do?

    A non UC Law School I know well flunks 1/3 of its graduating class up to 2 weeks before graduation (using “tests”), culling anyone likely to not pass the Bar. Guess who?? It’s pretty brutal. But they have a sensational pass rate now.

    This may sound off the subject but no, it isn’t. All this high stakes testing goes right to the heart of “The Gap”. From DMV’s testing to high school graduation tests to professional licensing exams. If we sharpen the testing, the “Gap” becomes a fresh topic again. If we fudge the testing, it subsides.

    God forbid we deal with the issue of the Gap directly.

    Brave New World.

  • Alanya Snyder

    I enjoyed this review and look forward to hearing what Steve Weinberg is reading next!

  • Steven Weinberg

    Thank you, Alanya. I’m not sure when I’ll write another full book review, but I did just finish “Whatever It Takes” by Paul Tough, about Geoffrey Canada and the Harlem Children’s Zone. It looks at what it would take to provide inner city children with the same level of support that middle class children receive. I was particularly interested in the challenges he faced in opening a middle school, as most of my teaching experience was at that level. (Spoiler Alert: He pulled the plug on his first middle school attempt because he could not generate the results he wanted, despite having an almost unlimited budget. He is now trying to rebuild the middle school with students who went through his pre-school to fifth grade programs. There are not enough data yet to see if this attempt will be successful.)

    Note to Nextset: The book contains a discussion of “The Bell Curve,” particularly a critique of the book by James Hickman, who is the first person that Charles Murray lists in the book’s acknowledgements. Hickman thinks the book has much that is valuable, but that the data it claims measure IQ are actually measures of cognitive ability and achievement and can be altered by education, and thus Murray’s conclusions about the IQs of different groups are not correct.