The Wrong Scoreboard

This is Part 1 of a two-part series. Part 2, "The New Scoreboard," examines what should replace grades and test scores.

This is a story about American education, where the scoreboard is bolted to the wall by federal law and a $39 billion industry. The problems it describes — grading systems that reward compliance over curiosity, standardized tests that a machine can ace without understanding a single question — exist in classrooms around the world. But the laws, the institutions, and the money that keep the wrong scoreboard in place are distinctly American.

Elisabeth Gruner has been teaching college English for more than thirty years. She has taught Victorian literature, children's literature, and creative nonfiction at the University of Richmond, where she directs the First-Year Seminar program and Writing Across the Curriculum. She is, by every institutional metric, a successful professor. Her students complete her courses. Her evaluations are strong. Her university promoted her work in its alumni magazine.

In 2018, she stopped putting grades on student work.

She did not stop assessing. She did not stop reading every sentence her students wrote. She replaced the grade with something older and more demanding: she wrote comments. Specific, diagnostic comments on every draft, with ample opportunity to revise and resubmit. At the end of the semester, students compiled a portfolio of their revised work and wrote a reflective essay evaluating their own learning. Gruner read the portfolios, the goal statements, and the reflections. Then she submitted a final grade, because the university required one.

Her first class was incredulous. "If we ask you," one student said, "will you tell us what grade we have on a paper?" No, Gruner told them. She had not put one on it.^[1]

What happened next should have ended the debate. Students read her comments. For the first time in three decades of teaching, her students actually engaged with the feedback she had spent her career writing. They worked harder. They revised willingly, not because a grade was at stake but because the quality of the work mattered to them. They took risks in their writing that graded students never took, because a risk that fails is a learning experience in an ungraded classroom and a GPA catastrophe in a graded one.^[2]

The practice is called "ungrading," and Gruner's account of it, published in The Conversation in April 2022, was distributed by the Associated Press to more than forty outlets, from PBS NewsHour to Yahoo Finance to Channel News Asia.^[3] Guy Kawasaki interviewed her on his podcast.^[4] The University of Richmond featured her in its magazine, framing ungrading as part of a growing portfolio of alternative assessment practices across the institution, including specifications grading, contract grading, and mastery grading. STEM faculty at Richmond began exploring the same approach. "The point," one professor told the magazine, "is helping students learn the material rather than focusing on their performance on exams."^[5]

Gruner did not discover something new. She uncovered something the system had been hiding: the grade was not measuring learning. It was preventing it.

She had proved the scoreboard was broken. She did not yet know that a machine was about to prove why.

The Score the Machine Can Beat

Albert Anker, "Das Schulexamen" (1862). Kunstmuseum Bern. The examination as public performance: a child stands before the class, evaluated on what can be observed and measured. The painting has not aged. The practice has not changed. Public domain.

OpenAI released GPT-4 in March 2023 and published its performance on a battery of standardized academic exams. The results were not subtle. GPT-4 scored 1410 out of 1600 on the SAT, including 700 out of 800 in math, which placed it at roughly the 89th percentile of human test-takers on that section. It passed the Uniform Bar Examination at roughly the 90th percentile, though independent analysis later revised that estimate downward to approximately the 62nd percentile when measured against first-time takers.^[6] It cleared multiple Advanced Placement exams. It performed, by any reasonable definition, as a strong student.

It understood nothing.

GPT-4 is a statistical model. It predicts the next token in a sequence based on patterns in its training data. It does not comprehend the law it passed the bar on. It does not understand the mathematics it scored in the 89th percentile on. It produces outputs that are indistinguishable from understanding, and that indistinguishability is the point. If a machine that does not understand anything can ace the test, the test is not measuring understanding. It is measuring the ability to produce patterns that look like understanding. And that is exactly what we have been grading students on.

The standardized testing industry in the United States is a $1.7 billion market in K-12 alone, embedded in a broader $39 billion ecosystem.^[7] It produces the scores that determine school funding, teacher evaluations, student placements, and college admissions. It is, by any measure, the scoreboard of American education.

A machine just posted a top-decile score on that scoreboard without understanding a single question it answered. The industry's response has been revealing.

The College Board launched a GenAI incubator in 2024. Its public framing was about "safeguarding authentic student learning."^[8] It said nothing about what GPT-4's performance means for the test itself.

ACT's response came closer to the truth. In a December 2024 interview with Education Week, Joanna Gorin, ACT's vice president of design and digital science, said the quiet part almost out loud: "There's incredible promise from AI, and it can potentially get them the kind of information they really want."^[9]

Gorin was not talking about students. She was talking about the institutions that purchase the tests – states, districts, and schools – and she was admitting, almost in passing, that what those institutions actually want is not what the current tests deliver.

The Law That Cannot Adapt

The Every Student Succeeds Act, signed in 2015, requires every state to administer annual standardized tests in reading and mathematics in grades three through eight and once in high school. Science must be tested at least once in each grade band. States must report results publicly, disaggregated by subgroup: race, income, disability status, English proficiency.^[10]

ESSA replaced No Child Left Behind, which had attached punitive consequences to test scores, including school closure and mandatory restructuring. ESSA softened the penalties. It did not soften the mandate. The tests remain compulsory. The reporting remains compulsory. The infrastructure that administers, scores, and publishes the results remains intact and funded.

No provision in ESSA accounts for the possibility that the tests themselves might become meaningless. The law was written in a world where standardized test performance was a reasonable proxy for student learning. That world ended in March 2023, and the law has not noticed.

The result is a system locked in place by statute. Schools are legally required to administer and report tests that a machine can pass without comprehension. Teachers are evaluated, in part, on how well their students perform on those tests. Students are sorted, tracked, and admitted to colleges based on scores that are now achievable by software that does not know what a college is.

The scoreboard is bolted to the wall by statute. What it measures has already changed.

The Extinction in the Classroom

Jean-Siméon Chardin, "Young Student Drawing" (c. 1738). Kimbell Art Museum. A child absorbed in the process, not the grade. Public domain.

The wrong scoreboard would be merely wasteful if all it did was measure the wrong thing. But Susan Engel's research suggests it does something worse: it actively destroys the thing it should be measuring.

Engel is a developmental psychologist at Williams College and the author of The Hungry Mind: The Origins of Curiosity in Childhood. She spent years observing classrooms and counting something no one else was counting. She counted curiosity. She defined it operationally as questions, intent and directed gazing, and the manipulation of objects, and she measured how often these episodes occurred in real classrooms at different grade levels.^[11]

The results were stark. Curiosity episodes occurred 2.36 times in a two-hour stretch in kindergarten. By fifth grade, the rate had dropped to 0.48. Many fifth-grade classrooms yielded not a single student question across a full two-hour observation. Eleven-year-olds were sitting in rooms dedicated to learning for hours at a time without indicating anything they wanted to know about.^[12]

The decline is not developmental. Children do not lose curiosity because they mature out of it. The decline is adaptive. Children learn that curiosity is not rewarded by the system they inhabit, and they stop. They adapt their knowledge-seeking strategies to meet the demands of formal education, and formal education does not demand curiosity. It demands compliance with measurable outcomes.

Engel documented the mechanisms. Teachers give perfunctory responses to curious questions and redirect students to the lesson plan. In one classroom, she heard a teacher tell a child: "I can't answer questions right now. Now, it's time for learning." As if questions and learning were opposites.^[13]

The pressure comes from the scoreboard. Teachers are under institutional mandate to hit learning goals that are "obvious, explicit, and measurable." A child's spontaneous question about why Saturn has rings during a math lesson is, from the scoreboard's perspective, an interruption. The more packed and goal-oriented the school day becomes, the less room there is for wondering. The scoreboard tells the teacher: get back to the plan. The child learns the lesson the scoreboard teaches: don't ask.

Engel's survey of teachers revealed how the system perpetuates itself. Over 75 percent of teachers circled "curiosity" when given a list of qualities they wanted to nurture. But when asked to generate qualities without a list, almost none of them named it.^[14] Curiosity is invisible to them as a goal, because the scoreboard they are evaluated on does not include it.

The system is not failing to measure curiosity. It is succeeding at extinguishing it.

What AI Did to the Grade

Albert Anker, "Die Dorfschule von 1848" (1896). Kunstmuseum Bern. Different children, different starting points, one system measuring them all the same way. Public domain.

The scoreboard's numbers are climbing. A 2025 study by the Centre for Economic Policy Research tracked 36,000 students across 6,000 courses at a leading Israeli university from 2018 to 2024. In courses where AI could help – take-home essays, projects, written assignments – grades went up after ChatGPT's release. They went up most for the lowest-performing students. The grade distribution compressed. The researchers titled their findings with four words that captured the problem: "Grades up, signals down."^[15]

The grades looked better than ever. What they meant had never been less clear.

Jennifer Hurley, who has taught English at Ohlone College in the Bay Area since 2001, recognized this problem before AI made it visible. Ohlone is a community college. Hurley's students are working adults, first-generation students, and formerly incarcerated learners.^[16] She stopped grading years ago, because she saw what the grade was actually measuring. "My students actually worked much harder when they were not being graded," she said, "which contradicts the common wisdom that grades are essential to student motivation."^[17]

Gruner, from her classroom at Richmond, named the deeper problem: "Sometimes what I was really grading was a student's background."^[18] The student who writes polished prose in their first draft may have attended a private school with a twelve-to-one ratio. The student who writes rough prose may be composing in their second language after a night shift. Grade the output, and you grade the origin story.

AI has made this equation both better and worse. A 2024 Carnegie Mellon study found that generative AI reduced graduate students' writing time by 57 percent and improved quality from an A- to an A, with the largest gains among ESL students.^[19] The gap between native and non-native speakers narrowed. That is good for equity. It is devastating for the scoreboard. If AI closes the gap, the grade can no longer tell you who learned and who was leveled by a machine.

And the leveling comes at a cost. A study of tens of thousands of admissions essays at a selective American college found that writing became measurably more homogeneous after AI tools became widely available in 2022. The greatest convergence appeared among lower-income students and students who were rejected.^[20] The students who most needed their voices to stand out were the ones whose voices were being flattened. AI did not just erase the gap. It erased the person.

The scoreboard's numbers are rising. The signal is collapsing. The students who were already invisible to the system are becoming more so, not less.

The Grade She Still Has to Give

Elisabeth Gruner proved the scoreboard was wrong. Her university agrees. Her colleagues in STEM are following her lead. Guy Kawasaki interviewed her about it. The Associated Press distributed her findings to more than forty outlets on four continents.

At the end of every semester, she opens the university's grading portal and submits a final letter grade for each student. The system requires it. The transcript requires it. The graduate schools that will evaluate her students require it. The employers who will never read a portfolio but will glance at a GPA require it. The pipeline from K-12 testing to college GPA has no input field for "this student's curiosity increased" or "this student took risks she never would have taken if I had been scoring her."

Gruner removed the wrong scoreboard from her classroom. She could not remove it from the system her classroom exists inside. The scoreboard is not a choice individual teachers make. It is an infrastructure, reinforced by law, by institutional policy, by a $39 billion industry, and by the simple fact that a letter grade fits in a database field and a portfolio does not.

Meanwhile, a machine that has never been curious about anything sits in the 89th percentile of the SAT. It has never taken a risk to see what would happen or revised a sentence because it cared about the quality of its own thinking. A teacher's comment will not change its mind, because it does not have one. It scored higher than most of the students in Gruner's classroom, and it will score higher again tomorrow, and the day after, and every day until someone changes what the scoreboard measures.

The question is not whether the scoreboard is wrong. Gruner showed it at Richmond. Engel counted it disappearing in real classrooms. Hurley lived it until she nearly quit. A language model confirmed it by accident.

The question is what the new scoreboard should look like — what it would measure, who is already building it, and why AI makes it possible for the first time.

Footnotes

This is Part 1 of a two-part series. Part 2, "The New Scoreboard," examines what should replace grades and test scores — the research that proves curiosity can be measured, the schools and countries already doing it, and the tools that make the learning process visible in an AI-mediated world.

Sage.Education