The New Scoreboard

This is Part 2 of a two-part series. Part 1, "The Wrong Scoreboard," examined why grades and standardized test scores were always the wrong metric — and how AI proved it.

In 2011, a meta-analysis reviewed two hundred studies on academic performance and arrived at a finding that should have rewritten education policy. Sophie von Stumm, Benedikt Hell, and Tomas Chamorro-Premuzic, researchers at the University of Edinburgh, the University of Konstanz, and Goldsmiths, University of London, identified three pillars that predict how well a student will do in school: intelligence, effort, and intellectual curiosity. Curiosity and effort rivaled intelligence in predicting outcomes.^[1]

Nobody disputed the finding. It was simply ignored, because curiosity was considered too fuzzy to measure, too subjective to standardize, and too inconvenient to build a system around. Intelligence had IQ tests. Effort had attendance records and homework completion rates. Curiosity had nothing that fit in a database field.

That was 2011, and it is no longer true.

The Instruments Exist

Johannes Vermeer, "The Geographer" (1669). Städel Museum. A scholar absorbed in inquiry, measuring the world on his own terms. Photo: jimmiehomeschoolmom, CC BY 2.0.

Todd Kashdan, a professor of psychology at George Mason University, spent years developing what the field lacked: a validated instrument for measuring curiosity. The result, published in the Journal of Research in Personality, is the Five-Dimensional Curiosity Scale.^[2] It measures joyous exploration (the pleasure of seeking new information), deprivation sensitivity (the discomfort of not knowing), stress tolerance (the willingness to embrace uncertainty), social curiosity, and thrill seeking. It has been validated across populations, including children as young as nine.^[3] It is not a vague aspiration. It is a psychometric instrument, freely available to any school that wants to use it.

After Kashdan published, the objection that curiosity was "too fuzzy to measure" should have died. It persists because the people running the scoreboard have not read the research.

At the University of Virginia, Jamie Jirout built the next piece. Her Curiosity in Classrooms Framework, published in Frontiers in Psychology in 2022, is the first validated observation instrument for coding curiosity-promoting instruction in real classrooms.^[4] Where Kashdan's scale measures the student's curiosity, Jirout's framework measures whether the teacher is fostering it.

Her feasibility study was damning. Jirout observed thirty-five elementary math lessons and found that fewer than half included any language recognizing comfort with uncertainty.^[5] The instrument works. Most classrooms do not.

At Stanford, the SPARK Lab's MAGIC Project (Measuring Acquisition and Growth of Inquiry and Curiosity) is building performance-based assessment tasks that measure curiosity, creative thinking, problem-solving, and scientific inquiry in young learners. The project is designed for equity from the ground up, validated across diverse language, racial, ethnic, and socioeconomic backgrounds. Preliminary findings were presented at the Society for Research in Child Development conference in 2025.^[6]

Three instruments from three institutions, each approaching the same problem differently: making visible the thing the old scoreboard was designed to ignore.

The Countries That Changed the Game

Jean-Honoré Fragonard, "Young Girl Reading" (c. 1770). National Gallery of Art. Absorption, not performance. The process of learning as its own reward. Public domain.

The research would be academic if no one had acted on it. Someone has. Several someones, governing entire nations — and AI has turned their experiments into vindication.

Singapore's Ministry of Education eliminated all mid-year examinations in primary and secondary schools by 2023. Primary 1 and 2 students receive no weighted assessments at all; progress is measured with qualitative descriptors only. Older students are assessed through class tests, quizzes, presentations, and group projects distributed throughout the term. The ministry's stated goal was to "nurture joy for learning."^[7] The results arrived faster than the skeptics expected. Singapore's fifteen-year-olds topped the 2022 PISA creative thinking assessment globally.^[8] Teachers reported that alternative assessment methods such as project work made students more creative.^[9]

Then AI arrived in every student's pocket, and Singapore's reform looked less like an experiment and more like preparation. A system that had already stopped trusting final outputs as the measure of learning did not have to panic when a machine became capable of producing those outputs. The countries still grading essays and exam answers are the ones scrambling to detect AI-generated work. Singapore was already looking at something else.

Singapore still administers the PSLE at Primary 6.^[10] Even the world's most successful reformer has not fully abandoned the old scoreboard. But they proved that removing large pieces of it produces the outcomes the scoreboard claimed to care about — and left themselves ready for a world where the old metrics measure nothing a machine cannot replicate.

New Zealand built its curriculum around inquiry learning, starting from students' questions rather than prescribed content, and ranked fifth out of eighty-one countries in creative thinking on the 2022 PISA assessment — without requiring standardized exams.^[11] Even South Korea, a country synonymous with test pressure, has shifted: its 2024 curriculum revision allocates ten to fifteen percent of time to Creative Experiential Learning and specifies that tests should focus on complex tasks rather than multiple-choice questions.^[12] The direction is the same everywhere: away from outputs a machine can generate, toward processes it cannot fake.

The American Infrastructure

The international evidence would be irrelevant to American schools if the institutional infrastructure to support alternatives did not exist. It does, and AI is forcing it to grow faster.

The Mastery Transcript Consortium, founded in 2017, has grown to more than four hundred member schools replacing GPA and credit hours with competency demonstrations and multimedia portfolios. As of July 2025, seven hundred and one colleges and universities accept MTC Learning Records, including Harvard, Stanford, and MIT.^[13] Seven hundred and one admissions offices have decided that a portfolio demonstrating competency is an acceptable substitute for a number. That decision was forward-looking in 2017. It is now urgent. When a student can prompt an AI to generate an A-quality essay, the essay grade tells you nothing. The portfolio — showing the student's thinking, process, and decision-making across time — tells you what the grade never could.

The ungrading movement, documented in Part 1 through Elisabeth Gruner and Jennifer Hurley, has accumulated the research to match. A 2025 study found that removing grades decreases stress, increases intrinsic motivation, and reduces surface learning. Students in ungraded environments pursue communal rather than individualistic approaches, and higher grade distributions are matched by higher-quality work.^[14] Multiple universities have adopted gradeless first-year programs. The pattern is the same one Singapore discovered at national scale: remove the scoreboard that rewards playing it safe, and students start doing the work that matters. AI sharpens the point. In a graded classroom, the rational move is now to let the machine produce the safest possible output. In an ungraded classroom, there is no reward for outsourcing the thinking — because the thinking is the point.

Amir Nathoo, the founder of Outschool, captured the principle: "You can't expect sending kids to an institution whose design hasn't fundamentally changed in hundreds of years with centrally set curriculum is going to work."^[15]^[16] Kelly Van Sande's Ignite Learning Academy — self-paced, top-ranked in Arizona, 99 percent satisfaction — serves the students who make the old scoreboard's failure most visible: gifted learners whose curiosity takes them years ahead in some domains and sideways in others.^[17] AI did not create the case for these alternatives. It eliminated the last argument against them.

The Black Box Problem

Workshop of Rembrandt, "A Young Scholar and his Tutor" (1629). Getty Center. The tutor can see the work. The process is visible. That visibility is the teaching. Public domain.

Research, instruments, schools, countries, consortia, movements — they all exist. So why has the old scoreboard not been replaced?

Part of the answer is legal inertia, documented in Part 1. Part of it is a thirty-nine-billion-dollar industry with no incentive to replace itself. But there is a third obstacle, and it is the one that matters most for what comes next: AI has made the learning process invisible.

Students are now doing significant thinking inside AI conversations — iterating on ideas, testing arguments, making decisions about what to keep and what to discard. Or they are not. They are pasting a prompt, accepting the first output, and submitting it. With every AI education tool currently available to schools, educators cannot tell which one is happening.

Each of those tools is a black box. Students interact with AI inside a chat window; educators see only the final output: the essay, the answer, the project. Iterations, dead ends, moments of risk-taking or cognitive laziness, the process of thinking with AI or outsourcing thinking to AI: all of it is invisible. Without seeing the process, educators cannot provide feedback on it. All that remains to evaluate is the output.

And evaluating the output is the wrong scoreboard.

This is the paradox that Part 1 identified: Gruner proved that removing grades makes students engage with learning. Engel proved that the system extinguishes curiosity. Hurley proved that grading output is grading background. The research proves that curiosity can be measured and that schools measuring it outperform schools that do not. But AI has made the learning process invisible at the exact moment when visibility into that process is the thing educators need most.

The new scoreboard requires a new kind of tool. Not one that grades the process. Grading the process would reproduce the same distortion Gruner discovered: students would optimize for what looks like good AI interaction rather than following their curiosity or taking risks. The tool needs to make the process visible without scoring it, so that educators can see where students are thinking and where they have stopped, and provide feedback and guidance rather than a grade.

Sage.Education was built for this. It is the only AI education platform where educators can see a student's full conversation map, including regenerations and iterations, giving them insight into how students are actually interacting with AI. Educators can see whether a student is co-piloting (thinking with AI, steering it, pushing back, keeping their voice) or being cognitively lazy (accepting the first output and outsourcing the thinking). They can identify where students need guidance on how to use AI better while keeping their voice. Students and educators build their own AI agents and tools on the platform: rather than consuming someone else's rubric engine, they create instruments that fit their own classrooms and curiosities. The platform is open-source, self-hostable, and privacy-first, with student data staying under school control.^[18]
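
For readers who want the idea of a conversation map made concrete, here is a minimal Python sketch of one way such a map could be represented: a tree in which a regeneration appears as a second AI reply to the same student prompt. It is an illustration under stated assumptions, not Sage.Education's actual data model; the names Turn, reply, and describe are hypothetical. In keeping with the argument above, describe returns plain counts for an educator to look at, not a score.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One message in a conversation; children are the messages that follow it.

    Hypothetical structure for illustration only.
    """
    role: str  # "student" or "ai"
    text: str
    children: List["Turn"] = field(default_factory=list)

    def reply(self, role: str, text: str) -> "Turn":
        """Attach a follow-up message to this turn and return it."""
        child = Turn(role, text)
        self.children.append(child)
        return child


def describe(root: Turn) -> Dict[str, int]:
    """Return descriptive counts for a conversation tree (no score, no grade)."""
    student_turns = 0
    regenerations = 0
    longest_path = 0
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        longest_path = max(longest_path, depth)
        if node.role == "student":
            student_turns += 1
            ai_replies = [c for c in node.children if c.role == "ai"]
            # More than one AI reply to the same prompt means the student regenerated.
            regenerations += max(0, len(ai_replies) - 1)
        for child in node.children:
            stack.append((child, depth + 1))
    return {
        "student_turns": student_turns,
        "regenerations": regenerations,
        "longest_path": longest_path,
    }


# A tiny invented conversation: one regeneration, then a student pushing back.
root = Turn("student", "Draft an opening paragraph about Vermeer.")
draft_a = root.reply("ai", "Draft A")
draft_b = root.reply("ai", "Draft B")  # regenerated answer to the same prompt
pushback = draft_b.reply("student", "Keep my first sentence, but cut the cliche.")
pushback.reply("ai", "Revised draft B")

summary = describe(root)
print(summary)
```

A student who pasted one prompt and accepted the first output would produce a tree with a single student turn and no branches. The counts do not judge either pattern; they only show an educator where to start the conversation.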

The new scoreboard is not a number. It is visibility into the process of thinking. It is the conversation map that shows an educator where curiosity led and where it stopped, so they can step in not with a grade but with a conversation: tell me what happened here. Tell me what you were trying to find out. Tell me what you would try next.

The Question That Answers Itself

Elisabeth Gruner removed the grade from her classroom and discovered that the grade had been the obstacle. Her students engaged with learning for the first time in thirty years of teaching. Susan Engel counted curiosity in classrooms and discovered that the system was extinguishing it at a rate of eighty percent between kindergarten and fifth grade. Todd Kashdan proved that curiosity can be measured across five validated dimensions. Jamie Jirout proved that curiosity-promoting instruction can be observed and coded. Singapore abolished mid-year exams and topped the world in creative thinking. Seven hundred and one colleges accept transcripts that contain no grades at all.

The question from Part 1 was: if the scoreboard is wrong, what is the right one?

The answer has been in the research for fifteen years, in the classrooms for longer, and in the data from every country that outperforms the United States on the metrics that actually matter. The right scoreboard does not grade. It makes the process of learning visible, so that the people whose job it is to teach can see where curiosity leads, where it stalls, and where it needs a teacher, not a score.

The wrong scoreboard measured whether students could produce the right output. A machine can do that now. The right scoreboard measures whether students want to know more.

Gruner's students, the ones who read her comments for the first time when she stopped grading them, already knew the answer. The scoreboard was in the way. Remove it, and the learning is right there. It always was.

Footnotes

Sage.Education