Monday, December 29, 2025

Fighting Fire with Fire: Scalable Oral Exams with an ElevenLabs Voice AI Agent

It all started with cold calling.

In our new "AI/ML Product Management" class, the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing" good.

So we started cold calling students randomly during class.

The result was... illuminating. Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all. This gap was too consistent to blame on nerves or bad luck. If you cannot defend your own work live, then the written artifact is not measuring what you think it is measuring.

Brian Jabarian has been doing interesting work on this problem, and his results both inspired us and gave us the confidence to try something that would have sounded absurd two years ago: running the final exam with a Voice AI agent.


Why oral exams? And why now?

The core problem is simple: students have immediate access to LLMs that can handle most exam questions we traditionally use for assessment. The old equilibrium, where take-home work could reliably measure understanding, is dead. Gone. Kaput.

Oral exams are a natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. The problem? Oral exams are a logistical nightmare. You cannot run them for a large class without turning the final exam period into a month-long hostage situation.

Unless you cheat.


Enter the Voice Agent

We used ElevenLabs Conversational AI to build the examiner. The platform bundles the messy parts (speech-to-text, text-to-speech, turn-taking, interruption handling, …) into something usable. And here is the thing that surprised me: a basic version for a low-stakes setting (e.g., an assignment) can be up and running in literally minutes. Minutes. Just write a prompt describing what the agent should ask the student, and you are done.

Two features mattered a lot for our setup:

  • Dynamic variables: pass the student's name, project details, and other per-student context into the conversation as parameters (see the sketch after this list)
  • Workflows: build a structured flow with sub-agents instead of a single "chatty" agent trying to do everything
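
To make the dynamic-variables piece concrete, here is a minimal Python sketch of how the per-student context could be assembled. The roster file, its column names, and the placeholder syntax in the comment are illustrative assumptions, not our actual setup; the real prompt and agent configuration are linked at the end of the post.

    import csv

    def dynamic_variables_for(net_id, roster_path="roster.csv"):
        """Build the per-student payload injected into the conversation.

        Assumes a hypothetical roster.csv with columns net_id, name,
        project_title, project_summary; the returned dict is what gets passed
        to the agent as dynamic variables when the conversation is created.
        """
        with open(roster_path, newline="") as f:
            for row in csv.DictReader(f):
                if row["net_id"] == net_id:
                    return {
                        "student_name": row["name"],
                        "net_id": row["net_id"],
                        "project_title": row["project_title"],
                        "project_summary": row["project_summary"],
                    }
        raise KeyError(f"No roster entry for net id {net_id!r}")

    # The prompt then references these values by name as placeholders,
    # e.g. "You are examining {{student_name}} about {{project_title}}".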

What the exam looked like

We ran a two-part oral exam.

Part 1: "Talk me through your project." The agent asks about the student's capstone project: goals, data, modeling choices, evaluation, failure modes. This is where the "LLM did my homework" strategy dies. You can paste an assignment into ChatGPT. It is much harder to improvise consistent answers about specific decisions when someone is drilling into details.

Part 2: "Now do a case." The agent picks one of the cases we discussed in class and asks questions spanning the topics we covered: basically testing whether students absorbed the material or just showed up.

To handle this structure, we split the exam into sub-agents in a workflow:

  1. Authentication agent: Asks for the student's ID and refuses to proceed without a valid one. (In a more productized version, we would integrate with NYU SSO instead of checking against a list.)
  2. Project discussion agent: Gets project context injected via parameters. The prompt includes details of each project so the agent can ask informed questions. The next step is obvious: connect retrieval over the student's submitted slides and reports so the agent can quote and probe precisely.
  3. Case discussion agent: Selects a case and runs structured questioning. Again, RAG would help with richer case details.

This "many small agents" approach is not just aesthetic. It prevents the system from drifting into unbounded conversation, and it makes debugging possible.


By the Numbers

  • 36 students examined over 9 days
  • 25 minutes average (range: 9–64)
  • 65 messages per conversation on average
  • 0.42 USD per student (15 USD total)
  • 89% of LLM grades within 1 point
  • Shortest exam (9 min) → highest score (19/20)

The economics

Let's talk money.

Total cost for 36 students: 15 USD.

That's 8 USD for Claude (the chair and heaviest grader), 2 USD for Gemini, 0.30 USD for OpenAI, and roughly 5 USD for ElevenLabs voice minutes. Forty-two cents per student.

The alternative? 36 students × 25-minute exam × 2 graders = 30 hours of human time. At TA rates (~25 USD/hour), that's 750 USD. At faculty rates, it's "we don't do oral exams because they don't scale."

For 15 dollars, we got: real-time oral examination, a three-model grading council with deliberation, structured feedback with verbatim quotes, a complete audit trail, and—as you'll see—a diagnosis of our own teaching gaps.

The unit economics work. But as we will see next, the real benefit is the value delivered, not the 50x cost savings.


What broke (and how we fixed it)

The first version had problems. Here is what we learned.

1) The voice was intimidating

A few students complained that the agent sounded severe. We had cloned Foster Provost's voice because, frankly, his clone was much more accurate than the clones of our own voices. But the students found it... intense. Here is an email from a student:

I had prepared thoroughly and felt confident in my understanding of the material, but the intensity of the interviewer's voice during the exam unexpectedly heightened my anxiety and affected my performance. The experience was more triggering than I anticipated, which made it difficult to fully demonstrate my knowledge. Throughout the course, I have actively participated and engaged with the material, and I had hoped to better demonstrate my knowledge in this interview.

And here is another:

Just got done with my oral exam. [...] I honestly didn't feel comfortable with it at all. The voice you picked was so condescending that it actually dropped my confidence. [...] I don't know why but the agent was shouting at me.

Fix: We are split on this one. We love FakeFoster. But next time we will A/B test other voices. At the end of the day, we want to optimize for comprehension, not charisma. ElevenLabs has guidance on voice and personality tuning; they treat this as a product design problem, which is probably the right way to think about it.

2) The agent stacked questions

This was the biggest real issue. The agent would ask something like: "Explain your metric choice, and also tell me what baselines you tried, and why you did not use X, and what you would do next."

That is not one question. That is four questions wearing a trench coat. The cognitive load for an oral exam is already high. Stacking questions makes it brutal.

Fix: A hard rule in the prompt: one question at a time. If you want multi-part probing, chain it across turns. For grading, we added an "interference protocol": students received full credit if they were hit with a stacked question and answered only part of it.

3) Clarifications became moving targets

Student: "Can you repeat the question?"
Agent: paraphrases the question in a subtly different way

Now the student is solving a different problem than the one they were asked. Very frustrating.

Fix: Explicit instruction in the prompt: repeat verbatim when asked to repeat. No paraphrasing. Same words.

4) The agent did not let students think

Humans rush to fill silence. Agents do too. Students would pause to think, and the agent would jump in with follow-up probes or, worse, interpret the silence as confusion and move on.

Fix: Tell the agent to allow think-time without probing aggressively. It made the exam feel less like an interrogation. We also increased the time-out before the agent asks "Are you there?" from 5 to 10 seconds.
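
Fixes 2 through 4 all live in the prompt. The lines below are an illustrative paraphrase of what those guardrails look like, not the actual prompt (which is linked in the resources at the end of the post).

    # Illustrative paraphrase of the examiner guardrails.
    EXAMINER_RULES = """
    - Ask exactly ONE question per turn. If you need a multi-part probe,
      spread it across separate turns.
    - If the student asks you to repeat a question, repeat it VERBATIM.
      Do not rephrase, shorten, or add hints.
    - After asking a question, wait. Silence usually means the student is
      thinking; do not treat a pause as confusion, and do not pile on
      follow-up probes.
    """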

5) Lack of randomization

We asked the agent to "randomly select" a case study. It did not.

From December 12–18, when Zillow was in the case list, the agent picked Zillow 88% of the time. After we removed Zillow from the prompt on December 18, the agent immediately latched onto Predictive Policing—picking it for 16 out of 21 exams on December 19 alone.

LLMs are not random. They have implicit preferences and ordering biases. Asking an LLM to "pick randomly" is like asking a human to "think of a number between 1 and 10"—you're going to get a lot of 7s.

Fix: Pass an explicit random number as a parameter and map it to cases deterministically. Do the randomization in code, not in the prompt.
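
A minimal sketch of that fix in Python. The case names beyond Zillow and Predictive Policing are placeholders, and the hashing scheme is just one reasonable way to get a stable, auditable assignment.

    import hashlib

    # Only Zillow and Predictive Policing appear in this post; the other
    # entries are placeholders for the rest of the case list.
    CASES = ["Zillow", "Predictive Policing", "Case 3", "Case 4"]

    def pick_case(net_id, exam_date):
        """Deterministically map student + date to a case; reproducible and auditable."""
        digest = hashlib.sha256(f"{net_id}:{exam_date}".encode()).hexdigest()
        return CASES[int(digest, 16) % len(CASES)]

    # The chosen case is then passed to the agent as a parameter, e.g.
    # {"assigned_case": pick_case("kr888", "2025-12-19")}, so the prompt
    # never has to "pick randomly" on its own.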


Grading: the council deliberation actually worked

OK, so here is where things got interesting.

We graded using a "council of LLMs" approach, an idea we borrowed from Andrej Karpathy. Three models (Claude, Gemini, ChatGPT) assessed each transcript independently. Then they saw each other's assessments and revised. Finally, the chair (Claude) synthesized the final grade with evidence.
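
In rough Python, the deliberation loop looks like this. Here grade_with_model is a hypothetical wrapper around each provider's chat API; the actual grading prompt is linked in the resources section.

    MODELS = ["claude", "gemini", "openai"]

    def run_council(transcript, rubric, grade_with_model):
        """Two-round council grading.

        grade_with_model(model, transcript, rubric, peers) is a hypothetical
        wrapper around each provider's chat API; it returns a dict such as
        {"score": 14, "evidence": "..."}.
        """
        # Round 1: independent assessments, no peeking at each other.
        round1 = {m: grade_with_model(m, transcript, rubric, peers=[])
                  for m in MODELS}

        # Round 2: each model revises after seeing the others' scores and evidence.
        round2 = {m: grade_with_model(m, transcript, rubric,
                                      peers=[round1[o] for o in MODELS if o != m])
                  for m in MODELS}

        # The chair (Claude) synthesizes the final grade with supporting quotes.
        final = grade_with_model("claude", transcript, rubric,
                                 peers=list(round2.values()))
        return {"round1": round1, "round2": round2, "final": final}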

Round 1 was a mess. When the models graded independently, agreement was poor: 0% of grades matched exactly, and only 23% were within 2 points. The average maximum disagreement was nearly 4 points on a 20-point scale.

And here's the kicker: Gemini was a softie, averaging 17/20. Claude averaged 13.4/20. That's a 3.6-point gap, the difference between a B+ and a B-.

Meanwhile, Claude and OpenAI were already aligned: 70% of their grades were within 1 point of each other in Round 1.

Model     Round 1 Mean    Round 2 Mean    Change
Claude    13.4/20         13.9/20         +0.5
OpenAI    14.0/20         14.0/20         +0.0
Gemini    17.0/20         15.0/20         -2.0

Then came consultation. After each model saw the others' assessments and evidence, agreement improved dramatically:

Metric                 Round 1     Round 2     Improvement
Perfect agreement      0%          21%         +21 pp
Within 1 point         0%          62%         +62 pp
Within 2 points        23%         85%         +62 pp
Mean max difference    3.93 pts    1.41 pts    -2.52 pts
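
The agreement numbers above are straightforward to compute from the per-student grades; a small sketch, assuming a dict of the three models' scores per student:

    from statistics import mean

    def agreement_stats(grades):
        """grades maps student_id -> {model_name: score on the 20-point scale}."""
        spreads = [max(g.values()) - min(g.values()) for g in grades.values()]
        return {
            "perfect_agreement": mean(s == 0 for s in spreads),
            "within_1_point": mean(s <= 1 for s in spreads),
            "within_2_points": mean(s <= 2 for s in spreads),
            "mean_max_difference": mean(spreads),
        }

    # Example: agreement_stats({"s01": {"claude": 13, "gemini": 17, "openai": 14}})
    # -> perfect_agreement 0.0, within_1_point 0.0, mean_max_difference 4.0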

Gemini lowered its grades by an average of 2 points after seeing Claude's and OpenAI's more rigorous assessments. It couldn't justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.

[Figure: Grade convergence chart]

But here's what's interesting: the disagreement wasn't random. Problem Framing and Metrics had 100% agreement within 1 point. Experimentation? Only 57%.

Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.

The grading was stricter than my own default. That's not a bug. Students will be evaluated outside the university, and the world is not known for grade inflation.

The feedback was better than any human would produce. The system generated structured "strengths / weaknesses / actions" summaries with verbatim quotes from the transcript. Sample feedback from the highest scorer:

"Your understanding of metric trade-offs and Goodhart's Law risks was exceptional—the hot tub example perfectly illustrated how optimizing for one metric can corrupt another."

Sample from a B- student:

"Practice articulating complete A/B testing designs: state a hypothesis, define randomization unit, specify guardrail metrics, and establish decision criteria for shipping or rolling back."

Specific. Actionable. Tied to evidence. No human grader has the time to generate that for every student.


It diagnosed our teaching gaps

Ha! This one stung.

[Figure: Topic performance chart]

When we analyzed performance by topic, one bar stuck out like a sore thumb: Experimentation. Mean score: 1.94 out of 4. Compare that to Problem Framing at 3.39.

The breakdown was brutal:

  • 3 students (8%) scored 0—couldn't discuss it at all
  • 7 students (19%) scored 1—superficial understanding
  • 15 students (42%) scored 2—basic understanding
  • 0 students scored 4—no one demonstrated mastery

We had rushed through A/B testing methodology in class. The external grader made it impossible to ignore.

The grading output became a mirror reflecting our own weaknesses as instructors. Ooof.

Duration ≠ Quality

One finding that struck me as strangely fascinating: exam duration had zero correlation with score (r = -0.03). The shortest exam—9 minutes—got the highest score (19/20). The longest—64 minutes—scored 12/20.

Taking longer doesn't mean you know more. If anything, it signals struggling to articulate. Confidence is efficient.


Anti-cheating (or: trust but verify)

We asked students to record themselves while taking the exam (webcam + audio). This discourages blatantly outsourcing the conversation, having multiple people in the room, or having an LLM in voice mode whispering answers. It also gives us a backup record in case something goes really badly.

And here is an underrated benefit of this whole setup: the exam is powered by guidelines, not by secret questions. We can publish exactly how the exam works—the structure, the skills being tested, the types of questions. No surprises. The LLM will pick the specific questions live, and the student will have to handle them.

This reduces anxiety and pushes students toward actual preparation instead of guessing what the instructor "wants." And it eliminates the leaked-exam problem entirely. Practice all you want—it will only make you better prepared.


What the students said

We surveyed students before releasing grades to capture their experience. Some of the results:

  • Only 13% preferred the AI oral format; 57% wanted a traditional written exam; 83% found the oral format more stressful than a written one.
  • But here's the thing: 70% agreed it tested their actual understanding, the highest-rated item. They accepted the assessment but not the delivery.
  • At the same time, they almost universally liked the flexibility of taking the exam at their own place and time. Yes, many would have preferred a take-home exam instead, but that format is dead now.
  • The fix is clear: one question at a time, slower pacing, calmer tone. The concept works. The execution needs iteration.

[Figure: Student survey results]


Try it yourself

If you want to experiment with this approach, here are some resources:

  • Prompt for the voice agent
  • Prompt for the grading council
  • Link to try the voice agent (use Konstantinos as the name and kr888 as the net id to authenticate; the project was a "LinkedIn Recruiter, an agent that scans profiles and automatically sends personalized DMs to candidates on behalf of a recruiter. It engages in the first 3 turns of chat to answer basic questions (salary, location) before handing off to a human.")

What I would change next time

  1. Slower pacing and a calmer voice: We love you FakeFoster, but GenZ is not ready for you. Perhaps we will deploy FakePanos next time. Too bad ElevenLabs hasn't perfected thick accents yet to deliver a real Panos experience.
  2. RAG over student artifacts (slides, reports, notebooks). ElevenLabs supports this directly. If the agent can quote the student's own submission, the exam becomes much harder to game and much more diagnostically useful.
  3. Better case randomization with explicit seeding and tracking. Randomness that "feels random" is not enough. Pass explicit parameters.
  4. Audit triggers in grading. If the LLM committee disagrees beyond a threshold, flag for human review. The point of a committee is not to pretend the result is always certain; it is to surface uncertainty.
  5. Accessibility defaults. Offer practice runs, allow extra time, and provide alternatives when voice interaction creates unnecessary barriers.

The bigger point

Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression.

We need assessments that evolve towards formats that reward understanding, decision-making, and real-time reasoning. Oral exams used to be standard until they could not scale. Now, AI is making them scalable again.

And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.

Fight fire with fire.


Thanks to Brian Jabarian for the inspiration and for giving us confidence that these interviews would work, Foster Provost for lending his voice to create the FakeFoster agent (sorry, students found you intimidating!), and Andrej Karpathy for the council-of-LLMs idea.

Saturday, March 22, 2025

Training LLaMA using LibGen: Hack, a Theft, or Just Fair Use?

Imagine you're building a Large Language Model. You need data—lots of it. If you can find text data of high quality, vetted, truthful, and useful, it would be... great! So, naturally, you head online and find a treasure trove of books neatly indexed, conveniently downloadable, and completely free. The catch? You're looking at LibGen—one of the most infamous pirate libraries on the internet.

This isn't hypothetical. Recently, Meta made headlines for allegedly training their flagship LLM, LLaMA, on content from LibGen. But—can you even do that?

Let’s unpack the legal mess behind the scenes, step-by-step.

First: Is Using LibGen Even Legal?

Short answer: Absolutely not. Downloading copyrighted books from LibGen is textbook piracy. Think of it like grabbing a handful of snacks at the supermarket without paying—it's convenient but totally illegal.

Second: Does Training an AI Change the Equation?

Here’s where it gets fuzzy. In the U.S., you can claim "fair use"—the idea that some copying is permissible if you're transforming the original work into something new and valuable. (We covered this in an earlier blog post.)

Remember the Google Books case? Google scanned millions of books without permission. Authors sued, but courts sided with Google, citing fair use. The logic was that indexing books for search purposes created something valuable without substituting the original.

Consider another example: the Authors Guild v. HathiTrust case. Libraries scanned books to help visually impaired readers and enable text search. Courts also ruled this fair use, emphasizing the transformative nature and public benefit. However, both these cases involved legally acquired copies—not pirated ones.

So, could Meta’s training of LLaMA fall under the same umbrella? Possibly, yes, under the same fair-use theory. There is a subtle difference: Google used legally accessible copies (from libraries), while Meta reportedly took a different route. Legally speaking, when we talk about copyright and fair use in the US, the source of the copyrighted data does not directly affect the outcome. (Although it can affect the attitude of a jury or a judge if they believe that the defendant acted in bad faith.)

Third: What About the EU?

If you thought U.S. law was tricky, the EU adds another layer of complexity. They don’t have a broad "fair use" policy, but they've introduced exceptions specifically for Text and Data Mining (TDM). Good news for researchers and AI developers, right? Except there's a big "BUT": EU law explicitly requires lawful access. Pirate libraries like LibGen don't qualify.

In other words, in Europe, using LibGen isn't just risky—it's explicitly illegal.

Fourth: Is there a Legal Defense for using LibGen?

There is a very reasonable argument that training an AI is transformative—after all, an LLM doesn’t copy books; it learns from them. Consider also the LAION case from Germany. LAION, a nonprofit, scraped images from stock photo sites to train AI models. The court allowed it, but crucially because LAION had legitimate access and was a non-commercial entity. The outcome might differ sharply for a commercial giant sourcing pirated content.

There is also the counterargument from authors and publishers that LLMs themselves create (for competitive reasons) a market for licensing content, as the different LLM providers try to get access to exclusive, licensed content as a differentiating factor, in the same way that various streaming companies compete to get exclusive access to films, shows, and TV series. It is a bit of a circular argument (without free training of LLMs, can the LLMs get good enough to create a licensing market?), but we will have to wait for the courts to decide.

Fifth: What's the Risk Here?

For researchers at universities or small startups, casually using LibGen might seem harmless. The risks escalate quickly when you're a global company. Training on "presumed free" copyrighted data differs from "willful infringement"—the legal term for knowingly breaking copyright law. 

The fact that LLaMA is open source is a significant factor, as there is less of a direct profit motive, but when the trainer is a trillion-dollar company, the courts may behave differently. We will see...

After all, while pirates make great movie characters, they're generally less popular in courtrooms.

Monday, February 24, 2025

Copyright, Fair Use, and AI Training

[We tested the o1-pro model to give us a detailed analysis of the legal landscape around copyright and the use of copyrighted materials to train LLMs. The full discussion is available here. Below you will find a quick attempt to summarize the (much) longer report by o1-pro.]

What is Copyright? (And Why Should You Care?)

Imagine you spend months writing a book, composing a song, or designing a killer app—wouldn't you want some protection to stop someone from copying it and making money off your hard work? That’s where copyright steps in! It grants the copyright holder exclusive rights to reproduce, distribute, and display their work. However, copyright isn’t an all-powerful lock—there are important exceptions, like fair use, that allow for some unlicensed use, especially when it benefits society.

Copyright laws are all about balance. Too much restriction, and we block innovation and education. Too little, and creators lose their incentive to make new things. Governments step in to help find that sweet spot—protecting creators' rights while making sure knowledge, art, and innovation stay accessible.

The Fair Use Doctrine: When Borrowing is (Sometimes) Okay

Fair use is like the ultimate legal “it depends” clause in copyright law. It allows limited use of copyrighted materials without permission—whether for education, commentary, parody, or research. But how do you know if something qualifies as fair use? Courts consider these four big factors:

  1. Purpose and Character of the Use – Is the use transformative? Does it add new meaning or context? And is it for commercial gain or educational purposes?
  2. Nature of the Copyrighted Work – Is the original work factual (easier to use under fair use) or highly creative (harder to justify copying)?
  3. Amount and Substantiality – How much of the original is used, and is it the “heart” of the work?
  4. Effect on the Market – Does this use harm the copyright holder’s ability to profit from their work?

What Do Past Cases Tell Us About Fair Use?

Google Books Case (Authors Guild v. Google, 2015): Google scanned millions of books to make them searchable, showing only small snippets of text. The Second Circuit ruled this was fair use because:
  • It was highly transformative—it helped people find books rather than replacing them.
  • The snippets were not a market substitute—nobody was reading full books this way.
  • Instead of harming book sales, it actually helped readers find books to purchase.

Google Search Indexing (Perfect 10 v. Google, 2007): Google’s image search displayed thumbnail previews linking to full-size images. The Ninth Circuit ruled this was fair use because:
  • It served a different function—helping users find images, not replacing the originals.
  • Any market harm was speculative—there was no proof Google’s thumbnails hurt sales.

LinkedIn Scraping Case (hiQ Labs v. LinkedIn, 2019): hiQ Labs scraped publicly available LinkedIn profiles to analyze workforce data. LinkedIn sued, claiming this violated its terms of service. The Ninth Circuit ruled that scraping publicly accessible data wasn’t illegal under the Computer Fraud and Abuse Act (CFAA), but the case raised bigger questions about data ownership and fair use. This case matters for AI because it highlights the legal gray area of using publicly available content for AI training—does scraping data for machine learning function like search indexing (which courts favor) or unfairly compete with content creators?

When Courts Say “Nope” to Fair Use

When a copied work competes directly with the original, courts usually rule against fair use:

  • Texaco Case (American Geophysical Union v. Texaco, 1994) – Texaco photocopied journal articles for internal research. The court ruled this wasn’t fair use because Texaco could’ve just bought the licenses, and widespread copying threatened the scientific journal market.
  • Meltwater Case (Associated Press v. Meltwater) – Meltwater, a news aggregation service, copied AP excerpts. The court ruled this wasn’t fair use because it replaced a licensable market for news monitoring services.

How Does This Apply to AI Training?

AI models like ChatGPT train on huge datasets, including copyrighted text. Courts will likely analyze this under fair use principles by asking:

  • Is AI training transformative? AI companies argue that their models learn patterns rather than copying content. This mirrors Google Books, where scanning books for search indexing was deemed transformative.
  • Does AI-generated text replace the original? If AI can generate news summaries or books, it might compete with the markets for journalism, books, or educational content—similar to Meltwater replacing a paid service.
  • Is there a licensing market? If publishers and authors start licensing data for AI training, unlicensed use could be seen as market harm—like in Texaco, where academic publishers had a functioning licensing system.

The outcome of ongoing lawsuits will determine how courts see AI’s role in the content economy. If AI models start functioning as substitutes for original content, expect stricter copyright enforcement. If they’re seen as research tools, fair use might hold up.

Industry-Specific Market Harm Considerations

  1. News & Journalism – AI-generated summaries may reduce clicks on original articles, hurting ad revenue and subscriptions (New York Times v. OpenAI argues AI responses replace direct readership).
  2. Book Publishing – Authors claim AI-generated text could compete with traditional books and summaries (Authors Guild v. OpenAI argues AI models reduce demand for original works).
  3. Education & Academic Publishing – AI-generated study materials could cut into textbook sales (Pearson v. OpenAI claims AI-generated content could replace traditional textbooks).
  4. Creative Writing & Film – AI-generated scripts or novels could impact demand for human writers (Writers Guild v. OpenAI and Martin v. OpenAI argue AI mimicking authors threatens their markets).

The Future of AI and Copyright Law

Current lawsuits (New York Times v. OpenAI, Authors Guild v. OpenAI) will set precedents for AI copyright law. Possible outcomes include:

  • AI training as fair use – If courts find AI models transformative and non-substitutive.
  • AI training as infringement – If courts rule that it undermines a viable licensing market.
  • New licensing systems – Like how music royalties work, AI companies may have to pay creators.

Wrapping It Up

So, what’s the big takeaway? AI and copyright law are in a messy, ongoing battle. Will AI companies get a free pass under fair use, or will copyright holders demand licensing fees? We don’t know yet, but these decisions will shape the future of AI.

My bet? AI companies will create new markets where content creators can contribute and get paid—like YouTube does for video creators. Instead of just scraping data, AI firms will likely find ways to reward quality content, making it a win-win for tech and creatives alike.

Friday, September 6, 2024

Developing Grading Rubrics using Docent

When I explain the concept of Docent, a common first question I hear is: "If AI grades assignments, can't students just use AI to do their homework?" They imagine a scenario where professors create assignments with AI, students complete them with AI, and graders assess them with AI as well.

But, let me clear that up—Docent doesn't work like that.

While we were crafting Docent, we figured out we really needed two things to make AI grading effective:
  1. A "gold answer," which is basically the perfect solution to the assignment.
  2. A "grading rubric," which is a guide on how to deduct points for mistakes.
The "gold answer" is our way of ensuring that even if an AI is doing the grading, students can't just whip up another AI to spit out the right answers. (For the future, we're considering adding a feature where Docent can tell if an assignment is easily solvable by an LLM, without having access to the gold answer.)

Now, developing a comprehensive grading rubric is a bit trickier. It's hard to anticipate all the ways students might slip up. In an "AI-less" setting, we usually end up tweaking the rubric gradually, based on what we see after the assignment has been run a few times.

How can an LLM make our life easier? Docent is great when it comes to building these rubrics. Since it can grade hundreds of assignments at once, we quickly spot the common mistakes by simply asking Docent to grade everything and flag the errors it finds. We review the identified mistakes and add them to the rubric. After adjusting the rubric, we ask Docent for a re-grade, and voila! After a few rounds, we end up with a solid rubric that catches most errors.
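
A sketch of that loop in Python, where docent_grade is a hypothetical stand-in for the actual Docent grading call:

    def refine_rubric(submissions, gold_answer, rubric, docent_grade, rounds=3):
        """Iteratively grow the rubric from the mistakes Docent surfaces.

        docent_grade(submission, gold_answer, rubric) is a hypothetical wrapper
        around the Docent grading call; it returns something like
        {"score": 85, "mistakes": ["off-by-one in the GROUP BY", ...]}.
        """
        for _ in range(rounds):
            mistakes = []
            for sub in submissions:
                mistakes.extend(docent_grade(sub, gold_answer, rubric)["mistakes"])

            # Add mistakes the rubric does not cover yet (in practice we review
            # them by hand before turning them into deduction rules).
            new_items = sorted(set(mistakes) - set(rubric))
            if not new_items:
                break  # the rubric has converged
            rubric = rubric + new_items

        return rubric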

One additional cool thing about this whole process? Docent can summarize feedback from all the submissions and we can create a report on the most common slip-ups. We take this back to the classroom to chat about the tricky parts of the assignment and help everyone learn better.

It's like having a super-hard-working assistant who may not know how to grade at the beginning but is always willing and eager to help. They never complain if you ask them to regrade assignments, summarize findings, or provide feedback. 

Use Docent, be lazy, and teach smarter, not harder!

Thursday, September 5, 2024

Grading with AI: Introducing Docent

TL;DR

An alpha version of Docent, our experimental AI-powered grading system, is now available at https://get-docent.com/. If you're interested in using the system, please contact us for support.

The Challenge of Grading

One thing that I find challenging when teaching is grading, especially in large classes with numerous assignments. The task is typically delegated to teaching assistants with varying levels of expertise and enthusiasm. One particular challenge is getting TAs to provide detailed, constructive feedback on assignments.

Our Experiment with LLMs

With the introduction of LLMs, we began exploring their potential to enhance the grading process. Our primary goal wasn't to replace human graders but to provide students with detailed, personalized feedback—effectively offering an on-demand tutor and addressing "Bloom's two-sigma problem":

"The average student tutored one-to-one using mastery learning techniques performed two standard deviations better than students educated in a classroom environment."

To evaluate the effectiveness of LLMs in grading, we used a dataset of 12,546 student submissions from a Business Analytics course spanning six academic semesters. We used human-assigned grades as our benchmark.

Good Quantitative Results

Our findings revealed a remarkably low discrepancy between LLM-assigned and human grades. We tested various LLMs using different approaches:

  • With and without fine-tuning
  • Zero-shot and few-shot learning

While fine-tuning and few-shot approaches showed slight improvements, we were amazed to find that GPT-4 with zero-shot learning achieved a median error of just 0.6% compared to human grading. In practical terms, if a human grader assigned 80/100 to an assignment, the LLM's grade typically fell within the 79.5-80.5 range—a striking consistency with human grading.
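
For concreteness, a sketch of the comparison behind these numbers, assuming per-submission LLM and human grades on a 0-100 scale:

    from statistics import median

    def median_relative_error(llm_grades, human_grades):
        """Median of |LLM - human| / human, as a percentage (human grades > 0)."""
        return 100 * median(abs(l - h) / h
                            for l, h in zip(llm_grades, human_grades))

    # Example: a human grade of 80/100 with an LLM grade of 79.6 is a 0.5% error,
    # consistent with the 79.5-80.5 band described above.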

Qualitative Feedback: Where AI Shines

LLMs excel at providing qualitative feedback. For example, in this ChatGPT thread, you can see the detailed feedback the LLM provided for an SQL question in a database course. Much better and more detailed than what any human grader would realistically ever provide.

Real-World Implementation: Docent

Encouraged by these results, we implemented Docent to assist human graders in our Spring and Summer 2024 classes. We also conducted a user study to assess the perceived helpfulness of LLM-generated comments. However, during deployment, we identified several areas for improvement:

  1. Excessive Feedback: The LLM often provides too much feedback, striving to find issues even in near-perfect assignments. 
  2. Difficulty with Negation: Despite clear grading guidelines, LLMs struggle to ignore specified minor shortcomings. See below :-) 


  3. Multi-Part Assignment Challenges: For assignments with multiple questions, grading each question separately yields better results than assessing the entire assignment at once.
  4. Inconsistent Performance: While median performance is excellent, about 5-10% of assignments receive imperfect grades (compared to a human), leading to student appeals.

Current Status and Recommendations

Based on our experiences, here are our current recommendations for using AI in grading:

  1. Human Supervised Use: Grading using LLMs is best used as a tool for teaching assistants, who should review and adjust the AI-generated grades and feedback before releasing them to students.
  2. Caution in High-Stakes Scenarios: We advise against using AI for high-stakes grading, such as final exams, until we achieve greater robustness across all submissions.
  3. Ideal for Low-Stakes Assignments: LLM-based feedback is well-suited for low-stakes assignments and practice questions, where even imperfect feedback improves the current status quo.

Try Docent

To facilitate experimentation with AI-assisted grading, we've deployed an alpha version of Docent at https://get-docent.com/. If you're interested in using the system, please contact us for support and guidance.