Monday, December 29, 2025

Fighting Fire with Fire: Scalable Oral Exams with an ElevenLabs Voice AI Agent

It all started with cold calling.

In our new "AI/ML Product Management" class, the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing" good.

So we started cold calling students randomly during class.

The result was... illuminating. Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all. This gap was too consistent to blame on nerves or bad luck. If you cannot defend your own work live, then the written artifact is not measuring what you think it is measuring.

Brian Jabarian has been doing interesting work on this problem, and his results both inspired us and gave us the confidence to try something that would have sounded absurd two years ago: running the final exam with a Voice AI agent.


Why oral exams? And why now?

The core problem is simple: students have immediate access to LLMs that can handle most exam questions we traditionally use for assessment. The old equilibrium, where take-home work could reliably measure understanding, is dead. Gone. Kaput.

Oral exams are a natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. The problem? Oral exams are a logistical nightmare. You cannot run them for a large class without turning the final exam period into a month-long hostage situation.

Unless you cheat.


Enter the Voice Agent

We used ElevenLabs Conversational AI to build the examiner. The platform bundles the messy parts (speech-to-text, text-to-speech, turn-taking, interruption handling, …) into something usable. And here is the thing that surprised me: a basic version for a low-stakes setting (e.g., an assignment) can be up and running in literally minutes. Minutes. Just write a prompt describing what the agent should ask the student, and you are done.

Two features mattered a lot for our setup:

  • Dynamic variables: pass the student's name, project details, and other per-student context into the conversation as parameters (see the sketch after this list)
  • Workflows: build a structured flow with sub-agents instead of a single "chatty" agent trying to do everything
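
To make the dynamic-variables piece concrete, here is a minimal Python sketch of how the per-student context could be assembled. The roster file, its column names, and the placeholder syntax in the comment are illustrative assumptions, not our actual setup; the real prompt and agent configuration are linked at the end of the post.

    import csv

    def dynamic_variables_for(net_id, roster_path="roster.csv"):
        """Build the per-student payload injected into the conversation.

        Assumes a hypothetical roster.csv with columns net_id, name,
        project_title, project_summary; the returned dict is what gets passed
        to the agent as dynamic variables when the conversation is created.
        """
        with open(roster_path, newline="") as f:
            for row in csv.DictReader(f):
                if row["net_id"] == net_id:
                    return {
                        "student_name": row["name"],
                        "net_id": row["net_id"],
                        "project_title": row["project_title"],
                        "project_summary": row["project_summary"],
                    }
        raise KeyError(f"No roster entry for net id {net_id!r}")

    # The prompt then references these values by name as placeholders,
    # e.g. "You are examining {{student_name}} about {{project_title}}".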

What the exam looked like

We ran a two-part oral exam.

Part 1: "Talk me through your project." The agent asks about the student's capstone project: goals, data, modeling choices, evaluation, failure modes. This is where the "LLM did my homework" strategy dies. You can paste an assignment into ChatGPT. It is much harder to improvise consistent answers about specific decisions when someone is drilling into details.

Part 2: "Now do a case." The agent picks one of the cases we discussed in class and asks questions spanning the topics we covered: basically testing whether students absorbed the material or just showed up.

To handle this structure, we split the exam into sub-agents in a workflow:

  1. Authentication agent: Asks for the student's ID and refuses to proceed without a valid one. (In a more productized version, we would integrate with NYU SSO instead of checking against a list.)
  2. Project discussion agent: Gets project context injected via parameters. The prompt includes details of each project so the agent can ask informed questions. The next step is obvious: connect retrieval over the student's submitted slides and reports so the agent can quote and probe precisely.
  3. Case discussion agent: Selects a case and runs structured questioning. Again, RAG would help with richer case details.

This "many small agents" approach is not just aesthetic. It prevents the system from drifting into unbounded conversation, and it makes debugging possible.


By the Numbers

  • 36 students examined over 9 days
  • 25 minutes average (range: 9–64)
  • 65 messages per conversation on average
  • 0.42 USD per student (15 USD total)
  • 89% of LLM grades within 1 point
  • Shortest exam (9 min) → highest score (19/20)

The economics

Let's talk money.

Total cost for 36 students: 15 USD.

That's 8 USD for Claude (the chair and heaviest grader), 2 USD for Gemini, 0.30 USD for OpenAI, and roughly 5 USD for ElevenLabs voice minutes. Forty-two cents per student.

The alternative? 36 students × 25-minute exam × 2 graders = 30 hours of human time. At TA rates (~25 USD/hour), that's 750 USD. At faculty rates, it's "we don't do oral exams because they don't scale."

For 15 dollars, we got: real-time oral examination, a three-model grading council with deliberation, structured feedback with verbatim quotes, a complete audit trail, and—as you'll see—a diagnosis of our own teaching gaps.

The unit economics work. But as we will see next, the real benefit is the value delivered, not the 50x cost savings.


What broke (and how we fixed it)

The first version had problems. Here is what we learned.

1) The voice was intimidating

A few students complained that the agent sounded severe. We had cloned Foster Provost's voice because, frankly, his clone was much more accurate than the clones of our own voices. But the students found it... intense. Here is an email from a student:

I had prepared thoroughly and felt confident in my understanding of the material, but the intensity of the interviewer's voice during the exam unexpectedly heightened my anxiety and affected my performance. The experience was more triggering than I anticipated, which made it difficult to fully demonstrate my knowledge. Throughout the course, I have actively participated and engaged with the material, and I had hoped to better demonstrate my knowledge in this interview.

And here is another:

Just got done with my oral exam. [...] I honestly didn't feel comfortable with it at all. The voice you picked was so condescending that it actually dropped my confidence. [...] I don't know why but the agent was shouting at me.

Fix: We are split on this one. We love FakeFoster. But next time we will A/B test other voices. At the end of the day, we want to optimize for comprehension, not charisma. ElevenLabs has guidance on voice and personality tuning; they treat this as a product design problem, which is probably the right way to think about it.

2) The agent stacked questions

This was the biggest real issue. The agent would ask something like: "Explain your metric choice, and also tell me what baselines you tried, and why you did not use X, and what you would do next."

That is not one question. That is four questions wearing a trench coat. The cognitive load for an oral exam is already high. Stacking questions makes it brutal.

Fix: A hard rule in the prompt: one question at a time. If you want multi-part probing, chain it across turns. For grading, we added an "interference protocol": students received full credit if they were hit with a stacked question and answered only part of it.

3) Clarifications became moving targets

Student: "Can you repeat the question?"
Agent: paraphrases the question in a subtly different way

Now the student is solving a different problem than the one they were asked. Very frustrating.

Fix: Explicit instruction in the prompt: repeat verbatim when asked to repeat. No paraphrasing. Same words.

4) The agent did not let students think

Humans rush to fill silence. Agents do too. Students would pause to think, and the agent would jump in with follow-up probes or, worse, interpret the silence as confusion and move on.

Fix: Tell the agent to allow think-time without probing aggressively. It made the exam feel less like an interrogation. We also increased the time-out before the agent asks "Are you there?" from 5 to 10 seconds.
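
Fixes 2 through 4 all live in the prompt. The lines below are an illustrative paraphrase of what those guardrails look like, not the actual prompt (which is linked in the resources at the end of the post).

    # Illustrative paraphrase of the examiner guardrails.
    EXAMINER_RULES = """
    - Ask exactly ONE question per turn. If you need a multi-part probe,
      spread it across separate turns.
    - If the student asks you to repeat a question, repeat it VERBATIM.
      Do not rephrase, shorten, or add hints.
    - After asking a question, wait. Silence usually means the student is
      thinking; do not treat a pause as confusion, and do not pile on
      follow-up probes.
    """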

5) Lack of randomization

We asked the agent to "randomly select" a case study. It did not.

From December 12–18, when Zillow was in the case list, the agent picked Zillow 88% of the time. After we removed Zillow from the prompt on December 18, the agent immediately latched onto Predictive Policing—picking it for 16 out of 21 exams on December 19 alone.

LLMs are not random. They have implicit preferences and ordering biases. Asking an LLM to "pick randomly" is like asking a human to "think of a number between 1 and 10"—you're going to get a lot of 7s.

Fix: Pass an explicit random number as a parameter and map it to cases deterministically. Do the randomization in code, not in the prompt.
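
A minimal sketch of that fix in Python. The case names beyond Zillow and Predictive Policing are placeholders, and the hashing scheme is just one reasonable way to get a stable, auditable assignment.

    import hashlib

    # Only Zillow and Predictive Policing appear in this post; the other
    # entries are placeholders for the rest of the case list.
    CASES = ["Zillow", "Predictive Policing", "Case 3", "Case 4"]

    def pick_case(net_id, exam_date):
        """Deterministically map student + date to a case; reproducible and auditable."""
        digest = hashlib.sha256(f"{net_id}:{exam_date}".encode()).hexdigest()
        return CASES[int(digest, 16) % len(CASES)]

    # The chosen case is then passed to the agent as a parameter, e.g.
    # {"assigned_case": pick_case("kr888", "2025-12-19")}, so the prompt
    # never has to "pick randomly" on its own.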


Grading: the council deliberation actually worked

OK, so here is where things got interesting.

We graded using a "council of LLMs" approach, an idea we borrowed from Andrej Karpathy. Three models (Claude, Gemini, ChatGPT) assessed each transcript independently. Then they saw each other's assessments and revised. Finally, the chair (Claude) synthesized the final grade with evidence.
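
In rough Python, the deliberation loop looks like this. Here grade_with_model is a hypothetical wrapper around each provider's chat API; the actual grading prompt is linked in the resources section.

    MODELS = ["claude", "gemini", "openai"]

    def run_council(transcript, rubric, grade_with_model):
        """Two-round council grading.

        grade_with_model(model, transcript, rubric, peers) is a hypothetical
        wrapper around each provider's chat API; it returns a dict such as
        {"score": 14, "evidence": "..."}.
        """
        # Round 1: independent assessments, no peeking at each other.
        round1 = {m: grade_with_model(m, transcript, rubric, peers=[])
                  for m in MODELS}

        # Round 2: each model revises after seeing the others' scores and evidence.
        round2 = {m: grade_with_model(m, transcript, rubric,
                                      peers=[round1[o] for o in MODELS if o != m])
                  for m in MODELS}

        # The chair (Claude) synthesizes the final grade with supporting quotes.
        final = grade_with_model("claude", transcript, rubric,
                                 peers=list(round2.values()))
        return {"round1": round1, "round2": round2, "final": final}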

Round 1 was a mess. When the models graded independently, agreement was poor: 0% of grades matched exactly, and only 23% were within 2 points. The average maximum disagreement was nearly 4 points on a 20-point scale.

And here's the kicker: Gemini was a softie, averaging 17/20. Claude averaged 13.4/20. That's a 3.6-point gap, the difference between a B+ and a B-.

Meanwhile, Claude and OpenAI were already aligned: 70% of their grades were within 1 point of each other in Round 1.

Model     Round 1 Mean    Round 2 Mean    Change
Claude    13.4/20         13.9/20         +0.5
OpenAI    14.0/20         14.0/20         +0.0
Gemini    17.0/20         15.0/20         -2.0

Then came consultation. After each model saw the others' assessments and evidence, agreement improved dramatically:

Metric                 Round 1     Round 2     Improvement
Perfect agreement      0%          21%         +21 pp
Within 1 point         0%          62%         +62 pp
Within 2 points        23%         85%         +62 pp
Mean max difference    3.93 pts    1.41 pts    -2.52 pts
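
The agreement numbers above are straightforward to compute from the per-student grades; a small sketch, assuming a dict of the three models' scores per student:

    from statistics import mean

    def agreement_stats(grades):
        """grades maps student_id -> {model_name: score on the 20-point scale}."""
        spreads = [max(g.values()) - min(g.values()) for g in grades.values()]
        return {
            "perfect_agreement": mean(s == 0 for s in spreads),
            "within_1_point": mean(s <= 1 for s in spreads),
            "within_2_points": mean(s <= 2 for s in spreads),
            "mean_max_difference": mean(spreads),
        }

    # Example: agreement_stats({"s01": {"claude": 13, "gemini": 17, "openai": 14}})
    # -> perfect_agreement 0.0, within_1_point 0.0, mean_max_difference 4.0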

Gemini lowered its grades by an average of 2 points after seeing Claude's and OpenAI's more rigorous assessments. It couldn't justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.

[Figure: Grade convergence chart]

But here's what's interesting: the disagreement wasn't random. Problem Framing and Metrics had 100% agreement within 1 point. Experimentation? Only 57%.

Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.

The grading was stricter than my own default. That's not a bug. Students will be evaluated outside the university, and the world is not known for grade inflation.

The feedback was better than any human would produce. The system generated structured "strengths / weaknesses / actions" summaries with verbatim quotes from the transcript. Sample feedback from the highest scorer:

"Your understanding of metric trade-offs and Goodhart's Law risks was exceptional—the hot tub example perfectly illustrated how optimizing for one metric can corrupt another."

Sample from a B- student:

"Practice articulating complete A/B testing designs: state a hypothesis, define randomization unit, specify guardrail metrics, and establish decision criteria for shipping or rolling back."

Specific. Actionable. Tied to evidence. No human grader has the time to generate that for every student.


It diagnosed our teaching gaps

Ha! This one stung.

[Figure: Topic performance chart]

When we analyzed performance by topic, one bar stuck out like a sore thumb: Experimentation. Mean score: 1.94 out of 4. Compare that to Problem Framing at 3.39.

The breakdown was brutal:

  • 3 students (8%) scored 0—couldn't discuss it at all
  • 7 students (19%) scored 1—superficial understanding
  • 15 students (42%) scored 2—basic understanding
  • 0 students scored 4—no one demonstrated mastery

We had rushed through A/B testing methodology in class. The external grader made it impossible to ignore.

The grading output became a mirror reflecting our own weaknesses as instructors. Ooof.

Duration ≠ Quality

One finding that struck me as strangely fascinating: exam duration had zero correlation with score (r = -0.03). The shortest exam—9 minutes—got the highest score (19/20). The longest—64 minutes—scored 12/20.

Taking longer doesn't mean you know more. If anything, it signals struggling to articulate. Confidence is efficient.


Anti-cheating (or: trust but verify)

We asked students to record themselves while taking the exam (webcam + audio). This discourages blatantly outsourcing the conversation, having multiple people in the room, or having an LLM in voice mode whispering answers. It also gives us a backup record in case something goes really badly.

And here is an underrated benefit of this whole setup: the exam is powered by guidelines, not by secret questions. We can publish exactly how the exam works—the structure, the skills being tested, the types of questions. No surprises. The LLM will pick the specific questions live, and the student will have to handle them.

This reduces anxiety and pushes students toward actual preparation instead of guessing what the instructor "wants." And it eliminates the leaked-exam problem entirely. Practice all you want—it will only make you better prepared.


What the students said

We surveyed students before releasing grades to capture their experience. Some of the results:

  • Only 13% preferred the AI oral format; 57% wanted a traditional written exam; 83% found the oral format more stressful than a written one.
  • But here's the thing: 70% agreed it tested their actual understanding, the highest-rated item. They accepted the assessment but not the delivery.
  • At the same time, they almost universally liked the flexibility of taking the exam at their own place and time. Yes, many would have preferred a take-home exam instead, but that format is dead now.
  • The fix is clear: one question at a time, slower pacing, calmer tone. The concept works. The execution needs iteration.

[Figure: Student survey results]


Try it yourself

If you want to experiment with this approach, here are some resources:

  • Prompt for the voice agent
  • Prompt for the grading council
  • Link to try the voice agent (use Konstantinos as the name and kr888 as the net id to authenticate; the project was a "LinkedIn Recruiter, an agent that scans profiles and automatically sends personalized DMs to candidates on behalf of a recruiter. It engages in the first 3 turns of chat to answer basic questions (salary, location) before handing off to a human.")

What I would change next time

  1. Slower pacing and a calmer voice: We love you FakeFoster, but GenZ is not ready for you. Perhaps we will deploy FakePanos next time. Too bad ElevenLabs hasn't perfected thick accents yet to deliver a real Panos experience.
  2. RAG over student artifacts (slides, reports, notebooks). ElevenLabs supports this directly. If the agent can quote the student's own submission, the exam becomes much harder to game and much more diagnostically useful.
  3. Better case randomization with explicit seeding and tracking. Randomness that "feels random" is not enough. Pass explicit parameters.
  4. Audit triggers in grading. If the LLM committee disagrees beyond a threshold, flag for human review. The point of a committee is not to pretend the result is always certain; it is to surface uncertainty.
  5. Accessibility defaults. Offer practice runs, allow extra time, and provide alternatives when voice interaction creates unnecessary barriers.

The bigger point

Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression.

We need assessments that evolve towards formats that reward understanding, decision-making, and real-time reasoning. Oral exams used to be standard until they could not scale. Now, AI is making them scalable again.

And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.

Fight fire with fire.


Thanks to Brian Jabarian for the inspiration and for giving us confidence that these interviews would work, Foster Provost for lending his voice to create the FakeFoster agent (sorry, students found you intimidating!), and Andrej Karpathy for the council-of-LLMs idea.

Saturday, March 22, 2025

Training LLaMA using LibGen: Hack, a Theft, or Just Fair Use?

Imagine you're building a Large Language Model. You need data—lots of it. If you can find text data of high quality, vetted, truthful, and useful, it would be... great! So, naturally, you head online and find a treasure trove of books neatly indexed, conveniently downloadable, and completely free. The catch? You're looking at LibGen—one of the most infamous pirate libraries on the internet.

This isn't hypothetical. Recently, Meta made headlines for allegedly training their flagship LLM, LLaMA, on content from LibGen. But—can you even do that?

Let’s unpack the legal mess behind the scenes, step-by-step.

First: Is Using LibGen Even Legal?

Short answer: Absolutely not. Downloading copyrighted books from LibGen is textbook piracy. Think of it like grabbing a handful of snacks at the supermarket without paying—it's convenient but totally illegal.

Second: Does Training an AI Change the Equation?

Here’s where it gets fuzzy. In the U.S., you can claim "fair use"—the idea that some copying is permissible if you're transforming the original work into something new and valuable. (We covered this in an earlier blog post.)

Remember the Google Books case? Google scanned millions of books without permission. Authors sued, but courts sided with Google, citing fair use. The logic was that indexing books for search purposes created something valuable without substituting the original.

Consider another example: the Authors Guild v. HathiTrust case. Libraries scanned books to help visually impaired readers and enable text search. Courts also ruled this fair use, emphasizing the transformative nature and public benefit. However, both these cases involved legally acquired copies—not pirated ones.

So, could Meta’s training of LLaMA fall under the same umbrella? Possibly, yes, under the same fair-use theory. There is a subtle difference: Google used legally accessible copies (from libraries), while Meta reportedly took a different route. Legally speaking, when we talk about copyright and fair use in the US, the source of the copyrighted data does not directly affect the outcome. (Although it can affect the attitude of a jury or a judge if they believe that the defendant acted in bad faith.)

Third: What About the EU?

If you thought U.S. law was tricky, the EU adds another layer of complexity. They don’t have a broad "fair use" policy, but they've introduced exceptions specifically for Text and Data Mining (TDM). Good news for researchers and AI developers, right? Except there's a big "BUT": EU law explicitly requires lawful access. Pirate libraries like LibGen don't qualify.

In other words, in Europe, using LibGen isn't just risky—it's explicitly illegal.

Fourth: Is there a Legal Defense for using LibGen?

There is a very reasonable argument that training an AI is transformative—after all, an LLM doesn’t copy books; it learns from them. Consider also the LAION case from Germany. LAION, a nonprofit, scraped images from stock photo sites to train AI models. The court allowed it, but crucially because LAION had legitimate access and was a non-commercial entity. The outcome might differ sharply for a commercial giant sourcing pirated content.

There is also the counterargument from authors and publishers that LLMs themselves create (for competitive reasons) a market for licensing content, as the different LLM providers try to get access to exclusive, licensed content as a differentiating factor, in the same way that various streaming companies compete to get exclusive access to films, shows, and TV series. It is a bit of a circular argument (without free training of LLMs, can the LLMs get good enough to create a licensing market?), but we will have to wait for the courts to decide.

Fifth: What's the Risk Here?

For researchers at universities or small startups, casually using LibGen might seem harmless. The risks escalate quickly when you're a global company. Training on "presumed free" copyrighted data differs from "willful infringement"—the legal term for knowingly breaking copyright law. 

The fact that LLaMA is open source is a significant factor, as there is less of a direct profit motive, but when the trainer is a trillion-dollar company, the courts may behave differently. We will see...

After all, while pirates make great movie characters, they're generally less popular in courtrooms.

Monday, February 24, 2025

Copyright, Fair Use, and AI Training

[We tested the o1-pro model to give us a detailed analysis of the legal landscape around copyright and the use of copyrighted materials to train LLMs. The full discussion is available here. Below you will find a quick attempt to summarize the (much) longer report by o1-pro.]

What is Copyright? (And Why Should You Care?)

Imagine you spend months writing a book, composing a song, or designing a killer app—wouldn't you want some protection to stop someone from copying it and making money off your hard work? That’s where copyright steps in! It grants the copyright holder exclusive rights to reproduce, distribute, and display their work. However, copyright isn’t an all-powerful lock—there are important exceptions, like fair use, that allow for some unlicensed use, especially when it benefits society.

Copyright laws are all about balance. Too much restriction, and we block innovation and education. Too little, and creators lose their incentive to make new things. Governments step in to help find that sweet spot—protecting creators' rights while making sure knowledge, art, and innovation stay accessible.

The Fair Use Doctrine: When Borrowing is (Sometimes) Okay

Fair use is like the ultimate legal “it depends” clause in copyright law. It allows limited use of copyrighted materials without permission—whether for education, commentary, parody, or research. But how do you know if something qualifies as fair use? Courts consider these four big factors:

  1. Purpose and Character of the Use – Is the use transformative? Does it add new meaning or context? And is it for commercial gain or educational purposes?
  2. Nature of the Copyrighted Work – Is the original work factual (easier to use under fair use) or highly creative (harder to justify copying)?
  3. Amount and Substantiality – How much of the original is used, and is it the “heart” of the work?
  4. Effect on the Market – Does this use harm the copyright holder’s ability to profit from their work?

What Do Past Cases Tell Us About Fair Use?

Google Books Case (Authors Guild v. Google, 2015): Google scanned millions of books to make them searchable, showing only small snippets of text. The Second Circuit ruled this was fair use because:
  • It was highly transformative—it helped people find books rather than replacing them.
  • The snippets were not a market substitute—nobody was reading full books this way.
  • Instead of harming book sales, it actually helped readers find books to purchase.

Google Search Indexing (Perfect 10 v. Google, 2007): Google’s image search displayed thumbnail previews linking to full-size images. The Ninth Circuit ruled this was fair use because:
  • It served a different function—helping users find images, not replacing the originals.
  • Any market harm was speculative—there was no proof Google’s thumbnails hurt sales.

LinkedIn Scraping Case (hiQ Labs v. LinkedIn, 2019): hiQ Labs scraped publicly available LinkedIn profiles to analyze workforce data. LinkedIn sued, claiming this violated its terms of service. The Ninth Circuit ruled that scraping publicly accessible data wasn’t illegal under the Computer Fraud and Abuse Act (CFAA), but the case raised bigger questions about data ownership and fair use. This case matters for AI because it highlights the legal gray area of using publicly available content for AI training—does scraping data for machine learning function like search indexing (which courts favor) or unfairly compete with content creators?

When Courts Say “Nope” to Fair Use

When a copied work competes directly with the original, courts usually rule against fair use:

  • Texaco Case (American Geophysical Union v. Texaco, 1994) – Texaco photocopied journal articles for internal research. The court ruled this wasn’t fair use because Texaco could’ve just bought the licenses, and widespread copying threatened the scientific journal market.
  • Meltwater Case (Associated Press v. Meltwater) – Meltwater, a news aggregation service, copied AP excerpts. The court ruled this wasn’t fair use because it replaced a licensable market for news monitoring services.

How Does This Apply to AI Training?

AI models like ChatGPT train on huge datasets, including copyrighted text. Courts will likely analyze this under fair use principles by asking:

  • Is AI training transformative? AI companies argue that their models learn patterns rather than copying content. This mirrors Google Books, where scanning books for search indexing was deemed transformative.
  • Does AI-generated text replace the original? If AI can generate news summaries or books, it might compete with the markets for journalism, books, or educational content—similar to Meltwater replacing a paid service.
  • Is there a licensing market? If publishers and authors start licensing data for AI training, unlicensed use could be seen as market harm—like in Texaco, where academic publishers had a functioning licensing system.

The outcome of ongoing lawsuits will determine how courts see AI’s role in the content economy. If AI models start functioning as substitutes for original content, expect stricter copyright enforcement. If they’re seen as research tools, fair use might hold up.

Industry-Specific Market Harm Considerations

  1. News & Journalism – AI-generated summaries may reduce clicks on original articles, hurting ad revenue and subscriptions (New York Times v. OpenAI argues AI responses replace direct readership).
  2. Book Publishing – Authors claim AI-generated text could compete with traditional books and summaries (Authors Guild v. OpenAI argues AI models reduce demand for original works).
  3. Education & Academic Publishing – AI-generated study materials could cut into textbook sales (Pearson v. OpenAI claims AI-generated content could replace traditional textbooks).
  4. Creative Writing & Film – AI-generated scripts or novels could impact demand for human writers (Writers Guild v. OpenAI and Martin v. OpenAI argue AI mimicking authors threatens their markets).

The Future of AI and Copyright Law

Current lawsuits (New York Times v. OpenAI, Authors Guild v. OpenAI) will set precedents for AI copyright law. Possible outcomes include:

  • AI training as fair use – If courts find AI models transformative and non-substitutive.
  • AI training as infringement – If courts rule that it undermines a viable licensing market.
  • New licensing systems – Like how music royalties work, AI companies may have to pay creators.

Wrapping It Up

So, what’s the big takeaway? AI and copyright law are in a messy, ongoing battle. Will AI companies get a free pass under fair use, or will copyright holders demand licensing fees? We don’t know yet, but these decisions will shape the future of AI.

My bet? AI companies will create new markets where content creators can contribute and get paid—like YouTube does for video creators. Instead of just scraping data, AI firms will likely find ways to reward quality content, making it a win-win for tech and creatives alike.

Friday, September 6, 2024

Developing Grading Rubrics using Docent

When I explain the concept of Docent, a common first question I hear is: "If AI grades assignments, can't students just use AI to do their homework?" They imagine a scenario where professors create assignments with AI, students complete them with AI, and graders assess them with AI as well.

But, let me clear that up—Docent doesn't work like that.

While we were crafting Docent, we figured out we really needed two things to make AI grading effective:
  1. A "gold answer," which is basically the perfect solution to the assignment.
  2. A "grading rubric," which is a guide on how to deduct points for mistakes.
The "gold answer" is our way of ensuring that even if an AI is doing the grading, students can't just whip up another AI to spit out the right answers. (For the future, we're considering adding a feature where Docent can tell if an assignment is easily solvable by an LLM, without having access to the gold answer.)

Now, developing a comprehensive grading rubric is a bit trickier. It's hard to anticipate all the ways students might slip up. In an "AI-less" setting, we usually end up tweaking the rubric gradually, based on what we see after the assignment has been run a few times.

How can an LLM make our life easier? Docent is great when it comes to building these rubrics. Since it can grade hundreds of assignments at once, we quickly spot the common mistakes by simply asking Docent to grade everything and flag the errors it finds. We review the identified mistakes and add them to the rubric. After adjusting the rubric, we ask Docent for a re-grade, and voila! After a few rounds, we end up with a solid rubric that catches most errors.
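
A sketch of that loop in Python, where docent_grade is a hypothetical stand-in for the actual Docent grading call:

    def refine_rubric(submissions, gold_answer, rubric, docent_grade, rounds=3):
        """Iteratively grow the rubric from the mistakes Docent surfaces.

        docent_grade(submission, gold_answer, rubric) is a hypothetical wrapper
        around the Docent grading call; it returns something like
        {"score": 85, "mistakes": ["off-by-one in the GROUP BY", ...]}.
        """
        for _ in range(rounds):
            mistakes = []
            for sub in submissions:
                mistakes.extend(docent_grade(sub, gold_answer, rubric)["mistakes"])

            # Add mistakes the rubric does not cover yet (in practice we review
            # them by hand before turning them into deduction rules).
            new_items = sorted(set(mistakes) - set(rubric))
            if not new_items:
                break  # the rubric has converged
            rubric = rubric + new_items

        return rubric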

One additional cool thing about this whole process? Docent can summarize feedback from all the submissions and we can create a report on the most common slip-ups. We take this back to the classroom to chat about the tricky parts of the assignment and help everyone learn better.

It's like having a super-hard-working assistant who may not know how to grade at the beginning but is always willing and eager to help. They never complain if you ask them to regrade assignments, summarize findings, or provide feedback. 

Use Docent, be lazy, and teach smarter, not harder!

Thursday, September 5, 2024

Grading with AI: Introducing Docent

TL;DR

An alpha version of Docent, our experimental AI-powered grading system, is now available at https://get-docent.com/. If you're interested in using the system, please contact us for support.

The Challenge of Grading

One thing that I find challenging when teaching is grading, especially in large classes with numerous assignments. The task is typically delegated to teaching assistants with varying levels of expertise and enthusiasm. One particular challenge is getting TAs to provide detailed, constructive feedback on assignments.

Our Experiment with LLMs

With the introduction of LLMs, we began exploring their potential to enhance the grading process. Our primary goal wasn't to replace human graders but to provide students with detailed, personalized feedback—effectively offering an on-demand tutor and addressing "Bloom's two-sigma problem":

"The average student tutored one-to-one using mastery learning techniques performed two standard deviations better than students educated in a classroom environment."

To evaluate the effectiveness of LLMs in grading, we used a dataset of 12,546 student submissions from a Business Analytics course spanning six academic semesters. We used human-assigned grades as our benchmark.

Good Quantitative Results

Our findings revealed a remarkably low discrepancy between LLM-assigned and human grades. We tested various LLMs using different approaches:

  • With and without fine-tuning
  • Zero-shot and few-shot learning

While fine-tuning and few-shot approaches showed slight improvements, we were amazed to find that GPT-4 with zero-shot learning achieved a median error of just 0.6% compared to human grading. In practical terms, if a human grader assigned 80/100 to an assignment, the LLM's grade typically fell within the 79.5-80.5 range—a striking consistency with human grading.
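
For concreteness, a sketch of the comparison behind these numbers, assuming per-submission LLM and human grades on a 0-100 scale:

    from statistics import median

    def median_relative_error(llm_grades, human_grades):
        """Median of |LLM - human| / human, as a percentage (human grades > 0)."""
        return 100 * median(abs(l - h) / h
                            for l, h in zip(llm_grades, human_grades))

    # Example: a human grade of 80/100 with an LLM grade of 79.6 is a 0.5% error,
    # consistent with the 79.5-80.5 band described above.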

Qualitative Feedback: Where AI Shines

LLMs excel at providing qualitative feedback. For example, in this ChatGPT thread, you can see the detailed feedback the LLM provided for an SQL question in a database course. Much better and more detailed than what any human grader would realistically ever provide.

Real-World Implementation: Docent

Encouraged by these results, we implemented Docent to assist human graders in our Spring and Summer 2024 classes. We also conducted a user study to assess the perceived helpfulness of LLM-generated comments. However, during deployment, we identified several areas for improvement:

  1. Excessive Feedback: The LLM often provides too much feedback, striving to find issues even in near-perfect assignments. 
  2. Difficulty with Negation: Despite clear grading guidelines, LLMs struggle to ignore specified minor shortcomings. See below :-) 


  3. Multi-Part Assignment Challenges: For assignments with multiple questions, grading each question separately yields better results than assessing the entire assignment at once.
  4. Inconsistent Performance: While median performance is excellent, about 5-10% of assignments receive imperfect grades (compared to a human), leading to student appeals.

Current Status and Recommendations

Based on our experiences, here are our current recommendations for using AI in grading:

  1. Human Supervised Use: Grading using LLMs is best used as a tool for teaching assistants, who should review and adjust the AI-generated grades and feedback before releasing them to students.
  2. Caution in High-Stakes Scenarios: We advise against using AI for high-stakes grading, such as final exams, until we achieve greater robustness across all submissions.
  3. Ideal for Low-Stakes Assignments: LLM-based feedback is well-suited for low-stakes assignments and practice questions, where even imperfect feedback improves the current status quo.

Try Docent

To facilitate experimentation with AI-assisted grading, we've deployed an alpha version of Docent at https://get-docent.com/. If you're interested in using the system, please contact us for support and guidance.