A Computer Scientist in a Business School

Friday, September 6, 2024

Developing Grading Rubrics using Docent

When I explain the concept of Docent, a common first question I hear is: "If AI grades assignments, can't students just use AI to do their homework?" They imagine a scenario where professors create assignments with AI, students complete them with AI, and graders assess them with AI as well.

But let me clear that up: Docent doesn't work like that.

While we were crafting Docent, we figured out we really needed two things to make AI grading effective:

A "gold answer," which is basically the perfect solution to the assignment.
A "grading rubric," which is a guide on how to deduct points for mistakes.

The "gold answer" is our way of ensuring that even if an AI is doing the grading, students can't just whip up another AI to spit out the right answers. (For the future, we're considering adding a feature where Docent can tell if an assignment is easily solvable by an LLM, without having access to the gold answer.)

Now, developing a comprehensive grading rubric is a bit trickier. It's hard to guess all the ways students might slip up. In an "AI-less setting," we usually end up tweaking the rubric several times, based on what we see after the assignment has been run a couple of times.

How can an LLM make our life easier? Docent is great when it comes to building these rubrics. Since it can handle grading hundreds of assignments at once, we quickly spot the common mistakes by simply asking Docent to grade the assignments and find the mistakes. We look at the identified mistakes, and we add them to the rubric. After adjusting the rubric, we ask Docent for a re-grade, and voilà! After a few rounds, we end up with a solid rubric that catches most errors.

One additional cool thing about this whole process? Docent can summarize feedback from all the submissions and we can create a report on the most common slip-ups. We take this back to the classroom to chat about the tricky parts of the assignment and help everyone learn better.
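To make the "spot the common mistakes" step concrete, here is a minimal sketch of how flagged mistakes could be tallied across submissions. The data structure, the mistake labels, and the `common_mistakes` helper are all hypothetical illustrations; this is not Docent's actual output format or API.

```python
from collections import Counter

# Hypothetical grader output: one list of mistake labels per submission.
# The labels and the structure are illustrative, not Docent's real format.
graded = [
    ["missing GROUP BY", "wrong join type"],
    ["missing GROUP BY"],
    [],
    ["off-by-one in date filter", "missing GROUP BY"],
]


def common_mistakes(graded_submissions, min_share=0.25):
    """Return (mistake, share of submissions) pairs at or above min_share."""
    n = len(graded_submissions)
    counts = Counter()
    for mistakes in graded_submissions:
        # Count each mistake at most once per submission.
        counts.update(set(mistakes))
    return [(m, c / n) for m, c in counts.most_common() if c / n >= min_share]


report = common_mistakes(graded)
for mistake, share in report:
    print(f"{mistake}: {share:.0%} of submissions")
```

Mistakes that show up in a large share of submissions are natural candidates for new rubric items, and the same tally can feed the classroom report.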

It's like having a super-hard-working assistant who may not know how to grade at the beginning but is always willing and eager to help. They never complain if you ask them to regrade assignments, summarize findings, or provide feedback.

Use Docent, be lazy, and teach smarter, not harder!

Thursday, September 5, 2024

Grading with AI: Introducing Docent

TL;DR

An alpha version of Docent, our experimental AI-powered grading system, is now available at https://get-docent.com/. If you're interested in using the system, please contact us for support.

The Challenge of Grading

One thing that I find challenging when teaching is grading, especially in large classes with numerous assignments. The task is typically delegated to teaching assistants with varying levels of expertise and enthusiasm. One particular challenge is getting TAs to provide detailed, constructive feedback on assignments.

Our Experiment with LLMs

With the introduction of LLMs, we began exploring their potential to enhance the grading process. Our primary goal wasn't to replace human graders but to provide students with detailed, personalized feedback, effectively offering an on-demand tutor and addressing "Bloom's two-sigma problem":

"The average student tutored one-to-one using mastery learning techniques performed two standard deviations better than students educated in a classroom environment."

To evaluate the effectiveness of LLMs in grading, we used a dataset of 12,546 student submissions from a Business Analytics course spanning six academic semesters. We used human-assigned grades as our benchmark.

Good Quantitative Results

Our findings revealed a remarkably low discrepancy between LLM-assigned and human grades. We tested various LLMs using different approaches:

With and without fine-tuning
Zero-shot and few-shot learning

While fine-tuning and few-shot approaches showed slight improvements, we were amazed to find that GPT-4 with zero-shot learning achieved a median error of just 0.6% compared to human grading. In practical terms, if a human grader assigned 80/100 to an assignment, the LLM's grade typically fell within the 79.5-80.5 range—a striking consistency with human grading.
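For readers who want to run the same kind of comparison on their own grades, here is a minimal sketch of a median-error calculation. It assumes the error is measured relative to the human grade (which matches the 80/100 example above); the grades below are made up for illustration and are not from our dataset.

```python
from statistics import median


def median_relative_error(human, llm):
    """Median of |llm - human| / human, expressed as a percentage."""
    errors = [abs(l - h) / h * 100 for h, l in zip(human, llm) if h > 0]
    return median(errors)


# Made-up grades on a 0-100 scale, for illustration only.
human_grades = [80, 92, 70, 100, 65]
llm_grades = [80.4, 91, 70, 97, 65.5]

error = median_relative_error(human_grades, llm_grades)
print(f"Median relative error: {error:.2f}%")
```

The median is deliberately robust to a few badly graded submissions, so it is worth looking at the tail of the error distribution as well.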

Qualitative Feedback: Where AI Shines

LLMs excel at providing qualitative feedback. For example, in this ChatGPT thread, you can see the detailed feedback the LLM provided for an SQL question in a database course. The feedback is much better and more detailed than anything a human grader was ever going to provide.

Real-World Implementation: Docent

Encouraged by these results, we implemented Docent to assist human graders in our Spring and Summer 2024 classes. We also conducted a user study to assess the perceived helpfulness of LLM-generated comments. However, during deployment, we identified several areas for improvement:

Excessive Feedback: The LLM often provides too much feedback, striving to find issues even in near-perfect assignments.
Difficulty with Negation: Despite clear grading guidelines, LLMs struggle to ignore specified minor shortcomings. See below :-)
Multi-Part Assignment Challenges: For assignments with multiple questions, grading each question separately yields better results than assessing the entire assignment at once.
Inconsistent Performance: While median performance is excellent, about 5-10% of assignments receive imperfect grades (compared to a human), leading to student appeals.

Current Status and Recommendations

Based on our experiences, here are our current recommendations for using AI in grading:

Human-Supervised Use: LLM grading works best as a tool for teaching assistants, who should review and adjust the AI-generated grades and feedback before releasing them to students.
Caution in High-Stakes Scenarios: We advise against using AI for high-stakes grading, such as final exams, until we achieve greater robustness across all submissions.
Ideal for Low-Stakes Assignments: LLM-based feedback is well-suited for low-stakes assignments and practice questions, where even imperfect feedback improves on the status quo.

Try Docent

To facilitate experimentation with AI-assisted grading, we've deployed an alpha version of Docent at https://get-docent.com/. If you're interested in using the system, please contact us for support and guidance.

Thursday, January 18, 2024

The PiP-AUC score for research productivity: A somewhat new metric for paper citations and number of papers

Many years back, we conducted some analysis on how the number of citations for a paper evolves over time. We noticed that while the raw number of citations tends to be a bit difficult to estimate, if we calculate the percentile of citations for each paper, based on the year of publication, we get a number that stabilizes very quickly, even within 3 years of publication. That means we can estimate the future potential of a paper rather quickly by checking how it is doing against other papers of the same age. The percentile score of a paper is a very reliable indicator of its future.
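As a rough illustration of the percentile idea, the sketch below ranks a paper's citation count against other papers published in the same year. The cohort data is invented, and the mid-rank handling of ties is an assumption; the app may compute percentiles differently.

```python
from bisect import bisect_left, bisect_right


def citation_percentile(citations, cohort_citations):
    """Percentile (0-100) of `citations` among papers from the same year.

    Uses the mid-rank convention, so tied papers share a percentile.
    """
    ranked = sorted(cohort_citations)
    below = bisect_left(ranked, citations)
    equal = bisect_right(ranked, citations) - below
    return 100.0 * (below + 0.5 * equal) / len(ranked)


# Invented citation counts for papers published in two different years.
cohorts = {
    2021: [0, 1, 3, 3, 8, 15, 40, 120],
    2010: [2, 10, 25, 60, 90, 200, 450, 1500],
}

recent = citation_percentile(40, cohorts[2021])
older = citation_percentile(40, cohorts[2010])
print(recent, older)
```

The same raw count of 40 citations lands at a very different percentile depending on the paper's age, which is why comparing within a publication year is more informative than comparing raw counts.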

To make it easy for everyone to check the percentile scores of their papers, we created a small app at

https://scholar.ipeirotis.org/

that allows anyone to search for a Google Scholar profile and then calculate the percentile scores of each paper. We then take all the papers for an author, calculate their percentile scores, and sort them in descending order based on their scores. This generates a plot like this, with the paper percentile on the y-axis and the paper rank on the x-axis.

Then, an obvious next question came up: How can we also normalize the x-axis, which shows the number of papers?

Older scholars have more years to publish, giving them more chances to write high-percentile papers. To control for that, we also calculated the percentiles for the number of papers published, using a dataset of around 15,000 faculty members at top US universities. The plot below shows how the percentiles for the number of publications evolve over time.

Now, we can use the percentile scores for the number of papers published to normalize the x-axis as well. Instead of showing the raw number of papers on the x-axis, we normalize paper productivity against the percentile benchmark shown above. The result is a graph like this for the superstar Jure Leskovec

and a less impressive one for yours truly:

Now, with a graph like this, with the x and y axes being normalized between 0 and 1, we have a nice new score that we have given the thoroughly boring name "Percentile in Percentile Area Under the Curve" score, or PiP-AUC for short. It is a score that ranges between 0 and 1, and you can play with different names to see their scores.
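The exact computation behind the score is not spelled out here, so the sketch below is only one plausible reading of "area under the curve with both axes normalized." It assumes that the paper at rank k is placed at the share of same-age peers who have at most k papers, and that the curve is a step function. The `pip_auc` function and the peer counts are hypothetical.

```python
from bisect import bisect_right


def pip_auc(paper_percentiles, peer_paper_counts):
    """Area under a step curve of paper percentile vs. productivity percentile.

    paper_percentiles: citation percentiles in [0, 1], one per paper.
    peer_paper_counts: paper counts of peers at the same career age
                       (a hypothetical benchmark sample).
    """
    ys = sorted(paper_percentiles, reverse=True)
    peers = sorted(peer_paper_counts)
    area, prev_x = 0.0, 0.0
    for rank, y in enumerate(ys, start=1):
        # Assumed x position: share of peers with at most `rank` papers.
        x = bisect_right(peers, rank) / len(peers)
        area += y * (x - prev_x)
        prev_x = x
    return area


# Invented numbers: two papers, four peers with 1 to 4 papers each.
score = pip_auc([0.8, 0.4], [1, 2, 3, 4])
print(round(score, 3))
```

Under these assumptions the score reaches 1 only when an author out-publishes every peer and every paper sits at the top percentile, which is consistent with the score being a strict one.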

~~At some point, we may also calculate the percentile scores of the PiP scores, but we will do that in the future. :-)~~ UPDATE: If you are also curious about the percentiles for the PiP-AUC scores, here is the distribution:

The x-axis shows the PiP-AUC score, and the y-axis shows the corresponding percentile. So, if you have a PiP-AUC score of 0.6, you are in the top 25% (i.e., the 75th percentile) for that metric. With a score of 0.8, you are in the top 10% (i.e., the 90th percentile), etc.

In general, the tool is helpful when trying to understand the impact of newer work published in the last few years. Especially for people with many highly cited but old papers, the percentile scores are very helpful for quickly finding the newer gems. I also like the PiP-AUC scores and plots, as they offer a good balance of overall productivity and impact. Admittedly, it is a strict score, so it is not especially bragging-worthy most of the time :-)

(With thanks to Sen Tian and Jack Rao for their work.)

Tuesday, October 18, 2022

Tell these fucking colonels to get this fucking economist out of jail.

Today is October 18th. It has been 41 years since Greece elected Andreas Papandreou as prime minister with 48% of the vote, fundamentally changing the course of history for Greece. Whether for better or for worse is still debated, but the change was real.

On October 6th, Roy Radner passed away at the age of 95. He was a faculty member in our department and a famous microeconomist with a highly distinguished career. Many others have written about him and his accomplishments as an economist and academic, so I will not try to do the same.

But Roy also played an important role in making that election in 1981 possible. Why? Let me tell you his story.

"Geographic Footprint of an Agent" or one of my favorite data science interview questions

Last week we wrote in the Compass blog how we estimate the geographic footprint of an agent.

At the very core, the technique is simple: Use the addresses of the houses that an agent has bought or sold in the past; get their longitude and latitude; and then apply a 2-dimensional kernel density estimation to find the areas where the agent is likely to be active. Doing the kernel density estimation is easy; the fundamentals of our approach are material that you can find in tutorials for applying a KDE. There are two twists that make the approach more interesting:

How can we standardize the "geographic footprint" score to be interpretable? The density scores that come back from a kernel density application are very hard to interpret. Ideally, we want a score from 0 to 1, with 0 being "completely outside of the area of activity" and 1 being "as important as it gets". We show how to use a percentile transformation of the likelihood values to create a score that is normalized, interpretable, and very well calibrated.
What are the metrics for evaluating such a technique? We show how we can use the concept of "recall-efficiency" curves to provide a common way to evaluate the models.
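For the first twist, here is a minimal, dependency-free sketch of a 2-dimensional Gaussian KDE followed by a percentile transformation. It is not the production implementation: the fixed bandwidth, the treatment of latitude and longitude as plain Euclidean coordinates, and the choice to rank densities against the agent's own past deal locations are all assumptions made for illustration, and the coordinates are invented.

```python
import math
from bisect import bisect_right


def make_kde(points, bandwidth):
    """2-D Gaussian KDE over (lat, lon) points with a fixed, isotropic bandwidth."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(lat, lon):
        total = 0.0
        for p_lat, p_lon in points:
            d2 = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
            total += math.exp(-d2 / (2.0 * bandwidth ** 2))
        return norm * total

    return density


def make_footprint_score(points, bandwidth=0.01):
    """Return a 0-1 score: the share of past deals whose density is <= the query's."""
    density = make_kde(points, bandwidth)
    reference = sorted(density(lat, lon) for lat, lon in points)

    def score(lat, lon):
        return bisect_right(reference, density(lat, lon)) / len(reference)

    return score


# Invented deal locations: a tight cluster plus one outlier.
deals = [
    (40.730, -73.990),
    (40.732, -73.992),
    (40.728, -73.991),
    (40.731, -73.989),
    (40.760, -73.970),
]
score = make_footprint_score(deals)

print(score(40.730, -73.990))  # in the middle of the cluster
print(score(40.760, -73.970))  # at the outlier deal
print(score(41.500, -75.000))  # far from any past deal
```

Raw density values depend on the bandwidth and on how many deals an agent has, so they are hard to compare across agents. Ranking a location's density against the densities at the agent's own deals gives a number between 0 and 1 that reads the same way for every agent.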

You can read more in the blog post.