In my previous post, I described how working with AI agents felt like managing an infinitely large, infinitely diligent team. I wrote about pairing Claude with GitHub, giving it context files and task lists, and watching it come back with actual deliverables.
After that post, I got a lot of questions about how to actually set this up, even from people I assumed were already using this kind of workflow. It turns out this is far less common knowledge than I thought. (I guess I am spending too much time reading social media.)
So this post is a step-by-step guide for those who still use AI tools in the "chat" form and want to try a first "agentic AI" setup. The goal here is not to turn the AI into a software engineer, but to make it your project manager and your team of research assistants.
We will set up a GitHub repository, configure Claude Code on the Web, and build a workflow where AI plans (or does) the work and you do the reviewing.
One caveat: while you do not need to know how to code, familiarity with software development practices will help. Not the programming itself, but the process: how developers organize projects, track changes, review each other's work. This post will walk you through those practices.
First, though, let me explain why this setup is so powerful.
The real trick: The repo is the context
Here is the problem with using AI through a regular chat interface. Every time you start a new conversation, you are starting from zero. You paste in your document, re-explain what the project is about, remind the AI where you left off, describe what needs to happen next. It is like hiring a brilliant contractor who gets amnesia every morning.
GitHub solves this. When Claude Code connects to your repository, it does not just see your files. It sees everything: the project structure, the notes about what the project is, the task list, the record of what has already been done, the decisions you have made along the way... All of it, sitting right there in the repo, ready to be read.
This means your prompt for most interactions becomes absurdly simple:
"Let's work on the next most important task."
That is all. Claude reads your CLAUDE.md to understand the project. It reads your TASKS.md to figure out what needs doing. It looks at the existing files to understand the current state. And then it gets to work. No pasting. No re-explaining. No "as I mentioned in our previous conversation..." The repository is the conversation. It is the memory. It is the context.
Did the mention of CLAUDE.md and TASKS.md make you worry that this is some black magic? Nah, these are just regular text files, written in plain English. We will describe them next.
Wait, what is Claude Code on the Web?
First, some context. Claude Code started as a command-line tool. You would install it on your computer, open a terminal, and type commands. Powerful, but intimidating if you are not a developer.
Then Anthropic launched Claude Code on the Web. Now you can do the same thing directly from your browser. You connect a GitHub repository, give Claude a task, and it clones your repo, writes code (or documents, or reports, or whatever you need), and pushes the changes to a branch. You review the changes, approve them, and merge. All from a web interface. No installation.
Claude Code on the Web operates inside a real computing environment called a "sandbox". It can read your files, create new ones, run scripts, and push changes to GitHub. Rather than replying in plain text, it often writes small programs to carry out its tasks. It does work. Real work. The kind you would normally delegate to a research assistant or a junior colleague.
The 10-minute setup: GitHub + Claude Code
OK, let us build this from scratch. I will assume you have zero GitHub experience.
Step 1: Create a GitHub account and a repository.
Go to github.com and sign up. Then create a new repository: click the green "New" button, give it a name (something like my-research-project or quarterly-report), make sure to set it to Private (not Public, unless you want the whole internet reading your drafts), and check "Add a README file." That last part matters. Write a short description of your project in the README. Even a couple of sentences is fine. This initializes the repo so that Claude Code can actually work with it. (An empty, uninitialized repo will cause problems.)
Step 2: Connect your repo to Claude Code.
Go to claude.ai and open Claude Code (it is in the left sidebar, or you can go directly to claude.ai/code). Start a new session and connect your GitHub repository. You can paste your repo URL directly or use the built-in GitHub integration to browse your repositories. The first time, Claude will ask you to authenticate with GitHub (a one-time OAuth flow) and to install the Claude GitHub app on the repo (this is what allows Claude to write to it). Select the repo you just created.
Now Claude Code can see your files, and more importantly, it can change them.
At this point, you can upload any files you already have for the project, or defer that step for later and move on.
Step 3: Let Claude set up your project.
This is where it gets interesting. CLAUDE.md is a special file that Claude reads at the start of every session. It is the project's "master plan": what the project is about, how it is organized, what conventions to follow. But you do not need to know what it should look like. Just describe your project in plain language:
"This repo contains the data and analysis from our AI-powered oral examination system, which I wrote up as a blog post. I want to turn this into a research paper for submission to Communications of the ACM. The data and some initial analysis scripts are already in the repo. Set up the project structure for a CACM submission and create a CLAUDE.md file."
Claude will read through the existing files, figure out what is there, organize everything into a sensible structure, and create a CLAUDE.md that might look something like this:
# Project: AI-Powered Oral Examinations at Scale
## Overview
Research paper for Communications of the ACM describing our system
for conducting and grading oral examinations using conversational AI
agents and a multi-LLM grading approach.
## Submission Details
- **Journal**: Communications of the ACM
- **Format**: ACM `acmart` document class, `acmsmall` style
- **Length limit**: 12,000 words including references
- **Style**: Author-year citations (natbib)
## Structure
- `/paper/` - LaTeX source files and ACM style files
- `/data/` - Exam transcripts, grading data, survey responses
- `/analysis/` - Python scripts for statistical analysis
- `/figures/` - Generated plots (PDF format, generated from scripts)
- `/blog/` - Original blog post and supporting materials
## Conventions
- All figures must be generated from scripts in `/analysis/`,
never created manually
- Use BibTeX for references (`references.bib`)
- Data files are never edited directly; all transformations
happen through scripts in `/analysis/`
- Student data must be anonymized in all outputs
## Current Status
See TASKS.md for the current task list and priorities.
Notice: you did not write any of this. You described your project, and Claude produced the project master plan. You review it, maybe tweak a couple of things. Done.
Step 4: Create your TASKS.md file.
This is your project's to-do list. But unlike a regular to-do list, it serves double duty: it tells Claude what needs to be done and keeps a record of what has been completed. Ask Claude to create it:
"Create a TASKS.md file with the following initial tasks..."
Here is what one might look like:
# Tasks
## In Progress
- [ ] E1. Expand blog analysis into formal experimental evaluation
- [ ] E2. Inter-rater reliability analysis (human vs. LLM council grades)
## To Do
- [ ] E3. Create Figure 1 (grade distribution across grading methods)
- [ ] R1. Write Related Work section (AI in assessment, LLM-as-judge)
- [ ] D2. Analyze anti-cheating detection rates
- [ ] Z3. Check word count against CACM 12,000-word limit
## Done
- [x] Z1. Set up project structure from blog post materials
- [x] D1. Anonymize student data
- [x] I1. Write Introduction draft
Now here is the magic. You can simply tell Claude: "Work on the next task in TASKS.md." Claude reads the file, picks the next item, does the work, updates the task status, and creates a pull request with its changes. If you are not familiar with pull requests, more on those in a moment.
Pull requests: Redlined documents, and not just for coders
Now for the part that is unfamiliar to people who are not software engineers: the "pull request".
If you have ever received a redlined document from a lawyer, or reviewed tracked changes in a Word file, you already understand pull requests. The concept is that simple: someone proposes changes, you review them before they get incorporated into the main document.
In GitHub, it works like this:
- Claude does its work on a separate branch (a parallel copy of your project).
- When it is done, it creates a pull request (PR), which says: "Here are the changes I made. Want to incorporate them?"
- You see a clean diff view showing exactly what was added, removed, or modified. Green lines are additions. Red lines are deletions.
- You review. You can approve, request modifications, or reject.
- If you approve, you click "Merge" and the changes become part of the main project.
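If you are curious what those buttons do under the hood, the branch-and-merge mechanics can be mimicked locally with plain git. Here is a toy sketch (it runs in a scratch directory; the branch name claude/expand-intro and file paper.md are made up for illustration):

```python
import os
import subprocess
import tempfile

def git(*args):
    """Run a git command in the current directory, raising on failure."""
    subprocess.run(["git", *args], check=True, capture_output=True)

os.chdir(tempfile.mkdtemp())
git("init", "-q")
git("config", "user.email", "you@example.com")
git("config", "user.name", "You")

# The initial draft lives on the default branch.
open("paper.md", "w").write("Introduction: first draft.\n")
git("add", "paper.md")
git("commit", "-qm", "Initial draft")
base = subprocess.run(["git", "rev-parse", "--abbrev-ref", "HEAD"],
                      capture_output=True, text=True, check=True).stdout.strip()

# Claude's work happens on a separate branch (the "parallel copy").
git("checkout", "-qb", "claude/expand-intro")
open("paper.md", "w").write("Introduction: expanded draft with related work.\n")
git("commit", "-qam", "Expand introduction")

# Clicking "Merge" on the PR is, mechanically, merging the branch back.
git("checkout", "-q", base)
git("merge", "-q", "--no-ff", "-m", "Merge: expand introduction",
    "claude/expand-intro")
print(open("paper.md").read().strip())
```

On the web, GitHub renders the diff between the two branches as the redlined "Files changed" view, and the merge button performs that final step for you.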
This is the standard process used by every software team in the world. And it works for any kind of knowledge work that relies on text. Research papers. Reports. Course materials. Business proposals. Anything that lives in files. Ideally, you want those files to be text files rather than binary ones: LaTeX and Markdown work well; PowerPoint files, not so much. In the future we may have better tooling for reviewing changes in Office files and other formats, but for now the process works best for text-based files.
Fair warning: the GitHub interface will look busy the first time you open a pull request. Do not panic. Just look for the "Files changed" tab to see the redlines, and the big green "Merge pull request" button when you are ready to accept.
The critical point: you never edit the files directly. You describe what you want, Claude proposes changes, and you review and approve. You are the manager. Claude is the diligent employee who comes back with deliverables for you to inspect. And the audit trail is far better than "Track Changes" in Word ever was.
A real example: From CSV to submission-ready in two hours
Let me show you how this plays out in practice, with an example from last month.
I was working on a paper that had a case study section (say, Section 8) where we discussed results from a partner's dataset, but we only had the final business conclusions, not a full experimental analysis. The rest of the paper (say, Section 7) had a proper, thorough analysis on a different dataset: figures, tables, bootstrap confidence intervals, the works. By comparison, the case study in Section 8 was the weak sibling, and reviewers had flagged that. We had received a detailed dataset from our partners, but it required work. My TASKS.md had this sitting in it:
## Backlog
- [ ] F5. AML dataset analysis
- [ ] G1. Complete §8 rewrite with AML dataset
I uploaded the CSV to the repo and told Claude:
"Here is the AML dataset. Replicate the analysis from Section 7 but now for Section 8. Use the existing details from Section 8 as the background and framing, conduct the full experimental analysis, and generate a new Section 8."
Claude read Section 7 to understand the methodology. It read the existing Section 8 to understand the framing and context. It wrote Python scripts to process the AML data, generated four figures and three tables with bootstrap confidence intervals, wrote the new section text with all quantities pulled from the analysis scripts, and submitted a pull request with everything.
Less than an hour. I spent another hour reviewing the PR, checking the code, leaving comments ("clarify this axis label," "move this paragraph before the table", "I do not think the conclusions follow from the results"), and merging.
Two hours total. For a PhD student, this would have been a few days of work, easily. And here is the part that matters: every single number in that section was generated through a Python script. Every figure had a script that produced it. Reproducibility was built in from the start, not bolted on after the fact. The pull request showed me exactly what was added: the scripts, the outputs, the LaTeX changes. I could trace every claim back to the code that produced it.
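To make the "every number comes from a script" idea concrete, here is a minimal sketch of the pattern (the scores, the macro names, and the file numbers.tex are all invented for illustration, not from the actual paper): compute a bootstrap confidence interval, then write the results out as LaTeX macros so the paper text never contains a hand-typed value.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-exam accuracy values; in a real repo these would be
# loaded from files in /data/.
scores = [0.82, 0.91, 0.78, 0.88, 0.95, 0.73, 0.86, 0.90, 0.81, 0.87]

def bootstrap_ci(values, n_resamples=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    means = sorted(
        statistics.mean(random.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(scores)

# Emit LaTeX macros; the paper then says \meanscore and \scoreci
# instead of hard-coded numbers.
with open("numbers.tex", "w") as f:
    f.write(f"\\newcommand{{\\meanscore}}{{{statistics.mean(scores):.2f}}}\n")
    f.write(f"\\newcommand{{\\scoreci}}{{[{lo:.2f}, {hi:.2f}]}}\n")
```

Regenerate the data, rerun the script, recompile the paper: every quantity updates together, and every claim stays traceable to the code that produced it.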
Needless to say, I remain fully accountable for any bugs or errors. At the end of the day, I reviewed the scripts, the results, and the text. What I can say is that even if there are errors, they are not "hallucinations" where the LLM filled in random numbers or references. The figures are Python-generated from the raw data, and so are the tables and the numbers in the text. Errors can still come from bugs or other oversights. But we should stop calling all AI errors "hallucinations". At this point, they are not the errors of a "bullshitter in chief" (a title aptly earned by early LLMs); they are the kinds of errors a junior colleague might make while carefully executing a well-defined task: misreading a specification, applying a method slightly outside its intended scope, or missing an edge case that a more seasoned eye would have caught.
Beyond software: Why this works for all knowledge work
I want to be explicit about something: this is not just for code. GitHub repositories can hold any kind of file. Markdown documents, LaTeX papers, CSV data files, images, PDFs. The pull request workflow works for anything.
Writing a consulting report? Put the markdown draft in /report/, the supporting analysis in /data/, the charts in /figures/. Claude generates the analysis, creates the figures, and drafts sections of the report, all as reviewable pull requests.
Same idea for course materials (I use this with my exit tickets workflow), business plans, grant proposals. You define the project structure, you maintain a task list, and you let the agent do the work while you review proposals. Standard software engineering practice, applied to everything.
Leveling up: More files for better project management
Once you get comfortable with CLAUDE.md and TASKS.md, you can add more structure. The files I have found most universally useful are these three:
- SCHEDULE.md — Deadlines and milestones. "The submission deadline is March 15" becomes a constraint that shapes which tasks get prioritized.
- DECISIONS.md — Key choices and their rationale. "We decided to use three LLMs in the grading council instead of five because the marginal improvement was negligible." Prevents you and Claude from relitigating settled questions two weeks later.
- STYLEGUIDE.md — Your writing preferences. "Never use em-dashes," "Never use fluffy adjectives," "Avoid claims not supported by data or citations." Good trick: give Claude a few pieces of your favorite writing and ask it to generate a style guide that mimics your voice. Then drop it in the repo.
Beyond these, there are files worth adding for specific situations:
- CHANGELOG.md — Human-readable log of what changed each session. Especially useful when preparing a response to reviewers.
- BLOCKERS.md — Things waiting on someone external. Makes it easy to send a collaborator a list of "here is what I need from you."
- FEEDBACK.md — Running log of all feedback received, formal and informal, with status: pending, accepted, or rejected with rationale.
- SOURCES.md — Annotated bibliography: what each source is useful for, how reliable it is, which sections cite it.
- GLOSSARY.md — Keeps terminology consistent across a long document. Claude consults it and adds new terms as they come up.
- DEPENDENCIES.md — Maps how artifacts depend on each other. Lets Claude flag when an upstream change invalidates something downstream.
You do not need all of these on day one. Start with CLAUDE.md and TASKS.md. Add CHANGELOG.md when editing a paper that came back with revisions. Add the rest as your project grows and you find yourself needing them.
To be fair, this is a bit of a hack. We are simulating standard project management tools using plain markdown files. Scanning text files for task lists and decisions is not exactly elegant. And I have serious doubts that this can scale for projects involving hundreds of people. But it works for now, with tools that exist today, for the projects that I am working on.
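To see just how low-tech the hack is, here is roughly what "find the next task" amounts to (a sketch with made-up TASKS.md content; a checkbox line scan, nothing more):

```python
import re

# Hypothetical TASKS.md content; in the workflow above, Claude reads
# the real file from the repo.
tasks_md = """\
# Tasks

## To Do
- [ ] E3. Create Figure 1 (grade distribution across grading methods)
- [ ] R1. Write Related Work section

## Done
- [x] Z1. Set up project structure
"""

def next_open_task(text):
    """Return the first unchecked checkbox item, or None if all are done."""
    for line in text.splitlines():
        m = re.match(r"-\s*\[ \]\s*(.+)", line.strip())
        if m:
            return m.group(1)
    return None

print(next_open_task(tasks_md))
```

That fragility is exactly why purpose-built, structured tools will eventually replace the markdown files. But a format this simple is also trivially readable by both humans and agents, which is why the duct tape holds.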
In the future, agents will have proper interfaces: structured databases, purpose-built PM tools designed for agents to read and write directly, not markdown files they have to parse every session. We are in the duct-tape-and-baling-wire phase. It is fine. The duct tape holds.
The awkward part (and why it is worth it)
If you are not a software engineer, this workflow feels strange at first. You are used to opening a document and typing. Now you are writing instructions, waiting for an AI to propose changes, and clicking "Merge" on a pull request. It is indirect. It feels like you are adding a middleman.
But here is what happens after a week: you realize the middleman can do 80% of the work. And the 20% you are doing (reviewing, giving feedback, making decisions) is the work that you would have done with any apprentice. But you are not fixing typos, you are not formatting tables, you are not wrestling with matplotlib's axis labels. You are reading the output and deciding if it is good and trustworthy enough.
Coming next
This post covered the basics: one repo, one project, Claude Code on the Web doing the work. The whole secret is that now the chatbots can write down what they have done, and look up the notes next time you start working together. And it is ridiculously powerful.
But this is just the beginning.
In upcoming posts, I will describe my "master repo, satellite repos" setup, where I maintain a central task management repository that coordinates work across multiple projects with different collaborators. Think of it as the command center. I will also walk through my MCP (Model Context Protocol) configuration for integrating Gmail and Google Calendar directly into Claude Code, so the agent can check my schedule, draft emails, and coordinate meetings as part of its workflow.
Beyond that: deploying resources on Google Cloud, spinning up virtual machines for heavy computation, and the "council of LLMs" approach where Claude, Gemini, and GPT deliberate together on evaluation tasks (something I have been using for grading oral exams and am now extending to research).
At some point (in the not so distant future, probably by the end of March or so) Claude will be scheduling my meetings, answering my emails, and assigning me tasks from my own task list. I am not entirely sure who is managing whom anymore.