Apr 17, 2026
Do AI coding agents actually read your docs?
We tested whether AI coding agents actually read API documentation before writing code. Explore the results and what they reveal about AI agent behavior.
Author:
Tal Gluck
This is the first post in a series about TomatoPy, a benchmark we built to test how AI coding agents use API documentation. This post covers the experiment design and the Phase 1 baseline results.
Working at GitBook, I spend a lot of time thinking about documentation — who reads it, how it’s structured, what makes it useful. Until recently, “who reads it” meant humans. But that assumption is changing pretty fast.
We just wrapped the 2026 State of Docs report, and as part of the research we talked with many technical writers, developer experience teams, and documentation practitioners about what’s actually happening in the field.
This year, AI and documentation consumption became its own section — because it kept coming up in these conversations. People are thinking seriously about whether documentation written for humans still works when the reader is an agent, and what needs to change.
One of the people interviewed in the report is Dachary Carey, a programmer writer at MongoDB who has been doing rigorous empirical work to uncover how AI agents actually interact with documentation. Her Agent Reading Test and Agent-Friendly Documentation Spec are built around a precise question: when an agent fetches a documentation page, does it actually receive the content — or does truncation, client-side rendering, or summarization swallow it before it reaches the model? In our conversation, we talked a lot about how agents don’t reliably read the things you’ve designed for them to read, like llms.txt or the Markdown versions of your pages.
TomatoPy, the experiment outlined here, builds on that.
Knowing that your docs are structurally readable by agents is useful. But it doesn’t tell you whether an agent will navigate to the right page, or whether reaching that page produces better code.
Developers increasingly get their first contact with an API through an AI coding agent. They describe what they want to build, the agent reads the docs, and code appears. Whether the code is correct depends — in theory — on whether the agent actually understood the documentation.
But we started wondering: is that what’s actually happening? Are these agents reading the docs carefully, or are they doing something else that happens to produce working code anyway?
We couldn’t find a good answer to that specific question. The benchmarks we found tested models against known APIs, which means training data contamination is a real confound — it’s not clear whether docs for a well-known API are being read live by the agent, or whether they were already baked into the model during training.
The work of Dachary and others in the field has focused on whether agents successfully receive documentation — whether pipelines deliver content, whether pages render correctly, whether the right sections survive summarization.
But we wanted to measure whether agents navigate to the right documentation in the first place, and whether what they read changes the quality of what they build. So we built something to find out.
The setup: a pizza API that agents haven’t heard of
If you already know what TomatoPy is and want to skip the experiment design, jump to the results →
The core problem with benchmarking agents against real APIs is that the model has almost certainly seen the API in training data. If Claude Code successfully calls the Stripe API, you can’t know whether it read the docs or just remembered the call signature. You need an API nobody has seen before.
PizzaStack is a fictional pizza-making HTTP API published under the TomatoPy brand at tomatopy.pizza/docs. It’s deliberately realistic in construction — proper authentication headers, typed request bodies, a session tracking model — but the domain is absurd enough that no model arrives with assumptions about how it’s supposed to work. As far as we know, there’s no training data for a pizza-sauce-simmering endpoint.
The API lives at api.tomatopy.pizza and the documentation lives at tomatopy.pizza/docs, hosted on GitBook. The task given to each agent is always the same:
You are a developer using the PizzaStack API by TomatoPy for the first time. Documentation is available at https://tomatopy.pizza/docs. Write and execute a Python script that acquires a San Marzano tomato, prepares it for cooking, simmers it into a sauce, creates a Neapolitan pizza base, assembles the pizza with at least one topping, and bakes it.
The agent has to read the docs, write code, and execute it against the live API. The task has six steps, all clearly documented.
The API server scores each completed run on sixteen criteria — temperature, duration, ingredient quality, sauce consistency, assembly correctness, and so on. A perfect run scores 16/16. Scoring is server-side and silent: the agent sees whether individual API calls succeed or fail, but doesn’t receive an overall quality score mid-run. This matters — silent quality degradation is more realistic than hard errors, and it means an agent can’t easily optimize toward a visible target.
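The post doesn’t publish the rubric, but the shape of a silent server-side scorer is easy to sketch. Criterion names and thresholds below are illustrative assumptions rather than the real 16-item checklist — the 1–3 minute bake window at 450°C is the one rule this post actually describes:

```python
def score_run(run):
    """Silently score a completed run against a checklist.

    The agent only ever sees per-call success or failure; this overall
    score stays server-side. Criterion names and thresholds here are
    illustrative assumptions, not the real TomatoPy rubric.
    """
    checks = {
        "bake_temp_ok": run.get("bake_temp_c") == 450,
        "bake_time_ok": 1 <= run.get("bake_minutes", 0) <= 3,
        "tomato_variety_ok": run.get("tomato") == "san_marzano",
        "has_topping": len(run.get("toppings", [])) >= 1,
    }
    return sum(checks.values()), len(checks)

# An otherwise-perfect run with an 8-minute bake drops one point,
# mirroring the 15/16 Claude Code run described below:
score, total = score_run({
    "bake_temp_c": 450,
    "bake_minutes": 8,
    "tomato": "san_marzano",
    "toppings": ["basil"],
})
```

The point of the sketch is the interface, not the rubric: the function returns a score the agent never sees, so there is no visible target to optimize toward mid-run.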
Session IDs tie together the API logs and the docs fetch logs, so we can correlate what each agent read with what it built.
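That correlation step amounts to a join on session ID. The log record shapes below are assumptions — the harness’s real schema isn’t shown in this post:

```python
def sessions_summary(api_log, docs_log):
    """Join API-call logs with docs-fetch logs by session ID.

    Returns, per session, which docs pages were fetched and whether the
    session ever hit a 422. Record field names are assumed, not the
    real TomatoPy log schema.
    """
    sessions = {}
    for rec in docs_log:
        s = sessions.setdefault(rec["session_id"], {"pages": [], "hit_422": False})
        s["pages"].append(rec["page"])
    for rec in api_log:
        s = sessions.setdefault(rec["session_id"], {"pages": [], "hit_422": False})
        if rec["status"] == 422:
            s["hit_422"] = True
    return sessions

# Toy logs mirroring the Phase 1 split: the session that read only the
# quick-start hit a 422; the one that read the reference did not.
summary = sessions_summary(
    api_log=[
        {"session_id": "run-1", "endpoint": "/pizza/assemble", "status": 422},
        {"session_id": "run-1", "endpoint": "/pizza/assemble", "status": 200},
        {"session_id": "run-2", "endpoint": "/pizza/assemble", "status": 200},
    ],
    docs_log=[
        {"session_id": "run-1", "page": "/docs/quickstart"},
        {"session_id": "run-2", "page": "/docs/api-reference"},
    ],
)
```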
What the docs contain
The documentation has two main layers. The quick-start guide walks through the full pipeline with code examples — it’s good documentation for a developer picking up the API for the first time. The full API reference contains the complete schema for every endpoint, including fields and constraints that don’t appear in the quick-start.
One of those fields — distribution, required on each topping in /pizza/assemble — is in the reference but absent from the quick-start. An agent that reads only the quick-start will hit a 422 when it tries to assemble the pizza. An agent that reads the reference first won’t. This isn’t a trick; it’s a realistic documentation gap of the kind that exists in most real API docs. It’s also a clean behavioral probe: whether an agent hits that 422 is a direct signal of which documentation it read before executing.
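The mechanics of the probe can be mimicked in a few lines of validation. Everything here except the `distribution` field name is an assumption about PizzaStack’s schema, and the error shape loosely follows common JSON validation responses rather than the real payload:

```python
# `distribution` appears only in the API reference, not the quick-start.
TOPPING_REQUIRED = {"name", "distribution"}

def validate_assemble(payload):
    """Mimic the server-side check behind the 422 on /pizza/assemble.

    Returns (status, errors). The error shape is an assumption modeled
    on common JSON validation responses, not PizzaStack's actual payload.
    """
    errors = []
    for i, topping in enumerate(payload.get("toppings", [])):
        for field in sorted(TOPPING_REQUIRED - topping.keys()):
            errors.append({"loc": ["toppings", i, field], "msg": "field required"})
    return (422, errors) if errors else (200, [])

# A quick-start-only agent omits `distribution` and gets a 422 naming the field:
status, errors = validate_assemble({"toppings": [{"name": "basil"}]})
```

The informative part is the error body: it names the missing field and where it belongs, which is exactly what made recovery-by-error-message viable in the runs below.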
Phase 1: three agents, nine baseline runs
Phase 1 was a baseline — no special prompt instructions, no MCP tooling, just the standard task prompt and access to the docs site over the web. We ran three agents, three sessions each:
Claude Code (claude-sonnet-4-6)
Cursor (claude-sonnet-4-6)
Codex (gpt-5.4)
All nine runs completed successfully. Eight scored 16/16. One Claude Code run scored 15/16 due to an 8-minute bake time — the scorer expects 1–3 minutes at 450°C. From a task-completion standpoint, this is a clean sweep.
But from a behavioral standpoint, the picture is much more nuanced.
What we found
Claude Code and Cursor: fetch once, execute, recover
Both Claude Code and Cursor completed every run successfully. Because they’re running the same underlying model (Sonnet 4.6), the differences in how they got there are worth noting — same model, different harness.
The shared pattern: fetch the quick-start guide, write code, execute, hit the 422 on /pizza/assemble, read the error response, add the distribution field, run again. No run fetched the API reference page before executing. The error message was informative enough that no additional documentation was needed — the API told them what was missing.
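That recover-from-the-error loop looks roughly like the sketch below, which patches the request using the field locations the 422 payload reports. The error shape and the default values are illustrative assumptions, not the agents’ actual code:

```python
def run_with_recovery(call, payload, defaults, max_retries=3):
    """Execute-then-recover pattern observed in the Claude Code and Cursor runs.

    `call(payload)` returns (status, errors); on a 422, add each missing
    field named in the error payload and retry. The error shape and the
    `defaults` values are illustrative assumptions.
    """
    status, errors = call(payload)
    for _ in range(max_retries):
        if status != 422:
            break
        for err in errors:
            *path, field = err["loc"]  # e.g. ["toppings", 0, "distribution"]
            target = payload
            for key in path:
                target = target[key]
            target[field] = defaults[field]
        status, errors = call(payload)
    return status, payload

# Fake endpoint that rejects toppings lacking a `distribution` field:
def fake_assemble(payload):
    missing = [
        {"loc": ["toppings", i, "distribution"], "msg": "field required"}
        for i, t in enumerate(payload["toppings"])
        if "distribution" not in t
    ]
    return (422, missing) if missing else (200, [])

status, fixed = run_with_recovery(
    fake_assemble,
    {"toppings": [{"name": "basil"}]},
    defaults={"distribution": "even"},
)
```

The loop only works because the API tells the agent what was missing; with vaguer errors, the same pattern would stall.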
Docs behavior varied little across these runs, with a couple of exceptions. One Cursor run went back to fetch a second page after the 422 — /docs/core-modules/pizza-creation-system — looking for the distribution field. That page doesn’t document it either, a fact the agent noted explicitly in the transcript. It fixed the issue from the error message anyway.
The rest of the variation between runs — a Cursor run needing two correction cycles after guessing an invalid enum value, and differences in how scripts were written and edited — reflects how each harness structures execution more than anything about how the agents read documentation.
There was also a Claude Code outlier that’s worth calling out. One run spawned a subagent — running Haiku — whose sole job was to retrieve and summarize the documentation. The subagent fetched the quick-start, returned a summary, and the main agent coded from that.
The main agent then made five more WebFetch calls trying to find more documentation, hitting 404 on every one because it was guessing at paths that don’t exist (/docs/endpoints, /llms-full.txt, /sitemap.md, /docs/developer-documentation.md — that last one is one level above the actual reference page). It was actively trying to read more, couldn’t find it, and executed anyway — hitting 422s at every endpoint and recovering from each one through error messages. It still scored 16/16.
That run had the highest fetch count of any session and the most error recovery cycles. More attempts to find the docs, same outcome. What mattered wasn’t how many pages were fetched but whether the right page was fetched — and none of these runs found it.
Codex: read first, then write
All three Codex runs looked different from the start. Before writing a line of code, each run did three things: inspected the workspace (pwd, rg --files, cat requirements.txt), then fetched the documentation in parallel via curl, then wrote the script.
The fetches weren’t random. Every Codex run pulled the full API reference page — the complete OpenAPI spec containing every endpoint’s required fields, types, and enum constraints — along with the module-level guides. One run fetched the sitemap XML first to enumerate what was available before deciding what to read. Codex read the machine-readable reference; Claude Code and Cursor read the human-readable quick-start. The same information was available to all of them on the same docs site.
Result: zero 422 errors on /pizza/assemble across all three Codex runs. The distribution field appeared in the initial script, with values drawn from the enum in the reference docs. The pipeline ran clean on the first execution.
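Reading the machine-readable reference first lets an agent pull required fields and enum values before its first request. This sketch extracts both from an OpenAPI-style fragment; the fragment’s structure and the enum values are guesses at PizzaStack’s spec, not the real thing:

```python
# A guessed fragment of what the PizzaStack reference schema might look like;
# field names and enum values here are assumptions for illustration.
SPEC_FRAGMENT = {
    "toppingSchema": {
        "required": ["name", "distribution"],
        "properties": {
            "name": {"type": "string"},
            "distribution": {"type": "string", "enum": ["even", "center", "ring"]},
        },
    }
}

def constraints(schema):
    """Collect required field names and enum constraints from a JSON schema."""
    enums = {
        field: spec["enum"]
        for field, spec in schema["properties"].items()
        if "enum" in spec
    }
    return set(schema["required"]), enums

required, enums = constraints(SPEC_FRAGMENT["toppingSchema"])
```

An agent that runs this kind of extraction before writing code arrives at the first request already knowing that `distribution` is required and which values are legal — which is the behavioral difference the Codex runs showed.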
The Codex planning phase isn’t directly observable from the transcripts. But the observable behavior — workspace inspection before documentation retrieval, systematic fetching across multiple reference pages before writing any code — is consistent across all three runs without variation.
The number of fetches doesn’t predict the outcome
The most counterintuitive thing in the data: fetch count is a poor proxy for docs comprehension. The Claude Code subagent run had the highest fetch count of any session — six total, including the subagent — and still hit the 422. The simplest Codex run had three fetches and avoided it entirely.
What predicts the outcome is which page was fetched, not how many. The API reference page was fetched before first execution in 3/3 Codex runs and 0/6 Claude Code and Cursor runs. That’s the clean split.
| Agent | Docs fetches (range) | Hit 422 on assemble | Fetched reference first |
|---|---|---|---|
| Claude Code | 1–6 | 3/3 | 0/3 |
| Cursor | 1–2 | 3/3 | 0/3 |
| Codex | 3–4 | 0/3 | 3/3 |
No agent used MCP for documentation access — by design. All three agents fetched documentation over the open web using WebFetch or curl. One of the questions Phase 2 of this experiment will aim to answer is whether giving agents a structured tool for retrieving documentation (like an MCP server) changes how they navigate it.
What this doesn’t tell us yet
All nine runs completed the task. The front-load-and-recover path worked — every single time. So the natural question is: does it matter?
We think it does, but Phase 1 data alone doesn’t settle it. There are a few things we’re not claiming here:
This isn’t about one agent being better than another. Claude Code and Cursor are both running claude-sonnet-4-6. The behavioral difference likely comes from how each platform structures the agent loop, not from the underlying model. Codex’s workspace-inspection-then-read pattern looks like a deliberate planning step that’s part of how that agent system is designed. Whether that generalizes to harder tasks is a separate question.
The API error response was highly informative — more informative, in this case, than the documentation itself. The quick-start guide didn’t document the distribution field. The 422 payload did: field name, location, and implicitly, the fact that it was required. For the agents that hit it, the live API became the effective reference. In a real API, error messages might be much less helpful — or absent entirely — which would close the recovery path these agents relied on.
Successful task completion isn’t the only measure. One Claude Code run worked through 422 errors at every endpoint. It still scored 16/16, but it made many more round trips than it needed to and had a harder time along the way. There’s a version of this benchmark where the API is less forgiving — less informative errors, rate limits, stateful sessions that can’t be retried — and the recovery path closes.
What’s next
Phase 2 tests two interventions on Claude Code, which is the most interesting case given what we saw in Phase 1. The interventions are:
Read-first framing: a prompt that explicitly instructs the agent to read the documentation thoroughly before writing any code
MCP access: the GitBook MCP server connected, so the agent has a structured tool for retrieving documentation rather than raw WebFetch calls
The question is whether either changes the front-load-and-recover pattern. If the prompt changes behavior, that’s a finding about how task framing influences what gets read. If MCP changes behavior, that’s a finding about documentation delivery infrastructure. If neither changes anything, that’s the most interesting finding of all.
Phase 3 will be about implications — what Phase 1 and Phase 2 together mean for how to write and structure API documentation for agent consumers.
Stay tuned for the Phase 2 results. In the meantime, the docs are public at tomatopy.pizza/docs. The API requires authentication to keep the experiment environment stable, but the full benchmark — API, harness, and analysis scripts — is on GitHub if you want to run it yourself.