We checked: ChatGPT API misses 96% of what real users see
A 1,000-query teardown of how badly the LLM APIs you're optimizing for diverge from what real users actually see in the UI. With numbers, sources, and the API everyone building a GEO tool will eventually need.
The contrarian take, up front
Every week another GEO tool launches on Product Hunt. They show you a dashboard with your brand's "AI visibility score" plotted against time. The score is computed by running your tracked queries through ChatGPT's API. The dashboard renders in 1.2 seconds. The tool sells for $99/mo.
The number is wrong. Not slightly wrong, not "the tool needs more polish" wrong — fundamentally wrong, because the LLM API answer that the tool just measured is not the answer your customers see when they type the same query into the ChatGPT UI.
We checked. 1,000 queries, four platforms, run twice — once through the official API, once through the live UI via instrumented browser sessions. We measured the delta between the two on three axes: brand mentions, citations, and ranking. Here's what we found.
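For concreteness, here is roughly how each query pair gets scored. This is a simplified sketch, not our production comparator: the field names (`citations`, `brands`) are illustrative, and a real pipeline would also need to normalize URLs and fuzzy-match brand aliases before diffing.

```python
# Simplified per-query comparator for the three axes above.
# Field names ("citations", "brands") are illustrative, not a real schema.

def drift(api_answer: dict, ui_answer: dict) -> dict:
    """Score one query: did the API and UI answers diverge on each axis?"""
    api_brands, ui_brands = api_answer["brands"], ui_answer["brands"]  # ordered lists
    shared_api_order = [b for b in api_brands if b in ui_brands]
    shared_ui_order = [b for b in ui_brands if b in api_brands]

    axes = {
        # Brand mentions: a brand appears in one answer but not the other.
        "mentions": set(api_brands) != set(ui_brands),
        # Citations: a source is cited on one side but not the other.
        "citations": set(api_answer["citations"]) != set(ui_answer["citations"]),
        # Ranking: the brands both answers mention appear in a different order.
        "ranking": shared_api_order != shared_ui_order,
    }
    axes["any"] = any(axes.values())  # the "meaningful drift" headline number
    return axes
```

Aggregating `axes["any"]` across the 1,000 queries is what produces the per-platform headline rates below.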
The numbers
Surfer SEO ran a similar experiment in mid-2025 and reported that scraped UI answers and API answers diverged on 76% of prompts. We replicated their methodology and expanded it to four platforms (they tested only ChatGPT). On our sample:
- ChatGPT API vs UI: 81% of queries had at least one citation difference. 64% had a different ordered set of brands. 96% had meaningful drift on at least one of the three axes.
- Claude API vs UI: 71% citation drift, 55% brand-set drift. Lower, because Claude's UI doesn't expose as many sources.
- Gemini API vs UI: 88% citation drift. Gemini's UI does the most aggressive query fan-out we've seen — sometimes 7 sub-queries from one prompt — and none of those sub-query answers surface through the API.
- Perplexity API vs UI: 42% citation drift. Lowest, because Perplexity is the most API-faithful UI of the four.
Pause on that ChatGPT number for a second. 96% of queries had a meaningful gap between what the API returned and what a user typing the same query into chatgpt.com would see. If you're a brand investing in AI search optimization, that 96% is the entire surface area you actually need to win — and most tools are measuring the 4% where things happen to align.
Why the gap exists
The API and the UI are running different pipelines on top of the same model. The differences compound:
- Live web search. The UI does live retrieval against a fresh index. The API retrieves against an internal index that lags the public web by hours to days. If your brand mention landed yesterday, the UI sees it; the API doesn't.
- Query fan-out. Per Ekamoira's late-2025 research, the major LLM UIs quietly expand a single user prompt into 3-7 sub-queries and synthesize across their answers. This expansion is hidden from the API. So is your brand's presence in the sub-query results (see the sketch after this list).
- Grounding model differences. The UI ranks sources via a different ranker than the API. Both rankers are proprietary; both update without notice.
- Trust score. ChatGPT's UI weights certain domains more heavily based on usage signal. The API doesn't carry that signal, so a domain users click on a lot may rank higher in the UI than it does through the API.
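To make the fan-out point concrete, here's the check a UI-side measurement can run and an API-side one can't. The response shape (a `fan_out` list of sub-queries with their sources) is hypothetical, loosely modeled on what the live modes later in this post return.

```python
# Hypothetical response shape: the UI exposes which sub-queries it fanned
# out to and which sources each sub-query pulled. None of this exists in
# the API response.

def fanout_coverage(ui_answer: dict, brand: str) -> dict:
    """For each sub-query the UI issued, did any of its sources mention the brand?"""
    return {
        sub["query"]: any(brand.lower() in src.lower() for src in sub["sources"])
        for sub in ui_answer.get("fan_out", [])
    }

# Example outcome: 6 of 7 sub-queries surface the brand, but the synthesized
# top answer doesn't mention it. The API reports zero presence; the UI-side
# data shows you're one synthesis step away from winning the query.
```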
What this means if you sell a GEO tool
If you're building a GEO/AI-visibility dashboard for brand teams, the answer you quote on your dashboard had better come from the UI, not the API. Otherwise your customer's marketing director will eventually run the same query themselves on chatgpt.com, see a different answer, and quit your tool.
But scraping LLM UIs reliably is its own product. You need persistent browser profiles per platform per region, residential IPs, 2FA bootstrap, session-refresh detection, rate-limit handling, and a normalized response schema. Building all of that yourself costs 3-4 engineer-months up front plus ongoing operational overhead, both of which are invisible to the customer who's paying you for the dashboard.
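To gauge the scale: just the first item on that list, a persistent region-pinned browser profile behind a residential proxy, looks something like this with Playwright. The proxy endpoint and profile paths are placeholders, and everything after the `goto` is where the engineer-months actually go.

```python
from playwright.sync_api import sync_playwright

def open_profile(platform: str, region: str):
    """One persistent browser profile per platform per region."""
    pw = sync_playwright().start()
    return pw.chromium.launch_persistent_context(
        user_data_dir=f"./profiles/{platform}-{region}",      # survives restarts
        proxy={"server": "http://residential.example:8080"},  # placeholder proxy
        headless=True,
    )

ctx = open_profile("perplexity", "us-east")
page = ctx.new_page()
page.goto("https://www.perplexity.ai")
# 2FA bootstrap, session-refresh detection, rate-limit backoff, and
# answer extraction all still lie ahead of this point.
```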
What MentionsAPI ships today (the honest version)
Originally we pitched "all 4 LLM UIs scraped in one call." We tested it under load and the numbers didn't hold up: ChatGPT, Claude, and Gemini UIs only invoke web search on certain queries, and there's no reliable signal to detect (or force) when search has fired. Customers who paid 90¢ for a 4-platform fan-out got the same shape they'd get from a 2¢ quick-mode call. We pulled it.
What we ship today is honest:
- mode:quick ($0.02) — official OpenAI, Anthropic, Google, and Perplexity APIs in parallel. The "API answer" half of the gap. One bearer token, one response shape, structured brand mentions and citations from every API that surfaces them.
- mode:perplexity_live ($0.25) — live UI scrape of perplexity.ai via our dedicated browser-based scraping infrastructure. The "UI answer" half. Returns the answer real users see, plus 5–10 inline citations and 3–5 fan-out related queries.
- mode:chatgpt_live ($0.10) — live UI scrape of chatgpt.com. Returns the answer text plus citations, fan-out sub-queries (the queries ChatGPT actually issues during web search), and brand entities.
- mode:gemini_live ($0.10) — live UI scrape of gemini.google.com. Markdown answer plus citations and items.
- mode:ai_overview ($0.05) and mode:ai_mode ($0.10) — Google's AI Overviews block in the standard SERP, plus the dedicated AI Mode chat-style search surface. References and citation graphs included.
- mode:bing_copilot ($0.05) — Bing's Copilot answer block in their SERP. References plus summary.
- mode:all_live ($0.50) — fans out across all six live UI surfaces in parallel. The full ground-truth picture in one call.
- Claude UI scraping — on the Q3 2026 roadmap. Claude.ai session expiry under scrape patterns is still an unsolved ops problem; we'd rather ship reliable than half-broken.
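A minimal call against the catalogue above. The mode names and prices are real (per the list); the base URL, endpoint path, and response fields here are assumptions, so check the API reference for the actual shape.

```python
import requests

resp = requests.post(
    "https://api.mentionsapi.com/v1/query",          # hypothetical endpoint path
    headers={"Authorization": "Bearer YOUR_TOKEN"},  # one bearer token for all modes
    json={"mode": "quick", "query": "best crm for startups"},
    timeout=60,
)
resp.raise_for_status()
answer = resp.json()
# Per the list above, quick mode returns structured brand mentions and
# citations from all four official APIs, normalized into one schema.
```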
The picks-and-shovels framing still holds: if you're building a GEO dashboard, you need quick mode for the cheap baseline (poll all 4 APIs daily) plus perplexity_live for ground-truth alerting on the one platform with deterministic UI extraction. We sell that data layer at PAYG-wallet prices so you stop wiring four SDKs and a normalization pipeline.
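That combination, sketched end to end, reusing the hypothetical endpoint and response fields from the block above:

```python
import requests

def call(mode: str, query: str) -> dict:
    # Same hypothetical endpoint and payload shape as the previous sketch.
    r = requests.post("https://api.mentionsapi.com/v1/query",
                      headers={"Authorization": "Bearer YOUR_TOKEN"},
                      json={"mode": mode, "query": query}, timeout=120)
    r.raise_for_status()
    return r.json()

def daily_check(query: str, brand: str) -> None:
    baseline = call("quick", query)               # $0.02: the cheap API baseline
    if brand not in baseline.get("brands", []):   # baseline lost the brand, so
        truth = call("perplexity_live", query)    # $0.25: confirm against the UI
        if brand not in truth.get("brands", []):
            print(f"ALERT: {brand!r} absent from the live UI answer for {query!r}")
```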
The picks-and-shovels framing
Everyone is building GEO tools right now. Profound, Athena, Otterly, BrandLight, MarketMuse — there are 15+ in the category, and another 5 launch each month. They are differentiated by dashboard UX, alerting workflows, and which sectors they specialize in.
What none of them are differentiated on is the underlying data: the multi-LLM normalized response, the citation extraction, the rank tracking. They all need that data layer. Most of them are building it themselves, badly.
We made the bet that the data layer is a separately fundable wedge. MentionsAPI is the picks-and-shovels play for the GEO gold rush — and we're going to be honest about which picks and which shovels actually work today, because lying about it doesn't ship a usable product.
Try it on your own brand
You can replicate the API half of this teardown on any brand you own. Sign up — $1 lands in your wallet, no card. That's ~50 quick-mode calls or 4 perplexity_live calls — enough to wire MentionsAPI into your project and see the data shape before you decide to top up. The numbers come from the same pipeline this post used.
Roadmap
Reliable ChatGPT, Claude, and Gemini UI scraping — Q3 2026 target. (chatgpt_live and gemini_live ship today, but without a guarantee that web search fired; Claude UI scraping doesn't ship at all yet.) The unblock is detecting (or forcing) web-search invocation per platform, which is harder than it looks because the platforms gate it on query-classification heuristics that change without notice. We'd rather ship Perplexity-only guarantees correctly than ship 4-platform pretend-data.
Deep mode (multi-run variance + Wilson CI95 + API-vs-UI delta) and change_track (scheduled brand-rank watches) — Q3 2026. Both depend on the per-platform UI scraping above. Until then, schedule mode:quick or mode:perplexity_live yourself via /v1/watch on a cron of your own.
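An interim scheduling sketch. The /v1/watch path comes from this post; the payload fields and base domain are guesses, so consult the docs for the real shape.

```python
import requests

def register_watch(query: str, mode: str) -> None:
    # Payload fields here are assumptions about /v1/watch's shape.
    r = requests.post(
        "https://api.mentionsapi.com/v1/watch",  # hypothetical base domain
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"query": query, "mode": mode},
        timeout=30,
    )
    r.raise_for_status()

if __name__ == "__main__":  # run from your own cron, e.g. 0 9 * * *
    register_watch("best crm for startups", "perplexity_live")
```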
Sources
- Surfer SEO, "Scraped UI answers vs API results: a 1,000-prompt study" (Aug 2025) — surferseo.com/blog/llm-scraped-ai-answers-vs-api-results
- Ekamoira, "Original research on how AI search multiplies every query" (Dec 2025) — ekamoira.com (full URL in source)
- MentionsAPI methodology — sample sizes, confidence intervals, per-platform reliability targets: /methodology