10 June 2026

Can AI Read Your Site? The robots.txt Mistake Hiding You From ChatGPT

Most "we're invisible in ChatGPT" problems are an access problem, not a content one. Here's the robots.txt mistake that blocks AI crawlers — and how to check in two minutes.

By AEO Team

Most "we're invisible in ChatGPT" problems aren't a content problem — they're an access problem. Before an AI engine can recommend your business, its crawler has to be able to fetch your pages, and a single stray line in a config file can quietly lock every one of them out. This is the most common own-goal in answer-engine optimisation (AEO), and it's also the easiest to fix.

The three preconditions for ever being cited

An AI engine can only cite a page that is crawlable, indexable, and extractable — fail any one and you're invisible. These three conditions stack in order, and most "AI can't find us" cases fail at the very first step.

Crawlable — the AI platform's bot is allowed to fetch the page (not blocked by robots.txt, a firewall, or a login wall).
Indexable — the page is actually in the index that the engine searches against. ChatGPT search leans on a Bing-based index; Claude retrieves through Brave Search; Perplexity blends its own index with results from Google and others. If a page never enters those indexes, it can't surface.
Extractable — the answer text is present in the raw HTML the bot receives, not painted in later by JavaScript. AI systems pull out passages, not whole pages, so the relevant sentences have to be readable on first fetch.

Get all three right and you're a candidate for citation. Miss one — usually the first — and nothing else you do for AEO matters.
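The "extractable" condition can be spot-checked offline. The sketch below is a rough approximation, assuming a fetcher that reads only the initial HTML and never runs scripts; the sample pages, the class name VisibleText and the function is_extractable are made up for illustration and are not how any particular engine works.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that is present in the raw HTML, ignoring script/style bodies."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def is_extractable(raw_html: str, key_sentence: str) -> bool:
    """True if key_sentence appears in the text of the un-rendered HTML."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    needle = " ".join(key_sentence.split()).lower()
    return needle in text


# Hypothetical pages: one server-rendered, one a JavaScript shell.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>Plans start at $20 per month.</p></body></html>"
SPA_SHELL = '<html><body><div id="root"></div><script>render("Plans start at $20 per month.")</script></body></html>'

print(is_extractable(SERVER_RENDERED, "Plans start at $20 per month"))
print(is_extractable(SPA_SHELL, "Plans start at $20 per month"))
```

The check is deliberately crude: it only looks for one sentence you care about, and text split across inline tags may need a looser match.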

Which AI crawlers to allow

To be eligible for citation on a given platform, you must allow that platform's search and user-fetch bots in robots.txt. Blocking a bot is an explicit instruction that the platform behind it cannot read — and therefore cannot cite — your site. Each major engine sends a recognisable user-agent, and several run more than one bot for different jobs.

AI user-agent	Platform it powers	Purpose
GPTBot	OpenAI / ChatGPT	Collects content that may be used to train OpenAI's models
OAI-SearchBot	OpenAI / ChatGPT	Builds the ChatGPT search index
ChatGPT-User	OpenAI / ChatGPT	Live fetch when a user's question triggers a browse
Google-Extended	Google Gemini	Control token (not a separate crawler) for use of your content by Gemini models; AI Overviews follow Googlebot
PerplexityBot	Perplexity	Indexes pages for Perplexity answers
ClaudeBot	Anthropic / Claude	Anthropic's crawler
anthropic-ai	Anthropic / Claude	Legacy Anthropic user-agent
Bingbot	Microsoft Copilot (and ChatGPT's index)	Powers Copilot and feeds the Bing-based index ChatGPT uses

A subtle but important point: the crawl-versus-train distinction differs by vendor. OpenAI documents GPTBot as its training-data crawler, while OAI-SearchBot decides whether your pages can surface in ChatGPT search and ChatGPT-User handles fetches triggered by a user's request. Blocking GPTBot therefore opts you out of training without, on its own, removing you from ChatGPT search; blocking OAI-SearchBot is what does that. Other vendors split or combine these jobs differently, so treat each bot as a separate switch.

The one bot you can safely block

CCBot (Common Crawl) can be blocked without losing any AI-search citations, because it is a training-data crawler, not a live retrieval bot. Common Crawl builds a public dataset that many models train on, but no answer engine fetches your page from CCBot at query time. If your only goal is to be recommended by AI search — not to feed model training — disallowing CCBot costs you nothing in citations.

A copy-pasteable robots.txt that allows AI search

Drop this at https://yourdomain.com/robots.txt. It explicitly welcomes the bots that drive AI-search citations, blocks the training-only Common Crawl bot, and keeps a sane default for everyone else. Adjust the disallowed paths to suit your site.

# Allow AI search + answer engines to read and cite the site

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bingbot
Allow: /

# Training-only crawler — safe to block without losing AI-search citations
User-agent: CCBot
Disallow: /

# Everyone else: default crawl, keep admin and cart out
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

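Two things to check after you deploy it. First, look for conflicting rules. Under the Robots Exclusion Protocol (RFC 9309) a crawler obeys only the most specific group that matches its user-agent, so the named blocks above take precedence over User-agent: * wherever they sit in the file. Not every parser is that careful, though, so don't leave a stray Disallow: / in the global group. The flip side is that a named group does not inherit from the * group: as written, GPTBot and the other named bots may also fetch /admin/ and /cart/, so repeat those Disallow lines inside each named group if you want them kept out. Second, confirm the file returns HTTP 200 and is plain text. A robots.txt that 404s or redirects to HTML is generally treated as "no rules", which is usually fine, but one that returns a 5xx error is treated by compliant crawlers as a full disallow, and some bots back off entirely.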
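If you'd rather test the rules than read them, Python's standard-library urllib.robotparser can evaluate a robots.txt against named user-agents. This is a minimal sketch using a shortened copy of the file above; SomeOtherBot and the /pricing and /admin/users paths are placeholders. Note that this parser applies the first matching rule in a group, which is not identical to the longest-match behaviour in RFC 9309, so treat the result as a sanity check rather than proof of what a given crawler will do.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the sample file; paste your own robots.txt here.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /
"""


def access_report(robots_txt, agents, url):
    """Map each user-agent to True/False: may it fetch url under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "OAI-SearchBot", "CCBot", "SomeOtherBot"]
report = access_report(ROBOTS_TXT, AGENTS, "https://yourdomain.com/pricing")
for agent, allowed in report.items():
    print(f"{agent:15} {'allowed' if allowed else 'BLOCKED'}")
```

With this input only CCBot comes back blocked for /pricing. Running the same report against /admin/users shows SomeOtherBot blocked but GPTBot allowed, because GPTBot matches its own group rather than the * group.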

The four ways sites accidentally block AI

Most accidental blocks come from one of four places: robots.txt rules, CDN or WAF bot filtering, JavaScript-only rendering, or login walls. Robots.txt is the most visible, but it's often not the culprit.

robots.txt rules — a Disallow: / under User-agent: *, or a specific Disallow for GPTBot or PerplexityBot left over from a 2024 "block the AI scrapers" wave. Many sites added these to resist training and never realised they were also opting out of AI search.
CDN / WAF bot blocking — Cloudflare, Fastly, AWS WAF and similar services ship "block AI bots" toggles and managed bot rules that drop AI user-agents at the edge, before your server or robots.txt is ever consulted. The page looks fine in your browser but returns 403 to GPTBot. This is the sneakiest failure because nothing in your own codebase changed.
JavaScript-only rendering — if your content is rendered client-side and the raw HTML is nearly empty, many AI fetchers receive a blank shell. They extract passages from what's in the initial response, so a single-page app that hydrates content after load can be technically "allowed" yet have nothing to read.
Login or paywall walls — content behind authentication, an email gate, or an aggressive cookie/consent interstitial is unreadable to bots that don't log in. If a human needs to click "accept" or sign in to see the text, so does the crawler — and it can't.
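An edge-level block tends to show up as a status-code mismatch: a browser-like request gets 200 while the same URL answers 403 or 429 to an AI user-agent. The sketch below compares the two, assuming bare user-agent tokens (real crawlers send longer strings) and helper names (status_for, edge_block_suspects) invented for this example. Many WAFs also verify the crawler's IP range, so a spoofed request from your own machine can pass or fail differently from the real bot; treat a mismatch as a hint to check your CDN settings and logs.

```python
import urllib.error
import urllib.request

# Simplified user-agent tokens (an assumption); real crawlers send longer strings.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/126.0",
    "GPTBot": "GPTBot",
    "OAI-SearchBot": "OAI-SearchBot",
    "PerplexityBot": "PerplexityBot",
    "ClaudeBot": "ClaudeBot",
}


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status the server sends when asked with this user-agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises on 4xx/5xx; the code is what we want


def edge_block_suspects(statuses, baseline="browser"):
    """Names that got an error status while the baseline request succeeded."""
    if statuses.get(baseline) != 200:
        return []  # the page is failing for everyone, so this is not a bot-specific block
    return sorted(
        name for name, code in statuses.items()
        if name != baseline and code >= 400
    )


# Live usage (makes network requests, so it is left commented out):
# statuses = {name: status_for("https://yourdomain.com/", ua)
#             for name, ua in USER_AGENTS.items()}
# print(edge_block_suspects(statuses))
```

Keeping the comparison in its own function means you can also feed it status codes pulled from logs instead of live requests.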

How to check whether AI can read your site

Work through these in order; most problems reveal themselves in the first two steps.

1. Read your own robots.txt. Open https://yourdomain.com/robots.txt in a browser. Look for any Disallow: / under User-agent: *, and for named rules targeting GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot or Google-Extended. If you see them, that platform has been told not to read you.
2. Check server logs or analytics for AI user-agents. Search your access logs for GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot and ClaudeBot. If they appear with 200 responses, you're being crawled. If they appear with 403 or 429, something (usually a CDN or WAF) is blocking them at the edge.
3. Test that a page renders without JavaScript. Disable JavaScript in your browser (or run curl -A "GPTBot" https://yourdomain.com/your-page and read the raw HTML). If the main content is missing, AI fetchers are seeing the same emptiness. Server-render or pre-render the pages that matter.
4. Confirm there's no login or consent wall on key pages. Open an important page in a private window with no cookies. If you're forced to sign in or dismiss a blocking interstitial before the content appears, a bot will hit the same wall.
5. Run an automated AEO audit. An AEO audit checks AI crawler access across all the major bots in one pass and flags blocks you'd otherwise have to hunt through logs to find, which makes it a quick way to confirm steps 1 through 4 at once.
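The log check can be scripted in a few lines. This sketch assumes your server writes the common "combined" access-log format; the sample lines, IP addresses and user-agent strings are invented for illustration, so adjust the pattern if your format differs.

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot")

# Combined format: ... "METHOD path PROTO" status bytes "referer" "user-agent"
LINE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def tally_ai_hits(lines):
    """Count requests per (bot, status) for lines whose user-agent names an AI bot."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # not in the expected format; skip rather than guess
        ua = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[(bot, int(match.group("status")))] += 1
                break
    return counts


# Invented sample lines (documentation IP range, simplified user-agents).
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Jun/2026:09:15:02 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [10/Jun/2026:09:15:04 +0000] "GET /about HTTP/1.1" 200 4210 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [10/Jun/2026:09:16:11 +0000] "GET /pricing HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.4 - - [10/Jun/2026:09:17:30 +0000] "GET / HTTP/1.1" 200 9001 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0"',
]

counts = tally_ai_hits(SAMPLE_LOG)
for (bot, status), hits in sorted(counts.items()):
    print(f"{bot:15} {status} x{hits}")
```

A bot that appears only with 403 or 429 points at edge blocking, while a bot that never appears at all is worth checking against robots.txt first.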

The emerging llms.txt idea

llms.txt is a proposed plain-text file at your site root that points AI systems to your most important, clean content — but it is not yet a replacement for getting crawler access right. The idea borrows from robots.txt: a Markdown file at /llms.txt lists your key pages and a short description of each, giving models a curated map instead of forcing them to infer structure from messy HTML.

It's worth knowing about and cheap to add, but keep expectations realistic. Adoption across the major engines is still uneven, and no AI platform will cite a page it can't crawl in the first place. Treat llms.txt as a helpful signal layered on top of solid fundamentals — crawlable, indexable, extractable — not as a substitute for them.
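For a sense of what the file looks like, here is a small generator sketch. It assumes the layout described in the public llms.txt proposal as commonly summarised (a title line, a one-line summary as a blockquote, then sections of Markdown links with short descriptions); the site name, URLs and descriptions are placeholders, and build_llms_txt is a hypothetical helper, not part of any tool.

```python
def build_llms_txt(site_name, summary, sections):
    """Render an llms.txt-style Markdown document.

    sections maps a heading to a list of (title, url, description) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, description in pages:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content for illustration only.
llms_txt = build_llms_txt(
    "Example Co",
    "Example Co sells scheduling software for small clinics.",
    {
        "Key pages": [
            ("Pricing", "https://yourdomain.com/pricing", "Plans and what each includes"),
            ("FAQ", "https://yourdomain.com/faq", "Answers to common pre-sales questions"),
        ],
    },
)
print(llms_txt)
```

Save the output as /llms.txt at your site root. Because the file only lists URLs, every page it names still has to pass the crawlable, indexable and extractable checks.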

Frequently asked questions

Does blocking GPTBot stop ChatGPT from citing my site?

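Not by itself. OpenAI documents GPTBot as its training crawler; ChatGPT search visibility is governed by OAI-SearchBot, and user-triggered fetches by ChatGPT-User. A Disallow rule for GPTBot opts you out of training, whereas a Disallow for OAI-SearchBot, or a blanket Disallow: / under User-agent: *, is what removes you from ChatGPT search. If you want to stay out of training but remain citable, block GPTBot and allow OAI-SearchBot and ChatGPT-User explicitly rather than blocking everything by reflex.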

Can I block AI training but still get cited in AI search?

Largely, yes. Allow the search and user-fetch bots (OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot and Bingbot) so engines can retrieve and cite your pages, while blocking training crawlers such as GPTBot and CCBot (Common Crawl). The one nuance is that not every vendor separates the two jobs as cleanly as OpenAI does, so check each platform's crawler documentation before you block a bot.

Why can AI see my competitor but not me?

The usual reason is access, not content quality. A CDN or WAF rule may be returning 403 to AI user-agents, a leftover robots.txt Disallow may be blocking them, or your pages may render content only via JavaScript that the bot never sees. Check your robots.txt and server logs for AI bots first — the fix is often a single line.

Is llms.txt required to get cited by AI?

No. llms.txt is an optional, still-emerging convention that can help models find your best content, but it is not required and is not yet honoured consistently across engines. The non-negotiable basics are being crawlable, indexable and extractable. Get those right first; add llms.txt as a bonus.

The two-minute check that's worth your time

Blocking AI crawlers is the most common and most costly invisible mistake in AEO — costly because you never see an error, you simply never appear. The good news is that it takes about two minutes to rule out: open your robots.txt, scan your logs for AI user-agents, and load a key page with JavaScript off. If all three look clean, you've cleared the hardest hurdle to getting recommended by ChatGPT, Gemini, Claude and Perplexity — and everything else you do for AEO can actually count.
