Technical SEO

Should You Block GPTBot, ClaudeBot, and PerplexityBot?

2026-05-08 · 13 min read
[Hero illustration: a robots.txt gate routing GPTBot, ClaudeBot, PerplexityBot, and Google-Extended into an Allow lane that flows through to a cited AI answer, and a Block lane that ends in a sealed marker; dashed connectors mark training crawlers, solid connectors mark retrieval crawlers.]

Six months ago a B2B SaaS client called us in a panic. Their head of compliance had pushed a robots.txt update that blocked every AI crawler the security team could name. It was framed as a content-protection measure. Six weeks later their inbound demo requests from "ChatGPT recommended you" had dropped to zero, their citations on Perplexity had vanished, and the founder was furious.

The compliance team had not done anything wrong, exactly. They had treated AI crawlers as a single category and blocked them on principle. Nobody had mapped the decision against revenue.

This is the post I wish that team had read first.

Whether to block GPTBot, ClaudeBot, PerplexityBot and friends is now a real strategic question for every brand that depends on organic discovery. The default answer is not obvious, and getting it wrong in either direction is expensive. We work this question through with clients on every AI SEO engagement, and the framework below is what we actually use.

The decision in one paragraph

If your brand depends on being discovered, mentioned, or cited by humans doing research, you almost certainly want most AI crawlers reading your site. If your brand depends on a paywall or proprietary content moat, you almost certainly want most of them blocked. The interesting cases are everything in between, and the answer for those is rarely "all" or "none". It is "selectively, with monitoring, and with an llms.txt file pointing the bots that are allowed in toward your best content".

That is the short version. The rest of this post is what to do once you accept that.

The asymmetry that matters. Blocking an AI crawler is reversible. Losing six months of citation and brand-mention compounding inside ChatGPT is not. AI training cycles run quarterly. If you opt out today and change your mind in eight weeks, you are still out of the next two model refreshes.

Why this question exists in the first place

Until 2023, robots.txt was a quiet file. Most marketing teams never touched it after the launch checklist. Then in August 2023, OpenAI shipped GPTBot and gave site owners an explicit way to say no. Anthropic, Google, Common Crawl, and Perplexity followed. By the end of 2024, the question "should we block AI crawlers" was on the table at every brand we worked with.

The arguments in the room are usually some mix of:

  • Content protection. "We do not want our content used to train models that compete with us."
  • Compliance. "Legal asked us to block them until we have a policy."
  • Bandwidth. "These bots are hammering our servers."
  • Attribution. "If they read our stuff and answer the question themselves, we lose the click."
  • Visibility. "Our buyers ask ChatGPT first now. If we are not in there, we are invisible."

Every one of those concerns is real. None of them is the whole picture. The right policy is the one that lets you optimise the things that drive your revenue, and most of the time that is visibility, not protection. We covered the strategic backdrop in "Ranking on Google but Missing on ChatGPT? Fix This Now", and this post is the technical companion to that.

The eight bots you actually need to think about

Most of the noise on this topic conflates training crawlers with retrieval crawlers, and conflates first-party AI assistants with third-party scrapers. The real list, with what each one does, looks like this.

| Bot | Operator | Purpose | Blocking impact |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training future GPT models | Excluded from next training rounds |
| OAI-SearchBot | OpenAI | Indexing for ChatGPT Search | No live ChatGPT Search citations |
| ChatGPT-User | OpenAI | User-triggered live fetch | No real-time answers from your site |
| ClaudeBot | Anthropic | Training future Claude models | Excluded from Anthropic training data |
| Google-Extended | Google | Gemini training plus AI Overviews | Removed from Gemini and AI Overviews; classic Google search unaffected |
| PerplexityBot | Perplexity | Indexing for Perplexity answers | No Perplexity index inclusion |
| Perplexity-User | Perplexity | Live retrieval for Perplexity queries | No live citation in Perplexity answers |
| CCBot | Common Crawl | Open-web dataset feeding most LLMs | Excluded from open datasets used by many smaller AI labs |

A clean robots.txt policy speaks to all eight. A lazy one blocks "AI crawlers" with a single rule and quietly rules your brand out of multiple ecosystems at once. We see the lazy version on roughly four out of every ten technical audits we run.

The training versus retrieval distinction

This is the single most important nuance in the entire conversation, and it is the one that most marketing teams get wrong.

[Diagram: your site's service pages, blog and insights, and case studies, as read by the two crawler types.]

  • Training crawlers (GPTBot · ClaudeBot · Google-Extended · CCBot): read once, every few months; your content shapes future model knowledge.
  • Retrieval crawlers (OAI-SearchBot · ChatGPT-User · Perplexity-User): read on demand, when a user asks; your page can be cited live with attribution.

Training crawlers read your content and feed it into the next training run of an LLM. The result is that the model "knows about you" the way it knows about Wikipedia entries it has read. Citations from this path are unattributed. The model has absorbed the information; it does not link back.

Retrieval crawlers read your content live, in response to a user query, and the answer is generated with explicit citations and links back to your page. The user can click through. This is the path that drives traffic.

The strategic implication is simple but missed constantly: blocking retrieval crawlers is almost always a mistake, while blocking training crawlers is a defensible policy choice. If you are running a content site that monetises through traffic, attribution, or lead generation, you want the retrieval bots in. The training bots are a separate conversation.

The four positions

Once you understand training versus retrieval, every reasonable robots.txt policy collapses into one of four positions.

Position 1: Allow All

You let every well-behaved AI crawler read everything. This is the right default for almost every consumer brand, ecommerce site, agency, SaaS company, and publisher that monetises through reach rather than paywalls. We use this for the majority of clients. Pair it with a strong llms.txt and you maximise your AI surface area without any defensive overhead.

Position 2: Block All

You disallow every AI bot you can identify. This is correct only when you are running a paywalled publication, holding genuinely proprietary research that you sell, or operating in a regulated space where AI use of your content creates legal exposure. The cost is total exclusion from AI search visibility for as long as the policy stands. We have only recommended this once in the last eighteen months, and it was for a financial intelligence service whose entire revenue model was selling that content.

Position 3: Selective

You allow retrieval bots and block training bots, or you allow some operators and block others. This is the most common position for brands with mixed content, where some pages are commercial and others are gated. The technical implementation is straightforward; the operational discipline of keeping the rules current is what kills most teams. Plan for a quarterly review.

Position 4: Allow with llms.txt steering

You allow everything but use an llms.txt file to tell AI systems which of your accessible pages are the canonical sources on each topic. This is the position we recommend for content-led brands. It maximises crawl access while doing the visibility work of pointing AI systems toward your best material.

[Decision tree: AI crawler access.]

  • Is your content paywalled? Yes: Position 2 (Block All AI bots). No: next question.
  • Do you depend on AI visibility? Yes: Position 4 (Allow + llms.txt steering). Not yet: Position 3 (Selective: retrieval yes, training no).

Position 1 (Allow All) is the implicit default if no rules are added.

The five questions to ask before you decide

Before any client team commits to a position, we walk them through these five questions in this order. The answers usually point at one of the four positions almost without further argument.

  1. Where does your revenue come from? If it is content discovery, AI visibility is upside; restricting it is a cost. If it is sold proprietary content, the calculation flips.
  2. Are your competitors visible in ChatGPT and Perplexity right now? A quick sweep of category-defining queries shows you whether the AI surface is already a battlefield in your category. If it is, opting out is opting out of the battlefield.
  3. What is your content moat? If your content is genuinely original research, a strong moat reduces the case for blocking, because original work earns disproportionate citation weight when AI systems can access it. If your content is commodity-grade, blocking does not protect anything that was a moat.
  4. Do you have anything paywalled or partially gated? Anything behind auth should be excluded from AI crawl regardless of overall policy. The robots.txt rules for those paths should be tighter than your homepage rules.
  5. Who in your organisation is going to monitor and review this quarterly? A robots.txt policy that nobody owns becomes wrong within six months. If no one owns it, default to a less restrictive position because the failure mode is more recoverable.

If you cannot answer four of the five, do not change your robots.txt yet. Run the audit instead. We do this as a fixed-scope exercise inside our answer engine optimisation engagements.

Robots.txt patterns you can copy

Once you have picked a position, the implementation is short. Here are the four patterns, each clean enough to paste directly.

Pattern 1: Allow All

# Default - all crawlers welcome
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

You add nothing AI-specific. Every named bot is implicitly allowed. Pair this with a strong llms.txt file and you have done your job.

Pattern 2: Block All AI

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:

This is verbose by design. Each bot is named separately because a wildcard User-agent: * rule does not apply to bots that have their own block. Skipping the per-bot rules is the most common mistake we see when teams try to block "all AI".
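The precedence rule is easy to test locally with Python's standard-library robots.txt parser. A minimal sketch (the domain is a placeholder) showing that a bot with its own group never falls back to the wildcard group:

```python
from urllib.robotparser import RobotFileParser

# Cut-down Block All pattern: GPTBot gets its own group,
# everything else falls through to the wildcard group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group, so the wildcard allow never applies to it.
print(rp.can_fetch("GPTBot", "https://www.example.com/blog/"))            # False
print(rp.can_fetch("SomeOtherCrawler", "https://www.example.com/blog/"))  # True
```

This doubles as a cheap pre-deploy check: paste your real robots.txt into the string and assert the access you expect for each bot.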

Pattern 3: Selective (retrieval yes, training no)

# Allow retrieval bots
User-agent: OAI-SearchBot
Disallow:

User-agent: ChatGPT-User
Disallow:

User-agent: Perplexity-User
Disallow:

# Block training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:

This is the position that gives you the visibility upside without the model-training side. It is the most defensible policy for brands with a strong ethical position on training data and a strong commercial reason to be cited live.

Pattern 4: Path-scoped selective

# AI bots can read marketing content but not gated assets
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /research/
Disallow: /reports/
Disallow: /clients/
Disallow: /private/

User-agent: *
Disallow: /private/

Use this when your site is a mix of public marketing content and gated research or client portals. The AI bots can read the marketing pages and learn your positioning, but cannot reach the proprietary material.
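The same standard-library parser can confirm the path scoping before you ship it. A sketch with placeholder URLs; note that stacking several User-agent lines over one shared rule group is valid robots.txt, and Python's parser handles it:

```python
from urllib.robotparser import RobotFileParser

# Pattern 4: the named AI bots share one rule group; everyone
# else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /research/
Disallow: /reports/
Disallow: /clients/
Disallow: /private/

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Marketing content stays readable, gated assets do not.
print(rp.can_fetch("GPTBot", "https://www.example.com/blog/positioning"))   # True
print(rp.can_fetch("GPTBot", "https://www.example.com/research/report-1"))  # False
```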

llms.txt is not a substitute for robots.txt

This is where teams trip up after reading the AI crawler conversation second-hand. robots.txt is a permission layer; llms.txt is a guidance layer. They live next to each other and solve different problems.

  • robots.txt says: "you are allowed to read this; you are not allowed to read that".
  • llms.txt says: "of the things you are allowed to read, here are the canonical and authoritative pages on each topic".

Blocking an AI bot in robots.txt and then publishing a polished llms.txt does nothing, because the bot never gets to the llms.txt to read it. Allowing the bot but skipping llms.txt leaves your content to be evaluated on raw signals, which underperforms a steered version. The two files are stronger together. We covered the llms.txt mechanics in detail in "What is llms.txt and why every website needs it in 2025".
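For concreteness, here is a minimal llms.txt sketch in the commonly proposed markdown format; the site name, URLs, and descriptions are invented placeholders, not a prescription:

```markdown
# Example Co

> B2B SaaS analytics platform. The pages below are our canonical sources on each topic.

## Guides

- [AI crawler policy](https://www.example.com/blog/ai-crawler-policy): deciding which bots to allow and why
- [llms.txt explained](https://www.example.com/blog/llms-txt): what the file does and how to structure it

## Company

- [About](https://www.example.com/about): positioning, services, and team
```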

How to know if your policy is working

Setting the policy is the easy part. Watching it is where most teams stop, and that is where the policy decays into a wrong answer over time.

The minimum monitoring stack we run for clients on AI search engagements:

  • Server log review. Once a month, sample your access logs and confirm the bots you allow are actually showing up, and the bots you block are not. If you blocked GPTBot and it is still hammering you, your robots.txt has a syntax error.
  • AI citation tracking. Run a fixed list of category-defining queries through ChatGPT, Perplexity, and Gemini once a week. Track whether your brand appears, in what position, and against which competitors. We covered the tooling for this in the AI search gap post.
  • GSC review for AI Overview impressions. Google Search Console is starting to expose AI Overview impression and click data. If you blocked Google-Extended, watch for the disappearance of AI Overview impressions; if you did not, watch for them to grow.
  • Quarterly robots.txt diff review. Save a copy of your robots.txt every quarter and review what changed. If nothing changed, ask whether the world changed and you missed updating.
  • New crawler list refresh. New AI crawler tokens appear roughly every quarter. Review OpenAI, Anthropic, and Perplexity documentation pages on a recurring calendar invite.
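The server log review in the first bullet does not need tooling; a few lines of Python cover the monthly sample. The bot list mirrors the table earlier in this post, and the log lines below are invented for illustration:

```python
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Google-Extended", "PerplexityBot", "Perplexity-User", "CCBot"]

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler token found in each access-log line."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # count each request line once
    return hits

# Invented sample; in practice, read a slice of your real access log.
sample = [
    '66.1.2.3 - - [01/May/2026] "GET /blog HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '20.4.5.6 - - [01/May/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '66.1.2.3 - - [01/May/2026] "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
print(count_ai_bot_hits(sample))  # Counter({'GPTBot': 2, 'ClaudeBot': 1})
```

If a bot you blocked still shows up here after a few weeks, check your robots.txt syntax first and the operator's documented IP ranges second.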

If a team cannot commit to even the first two of those, we recommend Position 1 (Allow All) by default, because the failure mode of an unattended Position 3 policy is silent visibility loss. The failure mode of an unattended Position 1 policy is at most a slow leak of training data, which is recoverable; visibility you never earned is much harder to recover.

What we actually recommend at Nico Digital

For most of the brands we work with at Nico Digital, the answer is Position 1 or Position 4. The reasoning is unromantic. Most of our clients monetise through visibility - their lead generation, their ecommerce checkouts, their SEO-driven pipeline all depend on showing up where buyers look. Closing the AI surface to protect content that was already being read by humans for free is rarely the right trade.

We do recommend Position 3 for clients with original research, proprietary methodology documents, or paywalled publications. We have recommended Position 2 exactly once in the last eighteen months, for a regulated financial-intelligence service whose entire commercial model was selling primary research.

If you are unsure where your brand sits, the cheapest first step is a category-level visibility audit. Run a structured set of buyer-stage queries through the major AI assistants and see whether you are present, mentioned, or invisible. That data, more than anything else, will tell you what your robots.txt should look like. We do this work as part of every AI SEO engagement, and it is also discussed in our pillar on SEO vs AEO vs GEO.

The honest summary. For 80% of brands, "Allow All plus a strong llms.txt" is the right policy. For 15%, "Selective with retrieval bots allowed and training bots blocked". For the remaining 5%, "Block All". Pick yours from the data, not from instinct, and review it every quarter.

Where to go next

If you found this useful, the natural next reads are the posts referenced throughout this piece: the AI search gap post, the llms.txt guide, and the SEO vs AEO vs GEO pillar.

When you are ready to align your AI crawler policy with your visibility goals, talk to us at Nico Digital or read our guide on how to evaluate an SEO agency for AI search. The policy review usually takes a single working session; the visibility recovery takes a quarter.

Aditya Kathotia

Founder & CEO

CEO of Nico Digital and founder of Digital Polo, Aditya Kathotia is a trailblazer in digital marketing. He's powered 500+ brands through transformative strategies, enabling clients worldwide to grow revenue exponentially. Aditya's work has been featured on Entrepreneur, Economic Times, Hubspot, Business.com, Clutch, and more. Join Aditya Kathotia's orbit on LinkedIn to gain exclusive access to his treasure trove of niche-specific marketing secrets and insights.

Want to explore working together?

Let's talk about how we can grow your digital presence and increase inbound business.