Technical SEO

Should You Block GPTBot, ClaudeBot, and PerplexityBot?

2026-05-08 · 13 min read
[Hero illustration: a robots.txt gate routing GPTBot, ClaudeBot, PerplexityBot, and Google-Extended into an Allow lane that flows through to a cited AI answer, and a Block lane that ends in a sealed marker; dashed connectors mark training crawlers, solid connectors mark retrieval crawlers.]

Six months ago a B2B SaaS client called us in a panic. Their head of compliance had pushed a robots.txt update that blocked every AI crawler the security team could name. It was framed as a content-protection measure. Six weeks later their inbound demo requests from "ChatGPT recommended you" had dropped to zero, their citations on Perplexity had vanished, and the founder was furious.

The compliance team had not done anything wrong, exactly. They had treated AI crawlers as a single category and blocked them on principle. Nobody had mapped the decision against revenue.

This is the post I wish that team had read first.

Whether to block GPTBot, ClaudeBot, PerplexityBot and friends is now a real strategic question for every brand that depends on organic discovery. The default answer is not obvious, and getting it wrong in either direction is expensive. We work this question through with clients on every AI SEO engagement, and the framework below is what we actually use.

The decision in one paragraph

If your brand depends on being discovered, mentioned, or cited by humans doing research, you almost certainly want most AI crawlers reading your site. If your brand depends on a paywall or proprietary content moat, you almost certainly want most of them blocked. The interesting cases are everything in between, and the answer for those is rarely "all" or "none". It is "selectively, with monitoring, and with an llms.txt file pointing the bots that are allowed in toward your best content".

That is the short version. The rest of this post is what to do once you accept that.

The asymmetry that matters. Blocking an AI crawler is reversible. Losing six months of citation and brand-mention compounding inside ChatGPT is not. AI training cycles run quarterly. If you opt out today and change your mind in eight weeks, you are still out of the next two model refreshes.

Why this question exists in the first place

Until 2023, robots.txt was a quiet file. Most marketing teams never touched it after the launch checklist. Then in August 2023, OpenAI shipped GPTBot and gave site owners an explicit way to say no. Anthropic, Google, Common Crawl, and Perplexity followed. By the end of 2024, the question "should we block AI crawlers" was on the table at every brand we worked with.

The arguments in the room are usually some mix of:

  • Content protection. "We do not want our content used to train models that compete with us."
  • Compliance. "Legal asked us to block them until we have a policy."
  • Bandwidth. "These bots are hammering our servers."
  • Attribution. "If they read our stuff and answer the question themselves, we lose the click."
  • Visibility. "Our buyers ask ChatGPT first now. If we are not in there, we are invisible."

Every one of those concerns is real. None of them is the whole picture. The right policy is the one that lets you optimise the things that drive your revenue, and most of the time that is visibility, not protection. We covered the strategic backdrop in "Ranking on Google but Missing on ChatGPT? Fix This Now", and this post is the technical companion to that.

The eight bots you actually need to think about

Most of the noise on this topic conflates training crawlers with retrieval crawlers, and conflates first-party AI assistants with third-party scrapers. The real list, with what each one does, looks like this.

| Bot | Operator | Purpose | Blocking impact |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training future GPT models | Excluded from next training rounds |
| OAI-SearchBot | OpenAI | Indexing for ChatGPT Search | No live ChatGPT Search citations |
| ChatGPT-User | OpenAI | User-triggered live fetch | No real-time answers from your site |
| ClaudeBot | Anthropic | Training future Claude models | Excluded from Anthropic training data |
| Google-Extended | Google | Gemini training plus AI Overviews | Removed from Gemini and AI Overviews; classic Google search unaffected |
| PerplexityBot | Perplexity | Indexing for Perplexity answers | No Perplexity index inclusion |
| Perplexity-User | Perplexity | Live retrieval for Perplexity queries | No live citation in Perplexity answers |
| CCBot | Common Crawl | Open-web dataset feeding most LLMs | Excluded from open datasets used by many smaller AI labs |

A clean robots.txt policy speaks to all eight. A lazy one blocks "AI crawlers" with a single rule and quietly rules your brand out of multiple ecosystems at once. We see the lazy version on roughly four out of every ten technical audits we run.

The training versus retrieval distinction

This is the single most important nuance in the entire conversation, and it is the one that most marketing teams get wrong.

[Diagram: your site's service pages, blog and insights, and case studies, as read by the two crawler types.]

  • Training crawlers (GPTBot · ClaudeBot · Google-Extended · CCBot): read once, every few months; your content shapes future model knowledge.
  • Retrieval crawlers (OAI-SearchBot · ChatGPT-User · Perplexity-User): read on demand, when a user asks; your page can be cited live with attribution.

Training crawlers read your content and feed it into the next training run of an LLM. The result is that the model "knows about you" the way it knows about Wikipedia entries it has read. Citations from this path are unattributed. The model has absorbed the information; it does not link back.

Retrieval crawlers read your content live, in response to a user query, and the answer is generated with explicit citations and links back to your page. The user can click through. This is the path that drives traffic.

The strategic implication is simple but missed constantly: blocking retrieval crawlers is almost always a mistake, while blocking training crawlers is a defensible policy choice. If you are running a content site that monetises through traffic, attribution, or lead generation, you want the retrieval bots in. The training bots are a separate conversation.

The four positions

Once you understand training versus retrieval, every reasonable robots.txt policy collapses into one of four positions.

Position 1: Allow All

You let every well-behaved AI crawler read everything. This is the right default for almost every consumer brand, ecommerce site, agency, SaaS company, and publisher that monetises through reach rather than paywalls. We use this for the majority of clients. Pair it with a strong llms.txt and you maximise your AI surface area without any defensive overhead.

Position 2: Block All

You disallow every AI bot you can identify. This is correct only when you are running a paywalled publication, holding genuinely proprietary research that you sell, or operating in a regulated space where AI use of your content creates legal exposure. The cost is total exclusion from AI search visibility for as long as the policy stands. We have only recommended this once in the last eighteen months, and it was for a financial intelligence service whose entire revenue model was selling that content.

Position 3: Selective

You allow retrieval bots and block training bots, or you allow some operators and block others. This is the most common position for brands with mixed content, where some pages are commercial and others are gated. The technical implementation is straightforward; the operational discipline of keeping the rules current is what kills most teams. Plan for a quarterly review.

Position 4: Allow with llms.txt steering

You allow everything but use an llms.txt file to tell AI systems which of your accessible pages are the canonical sources on each topic. This is the position we recommend for content-led brands. It maximises crawl access while doing the visibility work of pointing AI systems toward your best material.

[Decision tree: AI crawler access.]

  • Is your content paywalled? Yes: Position 2 (Block All AI bots). No: next question.
  • Do you depend on AI visibility? Yes: Position 4 (Allow + llms.txt steering). Not yet: Position 3 (Selective: retrieval yes, training no).

Position 1 (Allow All) is the implicit default if no rules are added.

The five questions to ask before you decide

Before any client team commits to a position, we walk them through these five questions in this order. The answers usually point at one of the four positions almost without further argument.

  1. Where does your revenue come from? If it is content discovery, AI visibility is upside; restricting it is a cost. If it is sold proprietary content, the calculation flips.
  2. Are your competitors visible in ChatGPT and Perplexity right now? A quick sweep of category-defining queries shows you whether the AI surface is already a battlefield in your category. If it is, opting out is opting out of the battlefield.
  3. What is your content moat? If your content is genuinely original research, a strong moat reduces the case for blocking, because original work earns disproportionate citation weight when AI systems can access it. If your content is commodity-grade, blocking does not protect anything that was a moat.
  4. Do you have anything paywalled or partially gated? Anything behind auth should be excluded from AI crawl regardless of overall policy. The robots.txt rules for those paths should be tighter than your homepage rules.
  5. Who in your organisation is going to monitor and review this quarterly? A robots.txt policy that nobody owns becomes wrong within six months. If no one owns it, default to a less restrictive position because the failure mode is more recoverable.

If you cannot answer four of the five, do not change your robots.txt yet. Run the audit instead. We do this as a fixed-scope exercise inside our answer engine optimisation engagements.

Robots.txt patterns you can copy

Once you have picked a position, the implementation is short. Here are the four patterns, each clean enough to paste directly.

Pattern 1: Allow All

# Default - all crawlers welcome
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

You add nothing AI-specific. Every named bot is implicitly allowed. Pair this with a strong llms.txt file and you have done your job.

Pattern 2: Block All AI

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:

This is verbose by design. Each bot is named separately because a wildcard User-agent: * rule does not apply to bots that have their own block. Skipping the per-bot rules is the most common mistake we see when teams try to block "all AI".
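The precedence rule is easy to test locally with Python's standard-library robots.txt parser. A minimal sketch (the domain is a placeholder) showing that a bot with its own group never falls back to the wildcard group:

```python
from urllib.robotparser import RobotFileParser

# Cut-down Block All pattern: GPTBot gets its own group,
# everything else falls through to the wildcard group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group, so the wildcard allow never applies to it.
print(rp.can_fetch("GPTBot", "https://www.example.com/blog/"))            # False
print(rp.can_fetch("SomeOtherCrawler", "https://www.example.com/blog/"))  # True
```

This doubles as a cheap pre-deploy check: paste your real robots.txt into the string and assert the access you expect for each bot.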

Pattern 3: Selective (retrieval yes, training no)

# Allow retrieval bots
User-agent: OAI-SearchBot
Disallow:

User-agent: ChatGPT-User
Disallow:

User-agent: Perplexity-User
Disallow:

# Block training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:

This is the position that gives you the visibility upside without the model-training side. It is the most defensible policy for brands with a strong ethical position on training data and a strong commercial reason to be cited live.

Pattern 4: Path-scoped selective

# AI bots can read marketing content but not gated assets
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /research/
Disallow: /reports/
Disallow: /clients/
Disallow: /private/

User-agent: *
Disallow: /private/

Use this when your site is a mix of public marketing content and gated research or client portals. The AI bots can read the marketing pages and learn your positioning, but cannot reach the proprietary material.
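The same standard-library parser can confirm the path scoping before you ship it. A sketch with placeholder URLs; note that stacking several User-agent lines over one shared rule group is valid robots.txt, and Python's parser handles it:

```python
from urllib.robotparser import RobotFileParser

# Pattern 4: the named AI bots share one rule group; everyone
# else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /research/
Disallow: /reports/
Disallow: /clients/
Disallow: /private/

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Marketing content stays readable, gated assets do not.
print(rp.can_fetch("GPTBot", "https://www.example.com/blog/positioning"))   # True
print(rp.can_fetch("GPTBot", "https://www.example.com/research/report-1"))  # False
```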

llms.txt is not a substitute for robots.txt

This is where teams trip up after reading the AI crawler conversation second-hand. robots.txt is a permission layer; llms.txt is a guidance layer. They live next to each other and solve different problems.

  • robots.txt says: "you are allowed to read this; you are not allowed to read that".
  • llms.txt says: "of the things you are allowed to read, here are the canonical and authoritative pages on each topic".

Blocking an AI bot in robots.txt and then publishing a polished llms.txt does nothing, because the bot never gets to the llms.txt to read it. Allowing the bot but skipping llms.txt leaves your content to be evaluated on raw signals, which underperforms a steered version. The two files are stronger together. We covered the llms.txt mechanics in detail in "What is llms.txt and why every website needs it in 2025".
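For concreteness, here is a minimal llms.txt sketch in the commonly proposed markdown format; the site name, URLs, and descriptions are invented placeholders, not a prescription:

```markdown
# Example Co

> B2B SaaS analytics platform. The pages below are our canonical sources on each topic.

## Guides

- [AI crawler policy](https://www.example.com/blog/ai-crawler-policy): deciding which bots to allow and why
- [llms.txt explained](https://www.example.com/blog/llms-txt): what the file does and how to structure it

## Company

- [About](https://www.example.com/about): positioning, services, and team
```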

How to know if your policy is working

Setting the policy is the easy part. Watching it is where most teams stop, and that is where the policy decays into a wrong answer over time.

The minimum monitoring stack we run for clients on AI search engagements:

  • Server log review. Once a month, sample your access logs and confirm the bots you allow are actually showing up, and the bots you block are not. If you blocked GPTBot and it is still hammering you, your robots.txt has a syntax error.
  • AI citation tracking. Run a fixed list of category-defining queries through ChatGPT, Perplexity, and Gemini once a week. Track whether your brand appears, in what position, and against which competitors. We covered the tooling for this in the AI search gap post.
  • GSC review for AI Overview impressions. Google Search Console is starting to expose AI Overview impression and click data. If you blocked Google-Extended, watch for the disappearance of AI Overview impressions; if you did not, watch for them to grow.
  • Quarterly robots.txt diff review. Save a copy of your robots.txt every quarter and review what changed. If nothing changed, ask whether the world changed and you missed updating.
  • New crawler list refresh. New AI crawler tokens appear roughly every quarter. Review OpenAI, Anthropic, and Perplexity documentation pages on a recurring calendar invite.
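The server log review in the first bullet does not need tooling; a few lines of Python cover the monthly sample. The bot list mirrors the table earlier in this post, and the log lines below are invented for illustration:

```python
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Google-Extended", "PerplexityBot", "Perplexity-User", "CCBot"]

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler token found in each access-log line."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # count each request line once
    return hits

# Invented sample; in practice, read a slice of your real access log.
sample = [
    '66.1.2.3 - - [01/May/2026] "GET /blog HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '20.4.5.6 - - [01/May/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '66.1.2.3 - - [01/May/2026] "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
print(count_ai_bot_hits(sample))  # Counter({'GPTBot': 2, 'ClaudeBot': 1})
```

If a bot you blocked still shows up here after a few weeks, check your robots.txt syntax first and the operator's documented IP ranges second.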

If a team cannot commit to even the first two of those, we recommend Position 1 (Allow All) by default, because the failure mode of an unattended Position 3 policy is silent visibility loss. The failure mode of an unattended Position 1 policy is at most a slow leak of training data, which is recoverable; visibility you never earned is much harder to recover.

What we actually recommend at Nico Digital

For most of the brands we work with at Nico Digital, the answer is Position 1 or Position 4. The reasoning is unromantic. Most of our clients monetise through visibility - their lead generation, their ecommerce checkouts, their SEO-driven pipeline all depend on showing up where buyers look. Closing the AI surface to protect content that was already being read by humans for free is rarely the right trade.

We do recommend Position 3 for clients with original research, proprietary methodology documents, or paywalled publications. We have recommended Position 2 exactly once in the last eighteen months, for a regulated financial-intelligence service whose entire commercial model was selling primary research.

If you are unsure where your brand sits, the cheapest first step is a category-level visibility audit. Run a structured set of buyer-stage queries through the major AI assistants and see whether you are present, mentioned, or invisible. That data, more than anything else, will tell you what your robots.txt should look like. We do this work as part of every AI SEO engagement, and it is also discussed in our pillar on SEO vs AEO vs GEO.

The honest summary. For 80% of brands, "Allow All plus a strong llms.txt" is the right policy. For 15%, "Selective with retrieval bots allowed and training bots blocked". For the remaining 5%, "Block All". Pick yours from the data, not from instinct, and review it every quarter.

Where to go next

If you found this useful, the natural next reads are the posts referenced throughout this piece: the AI search gap post, the llms.txt guide, and the SEO vs AEO vs GEO pillar.

When you are ready to align your AI crawler policy with your visibility goals, talk to us at Nico Digital or read our guide on how to evaluate an SEO agency for AI search. The policy review usually takes a single working session; the visibility recovery takes a quarter.

Aditya Kathotia

Founder & CEO

CEO of Nico Digital and founder of Digital Polo, Aditya Kathotia is a trailblazer in digital marketing. He's powered 500+ brands through transformative strategies, enabling clients worldwide to grow revenue exponentially. Aditya's work has been featured on Entrepreneur, Economic Times, Hubspot, Business.com, Clutch, and more. Join Aditya Kathotia's orbit on LinkedIn to gain exclusive access to his treasure trove of niche-specific marketing secrets and insights.

Want to explore working together?

Let's talk about how we can grow your digital presence and increase inbound business.