Every AI Crawler You're Accidentally Blocking in Q3 2026

AI SEO Intelligence

May 25, 2026

10 min read

The robots.txt file on most production sites was written before half the AI crawlers it is now being asked to govern existed. Between February and April 2026, four major vendors - Anthropic, OpenAI, Apple, and Perplexity - quietly added or formally documented new user-agents. None of them issued a press release. The user-agents show up in your access logs and either get matched against rules you never wrote, or they fall through to whatever the wildcard User-agent: * block happens to say.

We refreshed our own crawler map this week against authoritative vendor docs and the Q1 2026 Cloudflare network analyses. The short version is that the "block GPTBot, allow Googlebot" mental model that most robots.txt files still operate on is now governing eleven-plus distinct crawlers across seven vendors, several of which have splintered into two or three siblings since their original launch. The longer version - what each bot does, which ones cost you AI search visibility when blocked, and which are safe to disallow - is below.

Key Takeaways

Anthropic split its single ClaudeBot crawler into three documented bots in February 2026: ClaudeBot (training), Claude-User (user-triggered fetches), and Claude-SearchBot (search index).
OpenAI added a fourth bot, OAI-AdsBot, to its public crawler docs in Q1 2026. It only visits pages submitted as ads in ChatGPT, but it shows up in your logs and gets caught by overly broad wildcard blocks.
Applebot-Extended has been documented since 2024 as an opt-out signal for Apple Intelligence training. Its traffic share is growing but its adoption in explicit robots.txt rules remains low - most sites have no rule for it at all.
The most common accidental block hits the search and user-triggered bots - Claude-SearchBot, OAI-SearchBot, Perplexity-User, ChatGPT-User - which get swept up by a User-agent: * / Disallow: / rule meant to stop training. Blocking them removes you from answer surfaces that the training crawlers never powered in the first place.

The 2025-2026 Splintering Pattern

The biggest structural change since the original 2023-2024 AI crawler wave is that the major vendors stopped using a single user-agent for everything. OpenAI started the pattern in 2024 by separating GPTBot (training), OAI-SearchBot (the SearchGPT index), and ChatGPT-User (user-triggered browsing). Anthropic mirrored it in early 2026. Perplexity now distinguishes PerplexityBot (the crawler) from Perplexity-User (the user-triggered fetcher). Google has long maintained the Googlebot / Google-Extended split.

The pattern matters because the bots in each split do fundamentally different things and the cost of blocking them is fundamentally different:

Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Bytespider). These fetch content to improve the underlying model. Blocking them means your future content will not be used to train the next model generation. That is a legitimate editorial choice with no immediate user-visibility cost.
Search/retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot, Googlebot). These build the real-time index the AI uses to answer questions. Blocking them removes you from ChatGPT Search results, Claude's search results, Perplexity answers, Copilot, and AI Overviews. The cost is immediate and measurable in lost referral traffic.
User-triggered fetchers (ChatGPT-User, Claude-User, Perplexity-User). These fire when a logged-in user pastes your URL or asks the assistant to read a specific page. Blocking them means the assistant will tell its user it cannot access your page. This is the most embarrassing block - it surfaces directly in the user's chat as a refusal.

Most robots.txt files we have audited do not distinguish among these three. A site that adds Disallow: / under User-agent: * to stop training is also, accidentally, telling the search crawler not to index it and the user-triggered fetcher to refuse direct user requests.
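The inheritance is easy to reproduce. Below is a minimal sketch using Python's standard-library urllib.robotparser, assuming a robots.txt whose only rule is the wildcard block. The example.com URL and the helper name check_access are placeholders, and each vendor's crawler may match tokens slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a wildcard block written to "stop AI training".
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# One representative token from each of the three categories above.
BOTS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search/retrieval",
    "ChatGPT-User": "user-triggered",
}


def check_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return {token: allowed} for every token in BOTS."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in BOTS}


if __name__ == "__main__":
    results = check_access(ROBOTS_TXT, "https://example.com/article")
    for bot, allowed in results.items():
        verdict = "allowed" if allowed else "BLOCKED"
        print(f"{bot:15} {BOTS[bot]:18} {verdict}")
```

None of the three tokens has its own group, so all three fall through to the wildcard and are refused, which is the coupling described above.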

What Changed Between February and May 2026

Four concrete vendor changes in the last 90 days that most site owners have not yet caught up to:

Anthropic's three-bot framework (February 2026)

Anthropic updated its crawler documentation in February 2026 to formally describe three distinct bots. The rationale is the standard one: separate training from search from user-triggered fetches so site owners can make granular decisions.

ClaudeBot - training. Same user-agent as the 2024 bot. If you already had a rule for this, it still applies.
Claude-SearchBot - search/retrieval. New token. Powers Claude's search results.
Claude-User - user-triggered fetcher. Fires when a Claude user explicitly asks the assistant to read a URL.

A site that had User-agent: ClaudeBot / Disallow: / in its robots.txt is unaffected by the split: the rule still covers training and nothing else. A site that had User-agent: * / Disallow: / and assumed it was blocking only training is now also blocking search and user-triggered fetches, which it probably did not intend.
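The two cases can be compared side by side. This is a sketch only: both robots.txt bodies are hypothetical, allowed_tokens is an illustrative helper, and Python's urllib.robotparser stands in for Anthropic's own matching logic, which may differ.

```python
from urllib.robotparser import RobotFileParser

ANTHROPIC_TOKENS = ["ClaudeBot", "Claude-SearchBot", "Claude-User"]

# Case 1: a rule that names the training bot only.
TARGETED = """\
User-agent: ClaudeBot
Disallow: /
"""

# Case 2: a wildcard rule that was assumed to cover training only.
WILDCARD = """\
User-agent: *
Disallow: /
"""


def allowed_tokens(robots_txt: str) -> list[str]:
    """List the Anthropic tokens that may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [t for t in ANTHROPIC_TOKENS if parser.can_fetch(t, "/")]


if __name__ == "__main__":
    print("targeted:", allowed_tokens(TARGETED))
    print("wildcard:", allowed_tokens(WILDCARD))
```

Under the targeted rule the search and user-triggered tokens stay allowed; under the wildcard rule none of the three do.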

OpenAI's OAI-AdsBot (Q1 2026)

OpenAI added OAI-AdsBot to its public crawler documentation in Q1 2026. The bot is narrow in scope: it only visits pages that an advertiser has submitted as an ad landing page in ChatGPT. The user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-AdsBot/1.0; +https://openai.com/adsbot.

Two things to know. First, OpenAI says the data is not used to train its foundation models - the bot's only purpose is ad-policy validation. Second, unlike OpenAI's other three bots, OAI-AdsBot has no published IP range (there is no openai.com/adsbot.json) at the time of writing, so firewalls cannot allowlist it the way they can GPTBot or OAI-SearchBot. If you run ChatGPT ads, a wildcard block in robots.txt can quietly cause the bot to fail its policy check and may delay your ad approval.

Applebot-Extended's quiet traffic surge (Q1-Q2 2026)

Applebot-Extended itself is not new - Apple introduced the token in 2024 as an opt-out signal for Apple Intelligence training. What changed is that its share of AI crawler traffic has been growing through Q1-Q2 2026, while explicit robots.txt adoption has lagged. Most non-news sites have no rule for Applebot-Extended at all - they are implicitly consenting through a missing rule. Several major publishers have added an explicit block; most sites have not decided. Worth deciding deliberately.

Perplexity's user-agent stance (ongoing, still contested)

Perplexity's split between PerplexityBot (the crawler) and Perplexity-User (the user-triggered fetcher) has been documented for a while, but the company's public position is that Perplexity-User is "an agent, not a bot" and therefore is not required to honor robots.txt directives. Separately, Cloudflare documented in August 2025 that Perplexity was using stealth, undeclared crawlers to bypass blocks. Perplexity's position has not formally changed in 2026.

Practical implication: a rule like User-agent: Perplexity-User / Disallow: / is what Perplexity asks you to write, but the user-triggered traffic may continue regardless. The cleaner signal is at the firewall layer using the published perplexity-user.json IP range, not at the robots.txt layer.
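A firewall-layer check boils down to testing the request IP against the published ranges. The sketch below rests on assumptions: the "prefixes" / "ipv4Prefix" JSON layout is a guess at the file's shape, the addresses are reserved documentation ranges rather than Perplexity's real ones, and both function names are illustrative. Verify the actual format against the vendor file before relying on it.

```python
import ipaddress
import json

# Hypothetical payload. The layout is an ASSUMPTION about the published
# file, and the ranges are RFC 5737 documentation addresses.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv4Prefix": "198.51.100.0/28"}]}
"""


def load_networks(payload: str) -> list:
    """Parse the assumed JSON layout into ipaddress network objects."""
    data = json.loads(payload)
    return [ipaddress.ip_network(p["ipv4Prefix"]) for p in data["prefixes"]]


def is_declared_fetcher(client_ip: str, networks: list) -> bool:
    """True if the request IP falls inside one of the loaded ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES_JSON)
    for ip in ("192.0.2.17", "203.0.113.9"):
        print(ip, is_declared_fetcher(ip, nets))
```

A request that carries the Perplexity-User user-agent but arrives from outside the loaded ranges is either spoofed or undeclared; what to do with it is a policy decision for the firewall, not something robots.txt can express.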

How Discovery Mechanisms Are Shifting

Robots.txt is no longer the only signal AI crawlers consult to decide what to fetch. A second, complementary mechanism has been quietly gaining adoption since mid-2024: /llms.txt, a plain-text file at the root of your domain that points AI systems at the documents you want them to use. Our companion post on llms.txt one year later walks through actual adoption rates and which AI systems honor it. The short version: robots.txt still decides whether a crawler can fetch you, but llms.txt increasingly decides what it fetches first. The two files solve different parts of the same problem and our audit checks both.

If you treat AI crawler discovery as a robots.txt-only problem in 2026, you are arguing with the wrong file. A site can be technically allowed by robots.txt and still be invisible in AI answers because the crawler never picks it out of the long-tail. The combination - explicit Allow: for the search bots in robots.txt, plus an llms.txt that tells those bots what to prioritize - is the configuration that maps cleanly to the new crawler topology.

There is a second layer most "let the AI bots in" guides skip past: even when a crawler is allowed and reaches your page, the bulk of them never execute the JavaScript that would reveal your actual content. Our companion post on JavaScript rendering across AI crawlers walks through the per-crawler matrix - the practical implication is that allowing GPTBot or ClaudeBot through robots.txt is necessary but not sufficient if your content only exists after a client-side render.

The State of robots.txt in 2026

The No Hacks analysis of the AI user-agent landscape and the SEJ analysis of 68 million AI crawler visits converge on the same structural observation: the vast majority of sites do not explicitly govern any individual AI crawler and rely instead on wildcard inheritance. GPTBot and ClaudeBot are the most commonly blocked bots when sites do add explicit rules, but absolute blocking rates remain low. Overwhelmingly, sites are allowing AI crawlers through silence.

New tokens like Claude-SearchBot and OAI-AdsBot do not appear in most robots.txt files at all because almost nobody has added explicit rules for them yet.

That is the opportunity and the risk. The opportunity: if you do nothing, you are implicitly granting access to all the new bots and getting whatever AI-search visibility comes with that. The risk: if your existing wildcard rules accidentally block them, you are losing visibility on surfaces you probably wanted to be on, and you will not see it in any crawl-log report because the bots either succeeded silently or got 403s you never looked at.
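The 403s are easy to surface once you look for them. This is a rough sketch that assumes an access log in Combined Log Format. The sample lines are fabricated for illustration (including the simplified user-agent strings), the token list is a subset of the bots named in this article, and blocked_ai_hits is a hypothetical helper.

```python
import re
from collections import Counter

# Subset of the tokens named in this article; extend from vendor docs.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "OAI-AdsBot",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]

# Combined Log Format tail: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"$')

# Fabricated sample lines, not real traffic.
SAMPLE_LOG = [
    '203.0.113.7 - - [25/May/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" '
    '403 153 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.8 - - [25/May/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.9 - - [25/May/2026:10:00:09 +0000] "GET /docs HTTP/1.1" '
    '403 153 "-" "Mozilla/5.0 (compatible; Claude-User/1.0)"',
    '203.0.113.7 - - [25/May/2026:10:01:00 +0000] "GET /blog HTTP/1.1" '
    '403 153 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]


def blocked_ai_hits(log_lines) -> Counter:
    """Count (token, status) pairs for AI-bot requests answered with a 4xx."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line.rstrip())
        if not match or not match["status"].startswith("4"):
            continue
        ua = match["ua"].lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                counts[(token, match["status"])] += 1
                break
    return counts


if __name__ == "__main__":
    for (token, status), n in blocked_ai_hits(SAMPLE_LOG).most_common():
        print(f"{token:18} {status} x{n}")
```

Note that this only catches refusals made at the server or firewall. A crawler that obeys a robots.txt block never requests the page at all, so that kind of block shows up as an absence of hits rather than as 403s.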

A 7-Minute Self-Audit for Your robots.txt

Pull up your site's robots.txt (just append /robots.txt to your domain). Walk it through this:

Find your wildcard block. Look for User-agent: * followed by Disallow: / or Disallow: /private/ patterns. Every AI crawler that does not have its own dedicated section will inherit this rule.
List every User-agent: directive you have explicit rules for. If the list does not include all three of ClaudeBot, Claude-SearchBot, and Claude-User, you are governing Anthropic's bots through inheritance, not intention. Same exercise for the OpenAI four (GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot) and the Perplexity two (PerplexityBot, Perplexity-User).
Decide on training separately from search. A defensible 2026 policy is: training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Bytespider) get whatever rule matches your stance on AI training data; search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot, Googlebot) get Allow: /. The two decisions are independent and should not be coupled by accident.
Decide on user-triggered fetchers explicitly. ChatGPT-User, Claude-User, and Perplexity-User fire when a real user is asking the assistant about your specific page. The cost of blocking them is "the assistant tells the user it cannot read your page," in front of that user, in real time. Almost no site wants this to happen unless they are deliberately blocking AI assistants entirely.
Confirm the bot list against an authoritative source. Vendor docs are the source of truth: OpenAI's crawler docs, Anthropic's bot documentation, Perplexity crawler docs, Google's crawler list, and Apple's Applebot docs. Anything you have not verified against the vendor in the last six months is a candidate for re-checking.
Run an audit. Our audit flags the AI-search bots that have no explicit rule in your robots.txt and tells you which dimension of AI visibility the omission affects. The bot list is part of the catalog we maintain against vendor documentation; when we add a newly-documented bot to the catalog, every page audited from that point on starts checking it. If your robots.txt has not been touched since 2024, the audit will surface every new bot in one report.
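The second step above, listing which tokens have explicit rules, can be scripted in a few lines. This is a minimal sketch: the sample robots.txt is hypothetical, the vendor lists are the ones named in this article, and the function names are illustrative. It only reads User-agent lines and does not evaluate the rules themselves.

```python
# Tokens per vendor, as listed in step two of the checklist above.
EXPECTED = {
    "Anthropic": ["ClaudeBot", "Claude-SearchBot", "Claude-User"],
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "OAI-AdsBot"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
}

# Hypothetical robots.txt last touched in 2024.
SAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def explicit_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def inherited_tokens(robots_txt: str) -> dict[str, list[str]]:
    """Per vendor, the tokens governed only by the wildcard or by silence."""
    explicit = explicit_agents(robots_txt)
    return {
        vendor: [t for t in tokens if t.lower() not in explicit]
        for vendor, tokens in EXPECTED.items()
    }


if __name__ == "__main__":
    for vendor, tokens in inherited_tokens(SAMPLE).items():
        print(f"{vendor}: {', '.join(tokens) or 'all explicit'}")
```

For the sample file, every search and user-triggered token comes back as inherited, which is the typical picture for a robots.txt written before the vendors split their crawlers.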

What This Means for Your Q3 2026 Roadmap

Three concrete things worth queuing.

Make the training-vs-search decision explicit. If you do not have separate User-agent: sections for the training bots and the search bots, you are coupling two independent decisions and probably under- or over-blocking on at least one of them. The split takes ten minutes to write and resolves a class of ambiguity that has only gotten worse as vendors have splintered their crawlers.

Add the four 2026-documented user-agents (Claude-SearchBot, Claude-User, OAI-AdsBot, plus an explicit Perplexity-User decision). Even if your policy is "default-allow," the explicit rule is the documentation of the decision. Site owners six months from now - including future you - will not have to reverse-engineer the wildcard inheritance to understand what is governing what.

Re-check the catalog twice a year. The bot landscape was stable from roughly mid-2024 through late 2025 and has been moving steadily through Q1-Q2 2026.
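The first two roadmap items can be sketched as a small generator that writes one group per decision and then checks the result. Everything here is an illustration under stated assumptions: build_robots_txt is a hypothetical helper, the token lists are the ones used in this article, the choice to allow OAI-AdsBot is just an example policy, and urllib.robotparser is only a stand-in for how each vendor actually evaluates the file.

```python
from urllib.robotparser import RobotFileParser

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended",
            "Applebot-Extended", "CCBot", "Bytespider"]
SEARCH = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
          "Bingbot", "Googlebot"]
USER_TRIGGERED = ["ChatGPT-User", "Claude-User", "Perplexity-User"]
AD_VALIDATION = ["OAI-AdsBot"]


def build_robots_txt(block_training: bool) -> str:
    """Emit one explicit group per decision; no wildcard group is written."""
    training_rule = "Disallow: /" if block_training else "Allow: /"
    groups = [
        ("# Training crawlers", TRAINING, training_rule),
        ("# Search/retrieval crawlers", SEARCH, "Allow: /"),
        ("# User-triggered fetchers", USER_TRIGGERED, "Allow: /"),
        ("# Ad-policy validation", AD_VALIDATION, "Allow: /"),
    ]
    lines = []
    for comment, tokens, rule in groups:
        lines.append(comment)
        lines.extend(f"User-agent: {token}" for token in tokens)
        lines.append(rule)
        lines.append("")  # blank line closes the group
    return "\n".join(lines)


if __name__ == "__main__":
    text = build_robots_txt(block_training=True)
    print(text)
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    for token in ("GPTBot", "OAI-SearchBot", "Claude-User"):
        print(token, parser.can_fetch(token, "/"))
```

The generated groups would sit alongside whatever wildcard group the site already has. Because each token is named explicitly, the wildcard no longer decides anything for these bots, and the comments double as the record of what was decided.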

Update - shipped 2026-05-25. Five of the eight newly-documented or rapidly-growing crawlers are now part of our default audit catalog: Claude-SearchBot, Claude-User, Applebot-Extended, Perplexity-User, and OAI-AdsBot. Re-audit any site to see them surface alongside the existing ten. The three not yet added - Amazonbot, Meta-ExternalAgent, MistralAI-User - are held for the Q4 review pending traffic-share data we trust; none currently appear in the Cloudflare Q1 2026 top-six blocked ranking. The threshold for inclusion is unchanged: documented user-agent in the vendor's own docs, plus evidence of measurable user-visible impact when blocked.

The robots.txt file is one of the cheapest configuration assets a site owns. It is also one of the most-neglected. The default of "implicit consent through silence" worked when there were three crawlers to think about. With eleven-plus, the silence is doing more work than most site owners want it to.

Want to see which AI crawlers have explicit rules on your site and which are being governed by inheritance? Run a free audit at hybridranking.com. The bot-access check is part of the AI-visibility surface we evaluate, and the catalog is refreshed against vendor docs as new tokens are published.