Every AI Crawler You're Accidentally Blocking in Q3 2026
The robots.txt file on most production sites was written before half the AI crawlers it is now being asked to govern existed. Between February and April 2026, four major vendors (Anthropic, OpenAI, Apple, and Perplexity) quietly added or formally documented new user-agents. None of them issued a press release. The user-agents simply show up in your access logs, where they either match rules you never wrote with them in mind or fall through to whatever the wildcard User-agent: * block happens to say.
We refreshed our own crawler map this week against authoritative vendor docs and the Q1 2026 Cloudflare network analyses. The short version is that the "block GPTBot, allow Googlebot" mental model that most robots.txt files still operate on is now governing eleven-plus distinct crawlers across seven vendors, several of which have splintered into two or three siblings since their original launch. The longer version — what each bot does, which ones cost you AI search visibility when blocked, and which are safe to disallow — is below.
Key Takeaways
- Anthropic split its single ClaudeBot crawler into three documented bots in February 2026: ClaudeBot (training), Claude-User (user-triggered fetches), and Claude-SearchBot (search index).
- OpenAI added a fourth bot, OAI-AdsBot, to its public crawler docs in April 2026. It only visits pages submitted as ads in ChatGPT, but it shows up in your logs and gets caught by overly broad wildcard blocks.
- Applebot-Extended has been documented since 2024, but its traffic share more than doubled to ~9.2% of all AI crawler traffic in April 2026 — putting it within 0.6 percentage points of GPTBot. About 25% of news publishers block it; most other sites haven't decided.
- The Q1 2026 Cloudflare-network analysis of 4,128 robots.txt files showed the most-blocked AI bots are still GPTBot (5.52%), CCBot (5.08%), ClaudeBot (4.88%), Google-Extended (4.44%), and Bytespider (4.23%) — almost identical to the 2024 ranking, suggesting that publishers' robots.txt files are evolving much more slowly than the crawler landscape.
- The most common accidental block is the search-and-retrieval crawlers (Claude-SearchBot, Perplexity-User, OAI-SearchBot, ChatGPT-User) getting swept up by a User-agent: * / Disallow: / rule meant to stop training. Blocking them removes you from the live answer surfaces, which the training crawlers never powered in the first place.
The 2025-2026 Splintering Pattern
The biggest structural change since the original 2023-2024 AI crawler wave is that the major vendors stopped using a single user-agent for everything. OpenAI started the pattern in 2024 by separating GPTBot (training), OAI-SearchBot (the SearchGPT index), and ChatGPT-User (user-triggered browsing). Anthropic mirrored it in early 2026. Perplexity now distinguishes PerplexityBot (the crawler) from Perplexity-User (the user-triggered fetcher). Google has long maintained the Googlebot / Google-Extended split.
The pattern matters because the bots in each split do fundamentally different things and the cost of blocking them is fundamentally different:
- Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Bytespider). These fetch content to improve the underlying model. Blocking them means your future content will not be used to train the next model generation. That is a legitimate editorial choice with no immediate user-visibility cost.
- Search/retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot, Googlebot). These build the real-time index the AI uses to answer questions. Blocking them removes you from ChatGPT Search results, Claude's search results, Perplexity answers, Copilot, and AI Overviews. The cost is immediate and measurable in lost referral traffic.
- User-triggered fetchers (ChatGPT-User, Claude-User, Perplexity-User). These fire when a logged-in user pastes your URL or asks the assistant to read a specific page. Blocking them means the assistant will tell its user it cannot access your page. This is the most embarrassing block — it surfaces directly in the user's chat as a refusal.
Most robots.txt files we have audited do not distinguish among these three. A site that adds Disallow: / under User-agent: * to stop training is also, accidentally, telling the search crawler not to index it and the user-triggered fetcher to refuse direct user requests.
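The inheritance effect is easy to reproduce with Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: the robots.txt content, the example.com URL, and the `CATEGORIES` grouping (copied from the three lists above) are assumptions for the demo, and individual crawlers may match rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt following the "allow Googlebot, block the rest" model.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Grouping taken from the three crawler categories described above.
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended",
                 "Applebot-Extended", "CCBot", "Bytespider"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
               "Bingbot", "Googlebot"],
    "user-triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def blocked_by_category(robots_txt: str, url: str) -> dict:
    """Return, per category, the bots that may NOT fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        category: [bot for bot in bots if not parser.can_fetch(bot, url)]
        for category, bots in CATEGORIES.items()
    }


result = blocked_by_category(ROBOTS_TXT, "https://example.com/article")
for category, bots in result.items():
    print(f"{category}: blocked = {bots}")
```

Only Googlebot has its own group here, so every other token, including the search and user-triggered ones, falls through to the wildcard and is blocked.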
What Changed Between February and May 2026
Four concrete vendor changes in the last 90 days that most site owners have not yet caught up to:
Anthropic's three-bot framework (February 2026)
Anthropic updated its crawler documentation in February 2026 to formally describe three distinct bots. The split was first spotted on February 20 and the rationale is the standard one: separate training from search from user-triggered fetches so site owners can make granular decisions.
- ClaudeBot — training. Same user-agent as the 2024 bot. If you already had a rule for this, it still applies.
- Claude-SearchBot — search/retrieval. New token. Powers Claude's search results.
- Claude-User — user-triggered fetcher. Fires when a Claude user explicitly asks the assistant to read a URL.
A site that had User-agent: ClaudeBot / Disallow: / in its robots.txt is unaffected by the split, because the training bot kept its token. A site that had User-agent: * / Disallow: / and assumed it was only blocking training is now also blocking search and user-triggered fetches that it probably did not intend to block.
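The difference between the two configurations can be checked side by side. This is a minimal sketch using Python's `urllib.robotparser`; the two robots.txt strings and the example.com URL are made up for illustration, and the parser's matching is only an approximation of how Anthropic's crawlers evaluate rules.

```python
from urllib.robotparser import RobotFileParser

ANTHROPIC_BOTS = ["ClaudeBot", "Claude-SearchBot", "Claude-User"]

# Configuration 1: a rule aimed at the training bot only.
TARGETED = """\
User-agent: ClaudeBot
Disallow: /
"""

# Configuration 2: a wildcard block written with training in mind.
WILDCARD = """\
User-agent: *
Disallow: /
"""


def allowed_bots(robots_txt: str) -> dict:
    """Map each Anthropic token to whether it may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, "https://example.com/")
            for bot in ANTHROPIC_BOTS}


targeted = allowed_bots(TARGETED)
wildcard = allowed_bots(WILDCARD)
print("targeted:", targeted)
print("wildcard:", wildcard)
```

Under the targeted rule only ClaudeBot is refused; under the wildcard all three tokens are refused.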
OpenAI's OAI-AdsBot (April 2026)
OpenAI added OAI-AdsBot to its public crawler docs in April 2026. The change was first spotted publicly on April 21, when SEO consultant Glenn Gabe shared the updated documentation. The bot is narrow in scope: it only visits pages that an advertiser has submitted as an ad landing page in ChatGPT. The user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-AdsBot/1.0; +https://openai.com/adsbot.
Two things to know. First, OpenAI says the data is not used to train its foundation models — the bot's only purpose is ad-policy validation. Second, unlike OpenAI's other three bots, there is no openai.com/adsbot.json IP range published at the time of writing, which means firewalls cannot allowlist it the same way as GPTBot or OAI-SearchBot. If you run ChatGPT ads, a wildcard block in robots.txt will quietly cause OpenAI's bot to fail its policy check and may delay your ad approval.
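Because there is no published IP range to match against, the user-agent string is the main way to spot this bot in access logs. The following is a rough sketch of token matching; `KNOWN_TOKENS` and `identify_bot` are hypothetical names, the token list covers only bots named in this article, and a matching string says nothing about whether the request is genuine, since user-agents can be spoofed.

```python
import re
from typing import Optional

# Product tokens named in this article; extend from vendor docs as needed.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "OAI-AdsBot",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]

# Match a token only when it is not part of a longer word or hyphenated name.
_TOKEN_RE = re.compile(
    r"(?<![\w-])(" + "|".join(re.escape(t) for t in KNOWN_TOKENS) + r")(?![\w-])",
    re.IGNORECASE,
)
_CANONICAL = {t.lower(): t for t in KNOWN_TOKENS}


def identify_bot(user_agent: str) -> Optional[str]:
    """Return the canonical bot token found in a user-agent string, if any."""
    match = _TOKEN_RE.search(user_agent)
    return _CANONICAL[match.group(1).lower()] if match else None


ADSBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
             "compatible; OAI-AdsBot/1.0; +https://openai.com/adsbot")
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

print(identify_bot(ADSBOT_UA))
print(identify_bot(BROWSER_UA))
```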
Applebot-Extended's quiet traffic surge (Q1-Q2 2026)
Applebot-Extended itself is not new. Apple introduced the token in 2024 as an opt-out signal for Apple Intelligence training. What changed is volume. Cloudflare and Applebot traffic data through April 2026 shows Apple's crawler share rose from a low base in late 2025 to 9.23% of AI crawler traffic in April 2026, its second consecutive double-digit-percent monthly gain. That puts it within about 0.6 percentage points of GPTBot.
The blocking rate has not kept up. Around 25% of surveyed news sites block Applebot-Extended; most non-news sites have no rule for it at all. The New York Times, The Financial Times, The Atlantic, Vox Media, and Condé Nast are blocking it. Almost everyone else has given implicit consent through a missing rule. Worth deciding deliberately.
Perplexity's user-agent stance (ongoing, still contested)
Perplexity's split between PerplexityBot (the crawler) and Perplexity-User (the user-triggered fetcher) has been documented for a while, but the company's public position is that Perplexity-User is "an agent, not a bot" and therefore is not required to honor robots.txt directives. Separately, Cloudflare documented in August 2025 that Perplexity was using stealth, undeclared crawlers to bypass blocks. Perplexity's position on Perplexity-User has not formally changed in 2026.
Practical implication: a rule like User-agent: Perplexity-User / Disallow: / is what Perplexity asks you to write, but the user-triggered traffic may continue regardless. The cleaner signal is at the firewall layer using the published perplexity-user.json IP range, not at the robots.txt layer.
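A firewall-layer check comes down to testing whether a request's source address falls inside a published CIDR range. The sketch below uses Python's `ipaddress` module. The JSON shape (`prefixes` / `ipv4Prefix`) and the addresses are placeholders chosen for this example, not Perplexity's actual data, so adapt the loader to whatever the published file really contains.

```python
import ipaddress
import json

# Hypothetical payload: the field names are an assumption and the CIDR blocks
# are reserved documentation ranges, not real Perplexity addresses.
SAMPLE_RANGES_JSON = """
{"prefixes": [
    {"ipv4Prefix": "203.0.113.0/24"},
    {"ipv4Prefix": "198.51.100.16/28"}
]}
"""


def load_networks(payload: str) -> list:
    """Parse the JSON payload into a list of IPv4 network objects."""
    data = json.loads(payload)
    return [ipaddress.ip_network(item["ipv4Prefix"]) for item in data["prefixes"]]


def is_in_ranges(ip: str, networks: list) -> bool:
    """Return True if the address belongs to any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
for candidate in ["203.0.113.42", "198.51.100.20", "198.51.100.32", "192.0.2.1"]:
    print(candidate, is_in_ranges(candidate, networks))
```

In practice the same membership test would live in the firewall or CDN rule set rather than in application code; the Python version only shows the logic.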
How Discovery Mechanisms Are Shifting
Robots.txt is no longer the only signal AI crawlers consult to decide what to fetch. A second, complementary mechanism has been quietly gaining adoption since mid-2024: /llms.txt, a plain-text file at the root of your domain that points AI systems at the documents you want them to use. Our companion post on llms.txt one year later walks through actual adoption rates and which AI systems honor it. The short version: robots.txt still decides whether a crawler can fetch you, but llms.txt increasingly decides what it fetches first. The two files solve different parts of the same problem and our audit checks both.
If you treat AI crawler discovery as a robots.txt-only problem in 2026, you are arguing with the wrong file. A site can be technically allowed by robots.txt and still be invisible in AI answers because the crawler never picks it out of the long tail. The combination (explicit Allow: for the search bots in robots.txt, plus an llms.txt that tells those bots what to prioritize) is the configuration that maps cleanly to the new crawler topology.
The Q1 2026 Blocking Data
The most recent network-level snapshot we have is the TechnologyChecker / Cloudflare-network analysis, published April 3, 2026 and updated May 2. They parsed 4,128 robots.txt files. The Disallow-share rankings:
| Crawler | Disallow share | Notes |
|---|---|---|
| GPTBot | 5.52% | Most-blocked AI bot for the third consecutive quarter |
| CCBot | 5.08% | Common Crawl — upstream source for most open LLM datasets |
| ClaudeBot | 4.88% | Up from ~3% in 2024; Q1 2026 was the fastest-growing block |
| Google-Extended | 4.44% | Gemini training opt-out |
| Bytespider | 4.23% | TikTok parent ByteDance |
| Applebot-Extended | 3.67% | Disproportionately concentrated in news publishers |
| PerplexityBot | n/a (5.16% Allow share) | Appears in Allow rules rather than Disallow rules; net-positive among publishers, as most explicit rules welcome it |
Two things jump out. First, the absolute numbers are small: for every crawler in the table, more than 94 percent of the parsed robots.txt files carry no Disallow rule for it, which makes the cumulative AI traffic share possible. Second, the rankings have barely moved since 2024. The robots.txt files in the wild are not tracking the crawler split-and-rename activity of the last year. New tokens like Claude-SearchBot and OAI-AdsBot do not appear in the ranking at all, because almost nobody has added rules for them yet.
That is the opportunity and the risk. The opportunity: if you do nothing, you are implicitly granting access to all the new bots and getting whatever AI-search visibility comes with that. The risk: if your existing wildcard rules accidentally block them, you are losing visibility on surfaces you probably wanted to be on, and you will not see it in any crawl-log report, because a bot that honors the Disallow never requests the page and a bot stopped at the firewall only leaves 403s you never looked at.
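The "small absolute numbers" point can be checked with quick arithmetic on the Disallow shares in the table. The snippet only restates the published percentages as their complements and derives no new data.

```python
# Disallow shares (percent of the 4,128 parsed robots.txt files) from the table.
DISALLOW_SHARE = {
    "GPTBot": 5.52,
    "CCBot": 5.08,
    "ClaudeBot": 4.88,
    "Google-Extended": 4.44,
    "Bytespider": 4.23,
    "Applebot-Extended": 3.67,
}

# Share of files that carry no Disallow rule for each bot.
not_blocking = {bot: round(100 - share, 2) for bot, share in DISALLOW_SHARE.items()}

for bot, share in not_blocking.items():
    print(f"{bot}: {share:.2f}% of files do not disallow it")

# The per-bot shares cannot simply be added together: one file can block
# several bots, so the figures overlap.
lowest = min(not_blocking.values())
print(f"Lowest non-blocking share: {lowest:.2f}%")
```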
A 7-Minute Self-Audit for Your robots.txt
Pull up your site's robots.txt (just append /robots.txt to your domain). Walk it through this:
- Find your wildcard block. Look for User-agent: * followed by Disallow: / or Disallow: /private/ patterns. Every AI crawler that does not have its own dedicated section will inherit this rule.
- List every User-agent: directive you have explicit rules for. If the list does not include all three of ClaudeBot, Claude-SearchBot, and Claude-User, you are governing Anthropic's bots through inheritance, not intention. Same exercise for the OpenAI four (GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot) and the Perplexity two (PerplexityBot, Perplexity-User).
- Decide on training separately from search. A defensible 2026 policy is: training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Bytespider) get whatever rule matches your stance on AI training data; search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot, Googlebot) get Allow: /. The two decisions are independent and should not be coupled by accident.
- Decide on user-triggered fetchers explicitly. ChatGPT-User, Claude-User, and Perplexity-User fire when a real user is asking the assistant about your specific page. The cost of blocking them is "the assistant tells the user it cannot read your page," in front of that user, in real time. Almost no site wants this to happen unless they are deliberately blocking AI assistants entirely.
- Confirm the bot list against an authoritative source. Vendor docs are the source of truth: OpenAI's crawler docs, Anthropic's bot documentation, Perplexity crawler docs, Google's crawler list, and Apple's Applebot docs. Anything you have not verified against the vendor in the last six months is a candidate for re-checking.
- Run an audit. Our audit flags the AI-search bots that have no explicit rule in your robots.txt and tells you which dimension of AI visibility the omission affects. The bot list is part of the catalog we maintain against vendor documentation; when we add a newly-documented bot to the catalog, every page audited from that point on starts checking it. If your robots.txt has not been touched since 2024, the audit will surface every new bot in one report.
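The first two steps of the checklist can be approximated in a few lines of Python. This is a simplified sketch, not the audit described above: `KNOWN_BOTS`, `explicit_agents`, and `audit` are hypothetical names, the sample robots.txt is invented, and the script only reports whether a token has its own User-agent line, without evaluating the rules under it.

```python
# Tokens grouped by vendor, as listed in this article.
KNOWN_BOTS = {
    "Anthropic": ["ClaudeBot", "Claude-SearchBot", "Claude-User"],
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "OAI-AdsBot"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
}

SAMPLE_ROBOTS_TXT = """\
# Block training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def explicit_agents(robots_txt: str) -> set:
    """Collect every token named on a User-agent line (lowercased)."""
    agents = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def audit(robots_txt: str) -> dict:
    """Label each known bot as 'explicit' or 'inherited' (wildcard or default)."""
    named = explicit_agents(robots_txt)
    return {
        vendor: {bot: "explicit" if bot.lower() in named else "inherited"
                 for bot in bots}
        for vendor, bots in KNOWN_BOTS.items()
    }


report = audit(SAMPLE_ROBOTS_TXT)
for vendor, bots in report.items():
    print(vendor, bots)
```

Anything labelled "inherited" is being governed by the wildcard group, or by default-allow if no wildcard group exists.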
What This Means for Your Q3 2026 Roadmap
Three concrete things worth queuing.
Make the training-vs-search decision explicit. If you do not have separate User-agent: sections for the training bots and the search bots, you are coupling two independent decisions and probably under- or over-blocking on at least one of them. The split takes ten minutes to write and resolves a class of ambiguity that has only gotten worse as vendors have splintered their crawlers.
Add the four 2026-documented user-agents (Claude-SearchBot, Claude-User, OAI-AdsBot, plus an explicit Perplexity-User decision). Even if your policy is "default-allow," the explicit rule is the documentation of the decision. Site owners six months from now — including future you — will not have to reverse-engineer the wildcard inheritance to understand what is governing what.
Re-check the catalog twice a year. The bot landscape was stable from roughly mid-2024 through late 2025 and has been moving steadily through Q1-Q2 2026.
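As one way to act on the first two items, the sketch below renders a robots.txt with separate groups per crawler category and then checks it with `urllib.robotparser`. The policy itself (block training, allow everything else, hide /private/ from unlisted bots) is an assumption chosen for illustration, not a recommendation, and `GROUPS` and `render_robots_txt` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy only. Each entry: (comment, user-agent tokens, rule line).
GROUPS = [
    ("Training crawlers",
     ["GPTBot", "ClaudeBot", "Google-Extended",
      "Applebot-Extended", "CCBot", "Bytespider"],
     "Disallow: /"),
    ("Search and retrieval crawlers",
     ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Bingbot", "Googlebot"],
     "Allow: /"),
    ("User-triggered fetchers",
     ["ChatGPT-User", "Claude-User", "Perplexity-User"],
     "Allow: /"),
    ("Ad landing-page validation", ["OAI-AdsBot"], "Allow: /"),
    ("Everyone else", ["*"], "Disallow: /private/"),
]


def render_robots_txt(groups: list) -> str:
    """Render one commented group per category, separated by blank lines."""
    blocks = []
    for comment, agents, rule in groups:
        lines = [f"# {comment}"]
        lines += [f"User-agent: {agent}" for agent in agents]
        lines.append(rule)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


robots_txt = render_robots_txt(GROUPS)
print(robots_txt)

# Sanity-check the rendered file against a few tokens.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
checks = {
    bot: parser.can_fetch(bot, "https://example.com/post")
    for bot in ["GPTBot", "Claude-SearchBot", "Perplexity-User", "OAI-AdsBot"]
}
print(checks)
```

One side effect to keep in mind: a crawler with its own group no longer inherits the wildcard rules, so a path-level exclusion such as /private/ has to be repeated in every group where it should still apply.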
Update — shipped 2026-05-25. Five of the eight newly-documented or rapidly-growing crawlers are now part of our default audit catalog: Claude-SearchBot, Claude-User, Applebot-Extended, Perplexity-User, and OAI-AdsBot. Re-audit any site to see them surface alongside the existing ten. The three not yet added — Amazonbot, Meta-ExternalAgent, MistralAI-User — are held for the Q4 review pending traffic-share data we trust; none currently appear in the Cloudflare Q1 2026 top-six blocked ranking. The threshold for inclusion is unchanged: documented user-agent in the vendor's own docs, plus evidence of measurable user-visible impact when blocked.
The robots.txt file is one of the cheapest configuration assets a site owns. It is also one of the most neglected. The default of "implicit consent through silence" worked when there were three crawlers to think about. With eleven-plus, the silence is doing more work than most site owners want it to.
Want to see which AI crawlers have explicit rules on your site and which are being governed by inheritance? Run a free audit at hybridranking.com. The bot-access check is part of the AI-visibility surface we evaluate, and the catalog is refreshed against vendor docs as new tokens are published.
Sources
- Anthropic clarifies its three Claude bots — Search Engine Land
- Anthropic Updates Its Crawler Documentation — SE Roundtable
- OpenAI's Crawler Docs Now List OAI-AdsBot — Search Engine Journal
- OpenAI's new OAI-AdsBot is quietly crawling landing pages — PPC Land
- Robots.txt Analysis Across Cloudflare's Network, Q1 2026 — TechnologyChecker
- Web Traffic Statistics Q1 2026 — TechnologyChecker
- Major Publishers Opt Out of Apple's AI Scraping — OpenTools
- Perplexity Crawlers — Perplexity official documentation
- Perplexity using stealth undeclared crawlers — Cloudflare blog
- AI User-Agent Landscape 2026 — No Hacks reference