# The AI Crawler Registry

> Canonical reference of the AI crawlers and agent user-agents on the web (June 2026).
>
> 'purpose' is what the operator says the bot does. 'verify' is how to confirm that a request claiming this UA is genuine. User-agent strings are trivially spoofed, so verification is by published IP ranges or reverse DNS. No IP addresses are listed here; we link to each operator's authoritative range file instead.
>
> This enriched edition backfills the 18 existing records to the full 25-attribute EAV depth defined in research/briefs/crawlers.md, plus S-P-O relationship triples. Every sourced value carries its primary 'source' URL and 'last_verified'; any value not confirmable from a primary source is recorded as a structured placeholder ({value:null, verify_status:'verify-against-primary-at-build', source_hint:<url>}) rather than fabricated.
>
> Bot-type enum = the cited 6-type set {training, search-index, user-action-fetcher, opt-out-token, agentic-browser, undocumented} + the Agents Welcome 'data-provider' extension (flagged as such).
>
> Updated 2026-06-15. JSON: /api/crawlers · single record: /api/crawlers/{id}
> Verify any UA: /api/verify-crawler?ua=<string>
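
The structured placeholder described in the summary above could be modelled roughly as follows. This is a minimal sketch, not the registry's actual schema: the field names `value`, `verify_status`, `source_hint`, `source`, and `last_verified` come from the summary, while the class `AttributeValue`, the helpers `placeholder` and `is_confirmed`, and the example URL are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Status string taken from the registry summary.
UNVERIFIED = "verify-against-primary-at-build"


@dataclass(frozen=True)
class AttributeValue:
    """One attribute of a crawler record (hypothetical shape)."""
    value: Optional[str]
    verify_status: str
    source: Optional[str] = None         # primary source URL, when confirmed
    last_verified: Optional[str] = None  # ISO date, when confirmed
    source_hint: Optional[str] = None    # where to look, when not confirmed


def placeholder(source_hint: str) -> AttributeValue:
    """Record an unconfirmed value without inventing one."""
    return AttributeValue(value=None, verify_status=UNVERIFIED, source_hint=source_hint)


def is_confirmed(attr: AttributeValue) -> bool:
    """Treat a value as confirmed only if it has a value, a source, and a date."""
    return (
        attr.value is not None
        and attr.source is not None
        and attr.last_verified is not None
    )


# Hypothetical URL, used only to show the shape of a placeholder.
unknown = placeholder("https://example.com/bot-docs")
```

A consumer can then skip or flag any attribute for which `is_confirmed` returns `False` instead of treating a missing value as a negative answer.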

## ClaudeBot (claudebot)

- **Operator:** Anthropic
- **Purpose:** training
- **robots.txt token:** `ClaudeBot`
- **Honors robots.txt:** yes
- **Verify:** reverse DNS (Anthropic does not publish an IP-range file; confirm the PTR resolves to an Anthropic-controlled host)
- **Notes:** Crawls content used to train Claude. Honors robots.txt and crawl-delay.

## Claude-User (claude-user)

- **Operator:** Anthropic
- **Purpose:** inference
- **robots.txt token:** `Claude-User`
- **Honors robots.txt:** yes
- **Verify:** reverse DNS to an Anthropic host
- **Notes:** Fetches a page in real time when a Claude user's prompt references it. User-initiated, not bulk crawling.

## Claude-SearchBot (claude-searchbot)

- **Operator:** Anthropic
- **Purpose:** search
- **robots.txt token:** `Claude-SearchBot`
- **Honors robots.txt:** yes
- **Verify:** reverse DNS to an Anthropic host
- **Notes:** Indexes pages to power Claude's search results.

## GPTBot (gptbot)

- **Operator:** OpenAI
- **Purpose:** training
- **robots.txt token:** `GPTBot`
- **Honors robots.txt:** yes
- **Verify:** published IP ranges at openai.com/gptbot.json
- **Notes:** Crawls content that may be used to train OpenAI models.
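
Checking a request against a published range file can be sketched as below. The example assumes a prefix-list JSON layout (`prefixes` entries with `ipv4Prefix` or `ipv6Prefix` keys); confirm the real layout against the operator's file before relying on it. The inline sample uses documentation-only addresses, not real crawler ranges, and the function names are hypothetical.

```python
import ipaddress
import json

# Inline sample in an assumed prefix-list layout. The addresses are
# documentation ranges (RFC 5737 / RFC 3849), not real crawler ranges.
SAMPLE_RANGE_FILE = """
{
  "creationTime": "2026-06-15T00:00:00",
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(range_json: str):
    """Parse a prefix-list document into ip_network objects."""
    doc = json.loads(range_json)
    networks = []
    for entry in doc.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(remote_ip: str, networks) -> bool:
    """Return True if remote_ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr.version == net.version and addr in net for net in networks)


networks = load_networks(SAMPLE_RANGE_FILE)
```

In practice the JSON would be fetched from the operator's published URL and cached, and `remote_ip` would be the connecting address seen at the edge, not a value taken from a client-supplied header.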

## OAI-SearchBot (oai-searchbot)

- **Operator:** OpenAI
- **Purpose:** search
- **robots.txt token:** `OAI-SearchBot`
- **Honors robots.txt:** yes
- **Verify:** published IP ranges (openai.com/searchbot.json)
- **Notes:** Surfaces and links sites in ChatGPT search. Does not train models.

## ChatGPT-User (chatgpt-user)

- **Operator:** OpenAI
- **Purpose:** inference
- **robots.txt token:** `ChatGPT-User`
- **Honors robots.txt:** yes
- **Verify:** published IP ranges (openai.com/chatgpt-user.json)
- **Notes:** User-triggered fetch when a ChatGPT user or a GPT action requests a specific URL.

## PerplexityBot (perplexitybot)

- **Operator:** Perplexity
- **Purpose:** search
- **robots.txt token:** `PerplexityBot`
- **Honors robots.txt:** yes
- **Verify:** published IP ranges (perplexity.ai publishes perplexitybot ranges)
- **Notes:** Indexes pages so they can be cited as sources in Perplexity answers.

## Perplexity-User (perplexity-user)

- **Operator:** Perplexity
- **Purpose:** inference
- **robots.txt token:** `Perplexity-User`
- **Honors robots.txt:** no
- **Verify:** published IP ranges (perplexity.ai)
- **Notes:** Real-time fetch in response to a user question. Per Perplexity, user-initiated fetches are not treated as automated crawling and may ignore robots.txt — verify and rate-limit at the edge if that matters to you.

## Google-Extended (google-extended)

- **Operator:** Google
- **Purpose:** training
- **robots.txt token:** `Google-Extended`
- **Honors robots.txt:** yes
- **Verify:** not applicable — makes no HTTP requests
- **Notes:** A robots.txt policy token, NOT a crawler. It makes no requests and never appears in logs; disallowing it opts your content out of Gemini/Vertex training while leaving Google Search crawling untouched.
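
The effect of the policy token can be checked locally with Python's standard `urllib.robotparser`. This is only a simulation of how the rules read; the robots.txt content and the example.com URL are illustrative, and Google's own systems decide how the token is applied.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: opt out of Google-Extended, leave everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The policy token is disallowed everywhere.
training_allowed = parser.can_fetch("Google-Extended", "https://example.com/article")
# Googlebot has no group of its own, so it falls through to the wildcard group.
search_allowed = parser.can_fetch("Googlebot", "https://example.com/article")
```

Here `training_allowed` evaluates to `False` and `search_allowed` to `True`, which matches the opt-out behavior described in the notes.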

## GoogleOther (googleother)

- **Operator:** Google
- **Purpose:** search
- **robots.txt token:** `GoogleOther`
- **Honors robots.txt:** yes
- **Verify:** Google IP ranges at gstatic.com/ipranges/goog.json + reverse DNS to googlebot.com / google.com
- **Notes:** Generic Google crawler used by various teams for research and product development.

## Google-CloudVertexBot / Gemini agents (gemini-deep-research)

- **Operator:** Google
- **Purpose:** inference
- **robots.txt token:** `Google-CloudVertexBot`
- **Honors robots.txt:** yes
- **Verify:** Google IP ranges (gstatic.com/ipranges)
- **Notes:** Fetches site content on behalf of Vertex AI agents built by site owners.

## Bingbot (bingbot)

- **Operator:** Microsoft
- **Purpose:** search
- **robots.txt token:** `Bingbot`
- **Honors robots.txt:** yes
- **Verify:** reverse DNS to search.msn.com + forward-confirm; Bing publishes a verification tool and IP list
- **Notes:** Powers Bing and, by extension, Copilot search grounding.
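
Reverse DNS with forward confirmation can be sketched as follows: look up the PTR name for the connecting IP, check that it ends in an expected suffix, then resolve that name forward and confirm it maps back to the same IP. The function names are hypothetical, and the resolvers are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup via the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """A/AAAA lookup via the system resolver."""
    return sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True if the PTR name ends in an allowed suffix AND resolves back to ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Require a label boundary so "evilsearch.msn.com" does not match.
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        # Plain string comparison; assumes both sides use the same text form.
        return ip in forward(hostname)
    except OSError:
        return False
```

For this record the call would look like `forward_confirmed_rdns(remote_ip, ["search.msn.com"])`. The forward step matters because anyone who controls the reverse zone for their own IP space can set a PTR record to any name.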

## Amazonbot (amazonbot)

- **Operator:** Amazon
- **Purpose:** search
- **robots.txt token:** `Amazonbot`
- **Honors robots.txt:** yes
- **Verify:** reverse DNS to crawl.amazonbot.amazon + Amazon's published ranges
- **Notes:** Improves Alexa answers and supports Amazon's AI products.

## Applebot-Extended (applebot-extended)

- **Operator:** Apple
- **Purpose:** training
- **robots.txt token:** `Applebot-Extended`
- **Honors robots.txt:** yes
- **Verify:** not applicable — policy token; the underlying Applebot verifies via reverse DNS to applebot.apple.com
- **Notes:** Policy token: disallowing it opts content out of Apple Intelligence / foundation-model training without blocking Applebot's search crawling.

## Meta-ExternalAgent (meta-externalagent)

- **Operator:** Meta
- **Purpose:** training
- **robots.txt token:** `meta-externalagent`
- **Honors robots.txt:** yes
- **Verify:** Meta publishes crawler IP ranges; confirm against those
- **Notes:** Crawls content to train Meta's Llama models and AI products.

## CCBot (ccbot)

- **Operator:** Common Crawl
- **Purpose:** training
- **robots.txt token:** `CCBot`
- **Honors robots.txt:** yes
- **Verify:** Common Crawl publishes its crawler IP ranges
- **Notes:** Builds the open Common Crawl corpus that many model trainers ingest downstream. Blocking CCBot blocks an upstream training-data source for the whole ecosystem.

## Bytespider (bytespider)

- **Operator:** ByteDance
- **Purpose:** training
- **robots.txt token:** `Bytespider`
- **Honors robots.txt:** no
- **Verify:** no authoritative published range file; treat unverified Bytespider traffic with suspicion
- **Notes:** Has a reputation for aggressive crawling and inconsistent robots.txt adherence. Rate-limit at the edge if it causes load.
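
Edge rate limiting is usually configured in a CDN or WAF, but the underlying idea can be sketched as a token bucket. The class below is a hypothetical illustration, not a production limiter: the key could be the claimed UA token or the client IP, and the clock is injectable so the behavior can be tested deterministically.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Per-key token bucket: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        # key -> (tokens remaining, time of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str) -> bool:
        """Consume one token for `key`; return False if the bucket is empty."""
        now = self.clock()
        tokens, last = self._state.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[key] = (tokens - 1.0, now)
            return True
        self._state[key] = (tokens, now)
        return False
```

A request handler would call `bucket.allow("bytespider")` and answer with HTTP 429 when it returns `False`. Keying on client IP instead of UA token avoids throttling a genuine crawler because of spoofed traffic.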

## DuckAssistBot (duckassistbot)

- **Operator:** DuckDuckGo
- **Purpose:** inference
- **robots.txt token:** `DuckAssistBot`
- **Honors robots.txt:** yes
- **Verify:** DuckDuckGo publishes bot details; confirm against those
- **Notes:** Fetches content for DuckDuckGo's AI assist answers.

## OAI-AdsBot (oai-adsbot)

- **Operator:** OpenAI
- **Purpose:** ad-verification
- **robots.txt token:** `OAI-AdsBot`
- **Honors robots.txt:** yes
- **Verify:** published IP ranges (OpenAI publishes per-bot range files); confirm against the OpenAI bots documentation
- **Notes:** Validates ad landing pages for OpenAI's advertising products. Listed alongside GPTBot/OAI-SearchBot/ChatGPT-User in OpenAI's bots documentation.

## Google-Agent (google-agent)

- **Operator:** Google
- **Purpose:** inference
- **robots.txt token:** `Google-Agent`
- **Honors robots.txt:** no
- **Verify:** Google IP ranges (user-triggered-agents.json) + reverse DNS to google.com / googleusercontent.com
- **Notes:** User-triggered fetcher used by agents hosted on Google infrastructure to navigate the web and perform actions on a user's request (for example, Project Mariner / Gemini Agent). As a user-triggered fetcher, Google documents that it generally ignores robots.txt rules.

## MistralAI-User (mistralai-user)

- **Operator:** Mistral AI
- **Purpose:** inference
- **robots.txt token:** `MistralAI-User`
- **Honors robots.txt:** yes
- **Verify:** published IP ranges at mistral.ai/mistralai-user-ips.json
- **Notes:** Fetches a page in real time when a Mistral (Le Chat) user's request references it. Per Mistral, the MistralAI-User token governs which sites these user-initiated requests can be made to.

## Diffbot (diffbot)

- **Operator:** Diffbot
- **Purpose:** data-aggregation
- **robots.txt token:** `Diffbot`
- **Honors robots.txt:** yes
- **Verify:** no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents that Crawlbot adheres to robots.txt by default.
- **Notes:** Diffbot's Crawlbot extracts and structures web content into a knowledge graph sold to customers (market intelligence, e-commerce, AI training). Registered as a 'data-provider' (Agents Welcome taxonomy extension). Diffbot documents that crawls adhere to robots.txt (disallow + crawl-delay) by default.

## Diffbot-User (diffbot-user)

- **Operator:** Diffbot
- **Purpose:** inference
- **robots.txt token:** `Diffbot-User`
- **Honors robots.txt:** yes
- **Verify:** no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents the token for on-behalf-of fetches.
- **Notes:** Used for requests made on behalf of human users browsing URLs through Diffbot software, as distinct from Diffbot's proactive Crawlbot. Diffbot documents both 'Diffbot' and 'Diffbot-User' as robots.txt user-agents.
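
Because one token can be a prefix of another (`Diffbot` and `Diffbot-User`), a lookup that maps a User-Agent header to a registry token should try longer tokens first. The sketch below shows one way to do that; the function name, the token subset, and the sample User-Agent strings in the usage note are assumptions for illustration, and the result is only the claimed identity, which still needs network-level verification.

```python
from typing import Optional, Sequence

# A few robots.txt tokens from this registry (subset, for illustration).
TOKENS = ["Diffbot", "Diffbot-User", "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-User"]


def claimed_token(user_agent: str, tokens: Sequence[str] = TOKENS) -> Optional[str]:
    """Return the registry token a User-Agent header claims, if any.

    Longer tokens are tried first so "Diffbot-User" is not reported as "Diffbot".
    The match is case-insensitive and says nothing about authenticity.
    """
    ua = user_agent.lower()
    for token in sorted(tokens, key=len, reverse=True):
        if token.lower() in ua:
            return token
    return None
```

For example, `claimed_token("Mozilla/5.0 (compatible; Diffbot-User/1.0)")` returns `"Diffbot-User"`, while an ordinary browser string with no registry token returns `None`.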

## ImagesiftBot (imagesiftbot)

- **Operator:** ImageSift (Hive)
- **Purpose:** data-aggregation
- **robots.txt token:** `ImagesiftBot`
- **Honors robots.txt:** yes
- **Verify:** verify by user-agent + edge controls; ImageSift documents robots.txt adherence (incl. crawl-delay) and Googlebot-directive fallback. No operator-published IP-range file confirmed.
- **Notes:** Crawls the web for publicly available images, analyzing and indexing them to power ImageSift's web-intelligence products. Operated by ImageSift (a Hive product). Registered as a 'data-provider' (Agents Welcome taxonomy extension).

## ICC-Crawler (icc-crawler)

- **Operator:** NICT (National Institute of Information and Communications Technology)
- **Purpose:** training
- **robots.txt token:** `ICC-Crawler`
- **Honors robots.txt:** yes
- **Verify:** verify by user-agent + edge controls; the ai.robots.txt registry records respects-robots = Yes. No operator-published IP-range file confirmed.
- **Notes:** Crawls data to train and support AI technologies; NICT (Japan) uses the collected data for AI and may provide it to third parties, including commercial companies. Token and operator recorded in the ai.robots.txt machine-readable registry.

## cohere-ai (cohere-ai)

- **Operator:** Cohere
- **Purpose:** inference
- **robots.txt token:** `cohere-ai`
- **Honors robots.txt:** no
- **Verify:** verify by user-agent + edge controls; no operator-published IP-range file confirmed and robots.txt adherence is unclear per the registry.
- **Notes:** Retrieves data to provide responses to user-initiated prompts (Cohere products). Token and operator recorded in the ai.robots.txt machine-readable registry; the registry marks robots.txt respect as 'Unclear at this time'.

## Meta-WebIndexer (meta-webindexer)

- **Operator:** Meta
- **Purpose:** search
- **robots.txt token:** `Meta-WebIndexer`
- **Honors robots.txt:** no
- **Verify:** Meta publishes crawler IP ranges; confirm against those. Meta documents that allowing Meta-WebIndexer in robots.txt lets Meta AI cite and link your content.
- **Notes:** Per Meta's documentation, the Meta-WebIndexer crawler navigates the web to improve Meta AI search result quality; allowing it in robots.txt helps Meta AI cite and link your content in its responses. Token and operator-doc reference recorded in the ai.robots.txt machine-readable registry.

## ChatGPT Atlas (agent mode) (chatgpt-atlas)

- **Operator:** OpenAI
- **Purpose:** agentic-browsing
- **robots.txt token:** `(none — agentic browser; no published robots.txt token)`
- **Honors robots.txt:** no
- **Verify:** no stable user-agent and (per OpenAI enterprise docs) no IP allowlist; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token. Treat as user-driven browser traffic.
- **Notes:** OpenAI's ChatGPT Atlas browser (launched 2025-10-21) embeds ChatGPT into web navigation; its 'agent mode' takes actions on the user's behalf inside the browser. As a local Chromium-based browser it presents like ordinary browser traffic with no stable AI user-agent token — included here per the agentic-browser taxonomy, verifiable by IP/signature only.

## Perplexity Comet (assistant/agent) (perplexity-comet)

- **Operator:** Perplexity
- **Purpose:** agentic-browsing
- **robots.txt token:** `(none — agentic browser; no published robots.txt token)`
- **Honors robots.txt:** no
- **Verify:** no stable user-agent and no verifiable identity layer; Comet runs inside the user's browser session and presents like ordinary Chromium traffic. Distinct from PerplexityBot/Perplexity-User (which are cloud bots verifiable by IP range + perplexity.ai in the UA).
- **Notes:** Perplexity's Comet is a Chromium-based browser fork that runs locally and performs multi-tab agentic actions inside the user's session. Unlike Perplexity's cloud crawlers, it has no verifiable identity layer at the network level: there is no UA token or published IP range to check it against. Included here per the agentic-browser taxonomy.

## OpenAI Operator (Computer-Using Agent) (openai-operator)

- **Operator:** OpenAI
- **Purpose:** agentic-browsing
- **robots.txt token:** `(none — agentic browser/agent; no published robots.txt token)`
- **Honors robots.txt:** no
- **Verify:** no stable user-agent token; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token.
- **Notes:** OpenAI's Operator (released 2025-01-23) was a browsing agent powered by the Computer-Using Agent (CUA) model that performed online tasks in a browser on the user's behalf. It was deprecated after the release of ChatGPT agent and shut down on 2025-08-31. Retained here as a deprecated agentic-browser record for historical reference.

## Project Mariner (project-mariner)

- **Operator:** Google
- **Purpose:** agentic-browsing
- **robots.txt token:** `(none — agentic browser; no published robots.txt token; successor Google-Agent carries a token)`
- **Honors robots.txt:** no
- **Verify:** no stable user-agent token for the standalone product; identifiable only by IP/signature/behavior. Its functionality moved into the Google-Agent fetcher, which is verifiable via user-triggered-agents.json + reverse DNS to google.com.
- **Notes:** Google's Project Mariner (introduced Dec 2024 with Gemini 2.0) was an experimental web-browsing agent that navigated pages and took actions on a user's behalf via a Chrome extension. Google shut it down as a standalone product on 2026-05-04; its features moved into the Gemini API and Gemini Agent (see the Google-Agent record). Retained here as a deprecated agentic-browser record for historical reference.

_User-agent strings are trivially spoofed. Real verification is by published IP ranges, reverse DNS, or HTTP message signatures (Web Bot Auth)._
