GET /api/crawlers · 31 bots · updated 2026-06-15
The AI Crawler Registry
Every AI crawler and agent user-agent worth knowing, with what it's for, the token it answers to in robots.txt, whether it honors your rules, and — the part most lists skip — how to verify it, because user-agent strings are trivially spoofed.
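
To make the role of the robots.txt token concrete, here is a minimal sketch that evaluates a hypothetical policy against a few of the tokens in this registry, using Python's standard `urllib.robotparser`. The policy text, the `allowed` helper, and `example.com` are assumptions for illustration. The parser only models a well-behaved client; it says nothing about whether a given crawler actually obeys the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy (an assumption, not a recommendation):
# opt out of training crawlers, leave everything else allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the policy lets a client using `token` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    # Google-Extended is a policy token only: it never sends requests,
    # so it is listed in the policy but not queried here.
    for token in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"):
        print(token, allowed(token, "https://example.com/article"))
```

Tokens without their own group fall through to the `*` group, which is why the search crawlers stay allowed under this policy.
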
| Crawler | Operator | Purpose | robots.txt token | robots.txt behavior | How to verify | Notes |
|---|---|---|---|---|---|---|
| ClaudeBot | Anthropic | training | ClaudeBot | honors | reverse DNS (Anthropic does not publish an IP-range file; confirm the PTR resolves to an Anthropic-controlled host) | Crawls content used to train Claude. Honors robots.txt and crawl-delay. |
| Claude-User | Anthropic | inference | Claude-User | honors | reverse DNS to an Anthropic host | Fetches a page in real time when a Claude user's prompt references it. User-initiated, not bulk crawling. |
| Claude-SearchBot | Anthropic | search | Claude-SearchBot | honors | reverse DNS to an Anthropic host | Indexes pages to power Claude's search results. |
| GPTBot | OpenAI | training | GPTBot | honors | published IP ranges at openai.com/gptbot-ranges.json | Crawls content that may be used to train OpenAI models. |
| OAI-SearchBot | OpenAI | search | OAI-SearchBot | honors | published IP ranges (openai.com publishes searchbot ranges) | Surfaces and links sites in ChatGPT search. Does not train models. |
| ChatGPT-User | OpenAI | inference | ChatGPT-User | honors | published IP ranges (openai.com/chatgpt-user.json) | User-triggered fetch when a ChatGPT user or a GPT action requests a specific URL. |
| PerplexityBot | Perplexity | search | PerplexityBot | honors | published IP ranges (perplexity.ai publishes perplexitybot ranges) | Indexes pages so they can be cited as sources in Perplexity answers. |
| Perplexity-User | Perplexity | inference | Perplexity-User | ignores | published IP ranges (perplexity.ai) | Real-time fetch in response to a user question. Per Perplexity, user-initiated fetches are not treated as automated crawling and may ignore robots.txt; verify and rate-limit at the edge if that matters to you. |
| Google-Extended | Google | training | Google-Extended | honors | not applicable: makes no HTTP requests | A robots.txt policy token, NOT a crawler. It makes no requests and never appears in logs; disallowing it opts your content out of Gemini/Vertex training while leaving Google Search crawling untouched. |
| GoogleOther | Google | search | GoogleOther | honors | Google IP ranges at gstatic.com/ipranges/goog.json + reverse DNS to googlebot.com / google.com | Generic Google crawler used by various teams for research and product development. |
| Google-CloudVertexBot / Gemini agents | Google | inference | Google-CloudVertexBot | honors | Google IP ranges (gstatic.com/ipranges) | Fetches site content on behalf of Vertex AI agents built by site owners. |
| Bingbot | Microsoft | search | Bingbot | honors | reverse DNS to search.msn.com + forward-confirm; Bing publishes a verification tool and IP list | Powers Bing and, by extension, Copilot search grounding. |
| Amazonbot | Amazon | search | Amazonbot | honors | reverse DNS to crawl.amazonbot.amazon + Amazon's published ranges | Improves Alexa answers and supports Amazon's AI products. |
| Applebot-Extended | Apple | training | Applebot-Extended | honors | not applicable: policy token; the underlying Applebot verifies via reverse DNS to applebot.apple.com | Policy token: disallowing it opts content out of Apple Intelligence / foundation-model training without blocking Applebot's search crawling. |
| Meta-ExternalAgent | Meta | training | meta-externalagent | honors | Meta publishes crawler IP ranges; confirm against those | Crawls content to train Meta's Llama models and AI products. |
| CCBot | Common Crawl | training | CCBot | honors | Common Crawl publishes its crawler IP ranges | Builds the open Common Crawl corpus that many model trainers ingest downstream. Blocking CCBot blocks an upstream training-data source for the whole ecosystem. |
| Bytespider | ByteDance | training | Bytespider | ignores | no authoritative published range file; treat unverified Bytespider traffic with suspicion | Has a reputation for aggressive crawling and inconsistent robots.txt adherence. Rate-limit at the edge if it causes load. |
| DuckAssistBot | DuckDuckGo | inference | DuckAssistBot | honors | DuckDuckGo publishes bot details; confirm against those | Fetches content for DuckDuckGo's AI assist answers. |
| OAI-AdsBot | OpenAI | | OAI-AdsBot | honors | published IP ranges (OpenAI publishes per-bot range files); confirm against the OpenAI bots documentation | Validates ad landing pages for OpenAI's advertising products. Listed alongside GPTBot/OAI-SearchBot/ChatGPT-User in OpenAI's bots documentation. |
| Google-Agent | Google | inference | Google-Agent | ignores | Google IP ranges (user-triggered-agents.json) + reverse DNS to google.com / googleusercontent.com | User-triggered fetcher used by agents hosted on Google infrastructure to navigate the web and perform actions on a user's request (for example, Project Mariner / Gemini Agent). As a user-triggered fetcher, Google documents that it generally ignores robots.txt rules. |
| MistralAI-User | Mistral AI | inference | MistralAI-User | honors | published IP ranges at mistral.ai/mistralai-user-ips.json | Fetches a page in real time when a Mistral (Le Chat) user's request references it. Per Mistral, the MistralAI-User token governs which sites these user-initiated requests can be made to. |
| Diffbot | Diffbot | data-aggregation | Diffbot | honors | no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents that Crawlbot adheres to robots.txt by default. | Diffbot's Crawlbot extracts and structures web content into a knowledge graph sold to customers (market intelligence, e-commerce, AI training). Registered as a 'data-provider' (Agents Welcome taxonomy extension). Diffbot documents that crawls adhere to robots.txt (disallow + crawl-delay) by default. |
| Diffbot-User | Diffbot | inference | Diffbot-User | honors | no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents the token for on-behalf-of fetches. | Used for requests made on behalf of human users browsing URLs through Diffbot software, as distinct from Diffbot's proactive Crawlbot. Diffbot documents both 'Diffbot' and 'Diffbot-User' as robots.txt user-agents. |
| ImagesiftBot | ImageSift (Hive) | data-aggregation | ImagesiftBot | honors | verify by user-agent + edge controls; ImageSift documents robots.txt adherence (incl. crawl-delay) and Googlebot-directive fallback. No operator-published IP-range file confirmed. | Crawls the web for publicly available images, analyzing and indexing them to power ImageSift's web-intelligence products. Operated by ImageSift (a Hive product). Registered as a 'data-provider' (Agents Welcome taxonomy extension). |
| ICC-Crawler | NICT (National Institute of Information and Communications Technology) | training | ICC-Crawler | honors | verify by user-agent + edge controls; the ai.robots.txt registry records respects-robots = Yes. No operator-published IP-range file confirmed. | Crawls data to train and support AI technologies; NICT (Japan) uses the collected data for AI and may provide it to third parties, including commercial companies. Token and operator recorded in the ai.robots.txt machine-readable registry. |
| cohere-ai | Cohere | inference | cohere-ai | ignores | verify by user-agent + edge controls; no operator-published IP-range file confirmed and robots.txt adherence is unclear per the registry. | Retrieves data to provide responses to user-initiated prompts (Cohere products). Token and operator recorded in the ai.robots.txt machine-readable registry; the registry marks robots.txt respect as 'Unclear at this time'. |
| Meta-WebIndexer | Meta | search | Meta-WebIndexer | ignores | Meta publishes crawler IP ranges; confirm against those. Meta documents that allowing Meta-WebIndexer in robots.txt lets Meta AI cite and link your content. | Per Meta's documentation, the Meta-WebIndexer crawler navigates the web to improve Meta AI search result quality; allowing it in robots.txt helps Meta AI cite and link your content in its responses. Token and operator-doc reference recorded in the ai.robots.txt machine-readable registry. |
| ChatGPT Atlas (agent mode) | OpenAI | agentic-browsing | none (agentic browser; no published robots.txt token) | ignores | no stable user-agent and (per OpenAI enterprise docs) no IP allowlist; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token. Treat as user-driven browser traffic. | OpenAI's ChatGPT Atlas browser (launched 2025-10-21) embeds ChatGPT into web navigation; its 'agent mode' takes actions on the user's behalf inside the browser. As a local Chromium-based browser it presents like ordinary browser traffic with no stable AI user-agent token. Included here per the agentic-browser taxonomy, verifiable by IP/signature only. |
| Perplexity Comet (assistant/agent) | Perplexity | agentic-browsing | none (agentic browser; no published robots.txt token) | ignores | no stable user-agent and no verifiable identity layer; Comet runs inside the user's browser session and presents like ordinary Chromium traffic. Distinct from PerplexityBot/Perplexity-User (which are cloud bots verifiable by IP range + perplexity.ai in the UA). | Perplexity's Comet is a Chromium-based browser fork that runs locally and performs multi-tab agentic actions inside the user's session. Unlike Perplexity's cloud crawlers, it has no verifiable identity layer at the network level. Included here per the agentic-browser taxonomy, verifiable by IP/signature only. |
| OpenAI Operator (Computer-Using Agent) | OpenAI | agentic-browsing | none (agentic browser/agent; no published robots.txt token) | ignores | no stable user-agent token; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token. | OpenAI's Operator (released 2025-01-23) was a browsing agent powered by the Computer-Using Agent (CUA) model that performed online tasks in a browser on the user's behalf. It was deprecated after the release of ChatGPT agent and shut down on 2025-08-31. Retained here as a deprecated agentic-browser record for history/freshness. |
| Project Mariner | Google | agentic-browsing | none (agentic browser; no published robots.txt token; successor Google-Agent carries a token) | ignores | no stable user-agent token for the standalone product; identifiable only by IP/signature/behavior. Its functionality moved into the Google-Agent fetcher, which is verifiable via user-triggered-agents.json + reverse DNS to google.com. | Google's Project Mariner (introduced Dec 2024 with Gemini 2.0) was an experimental web-browsing agent that navigated pages and took actions on a user's behalf via a Chrome extension. Google shut it down as a standalone product on 2026-05-04; its features moved into the Gemini API and Gemini Agent (see the Google-Agent record). Retained here as a deprecated agentic-browser record for history/freshness. |

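
Several rows above name "reverse DNS + forward-confirm" as the verification method. A minimal sketch of that check follows, assuming the hostname suffixes quoted in the Bingbot, GoogleOther, Amazonbot, and Applebot rows. The `TRUSTED_SUFFIXES` mapping and the function names are hypothetical, and the resolver functions are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket

# Suffixes copied from the registry rows; the mapping itself is illustrative.
TRUSTED_SUFFIXES = {
    "Bingbot": (".search.msn.com",),
    "GoogleOther": (".googlebot.com", ".google.com"),
    "Amazonbot": (".crawl.amazonbot.amazon",),
    "Applebot": (".applebot.apple.com",),
}


def hostname_matches(hostname, suffixes):
    """True if hostname ends with one of the trusted dotted suffixes."""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(suffix) for suffix in suffixes)


def verify_by_reverse_dns(ip, suffixes,
                          reverse=socket.gethostbyaddr,
                          forward=socket.getaddrinfo):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then
    resolve the PTR name and require that it maps back to the same IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # no PTR record or resolver failure
        return False
    if not hostname_matches(hostname, suffixes):
        return False
    try:
        infos = forward(hostname, None)
    except OSError:
        return False
    wanted = ipaddress.ip_address(ip)
    for info in infos:
        candidate = info[4][0].split("%")[0]  # drop any IPv6 scope id
        if ipaddress.ip_address(candidate) == wanted:
            return True
    return False
```

The forward step matters because anyone who controls the reverse zone for their own address space can publish a PTR record claiming any hostname; only the operator can make that hostname resolve back to the requesting IP.
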
Why no IP addresses? Operator IP ranges change, and printing a stale list is worse than printing none. Each row links to the operator's authoritative method instead: a published range file or reverse DNS. For cryptographic proof of identity, see Web Bot Auth.
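
Because ranges go stale, one reasonable pattern is to fetch the operator's range file at check time (or on a short cache) rather than hard-coding prefixes. The sketch below assumes the file uses a `{"prefixes": [{"ipv4Prefix": ...}, {"ipv6Prefix": ...}]}` layout like Google's goog.json; other operators may use a different schema, so treat the parsing as an assumption to confirm against each file. The function names are hypothetical.

```python
import ipaddress
import json
import urllib.request


def parse_prefixes(doc):
    """Extract networks from a document shaped like
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]} (assumed schema)."""
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def fetch_ranges(url, timeout=10.0):
    """Download a range file and parse it; the caller supplies the operator's URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prefixes(json.load(response))


def ip_in_ranges(ip, networks):
    """True if ip falls inside any of the parsed networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    # Inline sample using documentation-reserved prefixes, not real crawler ranges.
    sample = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
                           {"ipv6Prefix": "2001:db8::/32"}]}
    networks = parse_prefixes(sample)
    print(ip_in_ranges("192.0.2.77", networks))
    print(ip_in_ranges("198.51.100.1", networks))
```

In practice the result of `fetch_ranges` would be cached for a bounded period and refreshed on a schedule, so a request is never judged against a list older than the cache lifetime.
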
