GET /api/crawlers · 31 bots · updated 2026-06-15

The AI Crawler Registry

Every AI crawler and agent user-agent worth knowing, with what it's for, the token it answers to in robots.txt, whether it honors your rules, and — the part most lists skip — how to verify it, because user-agent strings are trivially spoofed.

Crawler | Operator | Purpose | robots.txt token | robots.txt | How to verify
ClaudeBot | Anthropic | training | ClaudeBot | honors | reverse DNS (Anthropic does not publish an IP-range file; confirm the PTR resolves to an Anthropic-controlled host)
notes Crawls content used to train Claude. Honors robots.txt and crawl-delay.
Claude-User | Anthropic | inference | Claude-User | honors | reverse DNS to an Anthropic host
notes Fetches a page in real time when a Claude user's prompt references it. User-initiated, not bulk crawling.
Claude-SearchBot | Anthropic | search | Claude-SearchBot | honors | reverse DNS to an Anthropic host
notes Indexes pages to power Claude's search results.
GPTBot | OpenAI | training | GPTBot | honors | published IP ranges at openai.com/gptbot-ranges.json
notes Crawls content that may be used to train OpenAI models.
OAI-SearchBot | OpenAI | search | OAI-SearchBot | honors | published IP ranges (openai.com publishes searchbot ranges)
notes Surfaces and links sites in ChatGPT search. Does not train models.
ChatGPT-User | OpenAI | inference | ChatGPT-User | honors | published IP ranges (openai.com/chatgpt-user.json)
notes User-triggered fetch when a ChatGPT user or a GPT action requests a specific URL.
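For rows that point to a published IP-range file, the check reduces to "is the connecting IP inside one of the listed CIDR blocks?". The sketch below is a minimal illustration, not any operator's official tooling. It assumes the file is JSON with a top-level `prefixes` list whose entries carry an `ipv4Prefix` or `ipv6Prefix` key; that shape is an assumption, so confirm it against the operator's actual file. The function names `extract_prefixes` and `ip_in_ranges` are hypothetical.

```python
import ipaddress
import json


def extract_prefixes(doc: dict) -> list:
    """Collect CIDR networks from a range document.

    ASSUMPTION: the document looks like
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}.
    Adjust the key names if the operator's file differs.
    """
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_ranges(ip: str, networks) -> bool:
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(addr in net for net in networks)


if __name__ == "__main__":
    # Documentation-only address blocks, used here as stand-ins.
    sample = json.loads(
        '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},'
        ' {"ipv6Prefix": "2001:db8::/32"}]}'
    )
    nets = extract_prefixes(sample)
    print(ip_in_ranges("192.0.2.55", nets))
    print(ip_in_ranges("198.51.100.1", nets))
```

Because operator ranges change, fetch the file on a schedule and cache it rather than hard-coding its contents.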
PerplexityBot | Perplexity | search | PerplexityBot | honors | published IP ranges (perplexity.ai publishes perplexitybot ranges)
notes Indexes pages so they can be cited as sources in Perplexity answers.
Perplexity-User | Perplexity | inference | Perplexity-User | ignores | published IP ranges (perplexity.ai)
notes Real-time fetch in response to a user question. Per Perplexity, user-initiated fetches are not treated as automated crawling and may ignore robots.txt — verify and rate-limit at the edge if that matters to you.
Google-Extended | Google | training | Google-Extended | honors | not applicable (makes no HTTP requests)
notes A robots.txt policy token, NOT a crawler. It makes no requests and never appears in logs; disallowing it opts your content out of Gemini/Vertex training while leaving Google Search crawling untouched.
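A policy token of this kind is set like any other robots.txt group. The sketch below uses Python's standard `urllib.robotparser` to show one possible robots.txt that disallows `Google-Extended` while leaving other agents allowed. The robots.txt body and the `can_fetch` helper are illustrative assumptions, and Python's parser matches user-agents by substring, which may not mirror how each operator interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of the Google-Extended policy token,
# leave everything else (including Googlebot) allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse a robots.txt body and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.com/post"
    print(can_fetch(ROBOTS_TXT, "Google-Extended", url))  # False
    print(can_fetch(ROBOTS_TXT, "Googlebot", url))        # True
```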
GoogleOther | Google | search | GoogleOther | honors | Google IP ranges at gstatic.com/ipranges/goog.json + reverse DNS to googlebot.com / google.com
notes Generic Google crawler used by various teams for research and product development.
Google-CloudVertexBot / Gemini agents | Google | inference | Google-CloudVertexBot | honors | Google IP ranges (gstatic.com/ipranges)
notes Fetches site content on behalf of Vertex AI agents built by site owners.
Bingbot | Microsoft | search | Bingbot | honors | reverse DNS to search.msn.com + forward-confirm; Bing publishes a verification tool and IP list
notes Powers Bing and, by extension, Copilot search grounding.
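"Reverse DNS + forward-confirm" means two lookups: resolve the connecting IP to a hostname (PTR), check that the hostname sits under the operator's domain, then resolve that hostname forward and confirm it maps back to the same IP. The following is a minimal sketch using only the standard library. The function names and the injectable `reverse`/`forward` parameters are hypothetical conveniences for testing, not part of any operator's tooling.

```python
import ipaddress
import socket


def hostname_matches(hostname: str, allowed_suffixes: tuple) -> bool:
    """True if hostname equals, or is a subdomain of, an allowed suffix.

    The leading-dot check stops look-alikes such as 'fakesearch.msn.com'
    from matching 'search.msn.com'.
    """
    host = hostname.rstrip(".").lower()
    return any(host == s or host.endswith("." + s) for s in allowed_suffixes)


def verify_fcrdns(ip: str, allowed_suffixes: tuple,
                  reverse=socket.gethostbyaddr,
                  forward=socket.getaddrinfo) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        addr = ipaddress.ip_address(ip)
        hostname, _aliases, _addrs = reverse(ip)  # PTR lookup
    except (ValueError, OSError):
        return False
    if not hostname_matches(hostname, allowed_suffixes):
        return False
    try:
        infos = forward(hostname, None)  # A/AAAA lookup
    except OSError:
        return False
    for info in infos:
        try:
            # info[4] is the sockaddr tuple; its first element is the address.
            if ipaddress.ip_address(info[4][0]) == addr:
                return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    # Performs live DNS lookups; the IP is a documentation address and
    # is expected to fail verification.
    print(verify_fcrdns("192.0.2.10", ("search.msn.com",)))
```

Skipping the forward step is the common mistake: anyone who controls the PTR record for their own IP block can make it claim any hostname.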
Amazonbot | Amazon | search | Amazonbot | honors | reverse DNS to crawl.amazonbot.amazon + Amazon's published ranges
notes Improves Alexa answers and supports Amazon's AI products.
Applebot-Extended | Apple | training | Applebot-Extended | honors | not applicable (policy token); the underlying Applebot verifies via reverse DNS to applebot.apple.com
notes Policy token: disallowing it opts content out of Apple Intelligence / foundation-model training without blocking Applebot's search crawling.
Meta-ExternalAgent | Meta | training | meta-externalagent | honors | Meta publishes crawler IP ranges; confirm against those
notes Crawls content to train Meta's Llama models and AI products.
CCBot | Common Crawl | training | CCBot | honors | Common Crawl publishes its crawler IP ranges
notes Builds the open Common Crawl corpus that many model trainers ingest downstream. Blocking CCBot blocks an upstream training-data source for the whole ecosystem.
Bytespider | ByteDance | training | Bytespider | ignores | no authoritative published range file; treat unverified Bytespider traffic with suspicion
notes Has a reputation for aggressive crawling and inconsistent robots.txt adherence. Rate-limit at the edge if it causes load.
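One common way to rate-limit a crawler is a token bucket kept per claimed user-agent token (or per IP). The sketch below is a generic in-process illustration of that technique, not a description of any particular CDN or edge product. The class name, parameters, and the injectable clock are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self._now = now              # injectable clock, handy for tests
        self.updated = now()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request passes."""
        current = self._now()
        elapsed = current - self.updated
        self.updated = current
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Hypothetical policy: 1 request/second with a burst of 2.
    bucket = TokenBucket(rate_per_sec=1.0, burst=2)
    print([bucket.allow() for _ in range(3)])
```

In practice a rejected request would typically receive HTTP 429, and the buckets would live in whatever shared store the edge layer provides.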
DuckAssistBot | DuckDuckGo | inference | DuckAssistBot | honors | DuckDuckGo publishes bot details; confirm against those
notes Fetches content for DuckDuckGo's AI assist answers.
OAI-AdsBot | OpenAI | ad-verification | OAI-AdsBot | honors | published IP ranges (OpenAI publishes per-bot range files); confirm against the OpenAI bots documentation
notes Validates ad landing pages for OpenAI's advertising products. Listed alongside GPTBot/OAI-SearchBot/ChatGPT-User in OpenAI's bots documentation.
Google-Agent | Google | inference | Google-Agent | ignores | Google IP ranges (user-triggered-agents.json) + reverse DNS to google.com / googleusercontent.com
notes User-triggered fetcher used by agents hosted on Google infrastructure to navigate the web and perform actions on a user's request (for example, Project Mariner / Gemini Agent). As a user-triggered fetcher, Google documents that it generally ignores robots.txt rules.
MistralAI-User | Mistral AI | inference | MistralAI-User | honors | published IP ranges at mistral.ai/mistralai-user-ips.json
notes Fetches a page in real time when a Mistral (Le Chat) user's request references it. Per Mistral, the MistralAI-User token governs which sites these user-initiated requests can be made to.
Diffbot | Diffbot | data-aggregation | Diffbot | honors | no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents that Crawlbot adheres to robots.txt by default.
notes Diffbot's Crawlbot extracts and structures web content into a knowledge graph sold to customers (market intelligence, e-commerce, AI training). Registered as a 'data-provider' (Agents Welcome taxonomy extension). Diffbot documents that crawls adhere to robots.txt (disallow + crawl-delay) by default.
Diffbot-User | Diffbot | inference | Diffbot-User | honors | no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents the token for on-behalf-of fetches.
notes Used for requests made on behalf of human users browsing URLs through Diffbot software, as distinct from Diffbot's proactive Crawlbot. Diffbot documents both 'Diffbot' and 'Diffbot-User' as robots.txt user-agents.
ImagesiftBot | ImageSift (Hive) | data-aggregation | ImagesiftBot | honors | verify by user-agent + edge controls; ImageSift documents robots.txt adherence (incl. crawl-delay) and Googlebot-directive fallback. No operator-published IP-range file confirmed.
notes Crawls the web for publicly available images, analyzing and indexing them to power ImageSift's web-intelligence products. Operated by ImageSift (a Hive product). Registered as a 'data-provider' (Agents Welcome taxonomy extension).
ICC-Crawler | NICT (National Institute of Information and Communications Technology) | training | ICC-Crawler | honors | verify by user-agent + edge controls; the ai.robots.txt registry records respects-robots = Yes. No operator-published IP-range file confirmed.
notes Crawls data to train and support AI technologies; NICT (Japan) uses the collected data for AI and may provide it to third parties, including commercial companies. Token and operator recorded in the ai.robots.txt machine-readable registry.
cohere-ai | Cohere | inference | cohere-ai | ignores | verify by user-agent + edge controls; no operator-published IP-range file confirmed and robots.txt adherence is unclear per the registry.
notes Retrieves data to provide responses to user-initiated prompts (Cohere products). Token and operator recorded in the ai.robots.txt machine-readable registry; the registry marks robots.txt respect as 'Unclear at this time'.
Meta-WebIndexer | Meta | search | Meta-WebIndexer | ignores | Meta publishes crawler IP ranges; confirm against those. Meta documents that allowing Meta-WebIndexer in robots.txt lets Meta AI cite and link your content.
notes Per Meta's documentation, the Meta-WebIndexer crawler navigates the web to improve Meta AI search result quality; allowing it in robots.txt helps Meta AI cite and link your content in its responses. Token and operator-doc reference recorded in the ai.robots.txt machine-readable registry.
ChatGPT Atlas (agent mode) | OpenAI | agentic-browsing | none (agentic browser; no published robots.txt token) | ignores | no stable user-agent and (per OpenAI enterprise docs) no IP allowlist; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token. Treat as user-driven browser traffic.
notes OpenAI's ChatGPT Atlas browser (launched 2025-10-21) embeds ChatGPT into web navigation; its 'agent mode' takes actions on the user's behalf inside the browser. As a local Chromium-based browser it presents like ordinary browser traffic with no stable AI user-agent token. Included here per the agentic-browser taxonomy; identifiable only by IP/signature/behavior.
Perplexity Comet (assistant/agent) | Perplexity | agentic-browsing | none (agentic browser; no published robots.txt token) | ignores | no stable user-agent and no verifiable identity layer; Comet runs inside the user's browser session and presents like ordinary Chromium traffic. Distinct from PerplexityBot/Perplexity-User (which are cloud bots verifiable by IP range + perplexity.ai in the UA).
notes Perplexity's Comet is a Chromium-based browser fork that runs locally and performs multi-tab agentic actions inside the user's session. Unlike Perplexity's cloud crawlers, it has no verifiable identity layer at the network level. Included here per the agentic-browser taxonomy; distinguishable, if at all, only by IP/signature/behavior.
OpenAI Operator (Computer-Using Agent) | OpenAI | agentic-browsing | none (agentic browser/agent; no published robots.txt token) | ignores | no stable user-agent token; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token.
notes OpenAI's Operator (released 2025-01-23) was a browsing agent powered by the Computer-Using Agent (CUA) model that performed online tasks in a browser on the user's behalf. It was deprecated after the release of ChatGPT agent and shut down on 2025-08-31. Retained here as a deprecated agentic-browser record for history/freshness.
Project Mariner | Google | agentic-browsing | none (agentic browser; no published robots.txt token; successor Google-Agent carries a token) | ignores | no stable user-agent token for the standalone product; identifiable only by IP/signature/behavior. Its functionality moved into the Google-Agent fetcher, which is verifiable via user-triggered-agents.json + reverse DNS to google.com.
notes Google's Project Mariner (introduced Dec 2024 with Gemini 2.0) was an experimental web-browsing agent that navigated pages and took actions on a user's behalf via a Chrome extension. Google shut it down as a standalone product on 2026-05-04; its features moved into the Gemini API and Gemini Agent (see the Google-Agent record). Retained here as a deprecated agentic-browser record for history/freshness.

why no IP addresses? Operator IP ranges change; printing a stale list is worse than none. Each row links to the operator's authoritative method instead — published range file or reverse DNS. For cryptographic proof of identity, see Web Bot Auth.
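Since the user-agent string is only a claim, a reasonable first step is to map the claimed token to the verification method its row names, then run that check against the connecting IP. The sketch below shows one way to do that lookup. `VERIFY_METHOD` is a small illustrative subset transcribed from the rows above, and the function names and method labels (`ip-ranges`, `reverse-dns`) are hypothetical.

```python
from typing import Optional, Tuple

# Illustrative subset: lowercase robots.txt token -> verification method
# named in that crawler's row. Not the full registry.
VERIFY_METHOD = {
    "gptbot": "ip-ranges",
    "oai-searchbot": "ip-ranges",
    "chatgpt-user": "ip-ranges",
    "perplexitybot": "ip-ranges",
    "claudebot": "reverse-dns",
    "claude-user": "reverse-dns",
    "claude-searchbot": "reverse-dns",
    "bingbot": "reverse-dns",
}


def claimed_token(user_agent: str) -> Optional[str]:
    """Return the registry token the UA string claims to be, if any."""
    ua = user_agent.lower()
    # Longest tokens first, so a longer token wins over a shorter one
    # that happens to be contained in it.
    for token in sorted(VERIFY_METHOD, key=len, reverse=True):
        if token in ua:
            return token
    return None


def verification_plan(user_agent: str) -> Optional[Tuple[str, str]]:
    """Map a UA string to (claimed token, method to verify the claim)."""
    token = claimed_token(user_agent)
    if token is None:
        return None
    return token, VERIFY_METHOD[token]


if __name__ == "__main__":
    # Made-up UA string for illustration only.
    print(verification_plan("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

A match here proves nothing by itself; it only selects which check (range file or reverse DNS) to run before trusting the request.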