GET /api/crawlers · 41 bots · updated 2026-07-06

The AI Crawler Registry

Every AI crawler and agent user-agent worth knowing, with what it's for, the token it answers to in robots.txt, whether it honors your rules, and — the part most lists skip — how to verify it, because user-agent strings are trivially spoofed.

Crawler · Operator	Purpose	robots.txt token	robots.txt compliance	How to verify
ClaudeBot · Anthropic	training	`ClaudeBot`	honors	reverse DNS (Anthropic does not publish an IP-range file; confirm the PTR resolves to an Anthropic-controlled host)
notes Crawls content used to train Claude. Honors robots.txt and crawl-delay.
Claude-User · Anthropic	inference	`Claude-User`	honors	reverse DNS to an Anthropic host
notes Fetches a page in real time when a Claude user's prompt references it. User-initiated, not bulk crawling.
Claude-SearchBot · Anthropic	search	`Claude-SearchBot`	honors	reverse DNS to an Anthropic host
notes Indexes pages to power Claude's search results.
GPTBot · OpenAI	training	`GPTBot`	honors	published IP ranges at openai.com/gptbot.json
notes Crawls content that may be used to train OpenAI models.
OAI-SearchBot · OpenAI	search	`OAI-SearchBot`	honors	published IP ranges (openai.com publishes searchbot ranges)
notes Surfaces and links sites in ChatGPT search. Does not train models.
ChatGPT-User · OpenAI	inference	`ChatGPT-User`	honors	published IP ranges (openai.com/chatgpt-user.json)
notes User-triggered fetch when a ChatGPT user or a GPT action requests a specific URL.
PerplexityBot · Perplexity	search	`PerplexityBot`	honors	published IP ranges (perplexity.ai publishes perplexitybot ranges)
notes Indexes pages so they can be cited as sources in Perplexity answers.
Perplexity-User · Perplexity	inference	`Perplexity-User`	ignores	published IP ranges (perplexity.ai)
notes Real-time fetch in response to a user question. Per Perplexity, user-initiated fetches are not treated as automated crawling and may ignore robots.txt — verify and rate-limit at the edge if that matters to you.
Google-Extended · Google	training	`Google-Extended`	honors	not applicable: makes no HTTP requests
notes A robots.txt policy token, NOT a crawler. It makes no requests and never appears in logs; disallowing it opts your content out of Gemini/Vertex training while leaving Google Search crawling untouched.
GoogleOther · Google	search	`GoogleOther`	honors	Google IP ranges at gstatic.com/ipranges/goog.json + reverse DNS to googlebot.com / google.com
notes Generic Google crawler used by various teams for research and product development.
Google-CloudVertexBot / Gemini agents · Google	inference	`Google-CloudVertexBot`	honors	Google IP ranges (gstatic.com/ipranges)
notes Fetches site content on behalf of Vertex AI agents built by site owners.
Bingbot · Microsoft	search	`Bingbot`	honors	reverse DNS to search.msn.com + forward-confirm; Bing publishes a verification tool and IP list
notes Powers Bing and, by extension, Copilot search grounding.
Amazonbot · Amazon	search	`Amazonbot`	honors	reverse DNS to crawl.amazonbot.amazon + Amazon's published ranges
notes Improves Alexa answers and supports Amazon's AI products.
Applebot-Extended · Apple	training	`Applebot-Extended`	honors	not applicable (policy token); the underlying Applebot verifies via reverse DNS to applebot.apple.com
notes Policy token: disallowing it opts content out of Apple Intelligence / foundation-model training without blocking Applebot's search crawling.
Meta-ExternalAgent · Meta	training	`meta-externalagent`	honors	Meta publishes crawler IP ranges; confirm against those
notes Crawls content to train Meta's Llama models and AI products.
CCBot · Common Crawl	training	`CCBot`	honors	Common Crawl publishes its crawler IP ranges
notes Builds the open Common Crawl corpus that many model trainers ingest downstream. Blocking CCBot blocks an upstream training-data source for the whole ecosystem.
Bytespider · ByteDance	training	`Bytespider`	ignores	no authoritative published range file; treat unverified Bytespider traffic with suspicion
notes Has a reputation for aggressive crawling and inconsistent robots.txt adherence. Rate-limit at the edge if it causes load.
DuckAssistBot · DuckDuckGo	inference	`DuckAssistBot`	honors	DuckDuckGo publishes bot details; confirm against those
notes Fetches content for DuckDuckGo's AI assist answers.
OAI-AdsBot · OpenAI	ad-verification	`OAI-AdsBot`	honors	published IP ranges (OpenAI publishes per-bot range files); confirm against the OpenAI bots documentation
notes Validates ad landing pages for OpenAI's advertising products. Listed alongside GPTBot/OAI-SearchBot/ChatGPT-User in OpenAI's bots documentation.
Google-Agent · Google	inference	`Google-Agent`	ignores	Google IP ranges (user-triggered-agents.json) + reverse DNS to google.com / googleusercontent.com
notes User-triggered fetcher used by agents hosted on Google infrastructure to navigate the web and perform actions on a user's request (for example, Project Mariner / Gemini Agent). As a user-triggered fetcher, Google documents that it generally ignores robots.txt rules.
MistralAI-User · Mistral AI	inference	`MistralAI-User`	honors	published IP ranges at mistral.ai/mistralai-user-ips.json
notes Fetches a page in real time when a Mistral (Le Chat) user's request references it. Per Mistral, the MistralAI-User token governs which sites these user-initiated requests can be made to.
Diffbot · Diffbot	data-aggregation	`Diffbot`	honors	no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls.
notes Diffbot's Crawlbot extracts and structures web content into a knowledge graph sold to customers (market intelligence, e-commerce, AI training). Registered as a 'data-provider' (Agents Welcome taxonomy extension). Diffbot documents that Crawlbot adheres to robots.txt (disallow + crawl-delay) by default.
Diffbot-User · Diffbot	inference	`Diffbot-User`	honors	no operator-published authoritative IP-range file confirmed; verify by user-agent + edge controls. Diffbot documents the token for on-behalf-of fetches.
notes Used for requests made on behalf of human users browsing URLs through Diffbot software, as distinct from Diffbot's proactive Crawlbot. Diffbot documents both 'Diffbot' and 'Diffbot-User' as robots.txt user-agents.
ImagesiftBot · ImageSift (Hive)	data-aggregation	`ImagesiftBot`	honors	verify by user-agent + edge controls; ImageSift documents robots.txt adherence (incl. crawl-delay) and Googlebot-directive fallback. No operator-published IP-range file confirmed.
notes Crawls the web for publicly available images, analyzing and indexing them to power ImageSift's web-intelligence products. Operated by ImageSift (a Hive product). Registered as a 'data-provider' (Agents Welcome taxonomy extension).
ICC-Crawler · NICT (National Institute of Information and Communications Technology)	training	`ICC-Crawler`	honors	verify by user-agent + edge controls; the ai.robots.txt registry records respects-robots = Yes. No operator-published IP-range file confirmed.
notes Crawls data to train and support AI technologies; NICT (Japan) uses the collected data for AI and may provide it to third parties, including commercial companies. Token and operator recorded in the ai.robots.txt machine-readable registry.
cohere-ai · Cohere	inference	`cohere-ai`	unclear	verify by user-agent + edge controls; no operator-published IP-range file confirmed and robots.txt adherence is unclear per the registry.
notes Retrieves data to provide responses to user-initiated prompts (Cohere products). Token and operator recorded in the ai.robots.txt machine-readable registry; the registry marks robots.txt respect as 'Unclear at this time'.
Meta-WebIndexer · Meta	search	`Meta-WebIndexer`	unclear	Meta publishes crawler IP ranges; confirm against those. Meta documents that allowing Meta-WebIndexer in robots.txt lets Meta AI cite and link your content.
notes Per Meta's documentation, the Meta-WebIndexer crawler navigates the web to improve Meta AI search result quality; allowing it in robots.txt helps Meta AI cite and link your content in its responses. Token and operator-doc reference recorded in the ai.robots.txt machine-readable registry.
ChatGPT Atlas (agent mode) · OpenAI	agentic-browsing	`(none: agentic browser; no published robots.txt token)`	ignores	no stable user-agent and (per OpenAI enterprise docs) no IP allowlist; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token. Treat as user-driven browser traffic.
notes OpenAI's ChatGPT Atlas browser (launched 2025-10-21) embeds ChatGPT into web navigation; its 'agent mode' takes actions on the user's behalf inside the browser. As a local Chromium-based browser it presents like ordinary browser traffic with no stable AI user-agent token — included here per the agentic-browser taxonomy, verifiable by IP/signature only.
Perplexity Comet (assistant/agent) · Perplexity	agentic-browsing	`(none: agentic browser; no published robots.txt token)`	ignores	no stable user-agent and no verifiable identity layer; Comet runs inside the user's browser session and presents like ordinary Chromium traffic. Distinct from PerplexityBot/Perplexity-User (which are cloud bots verifiable by IP range + perplexity.ai in the UA).
notes Perplexity's Comet is a Chromium-based browser fork that runs locally and performs multi-tab agentic actions inside the user's session. Unlike Perplexity's cloud crawlers, it has no verifiable identity layer at the network level. Included here per the agentic-browser taxonomy; identifiable, if at all, only by IP, signature, or behavior.
OpenAI Operator (Computer-Using Agent) · OpenAI	agentic-browsing	`(none: agentic browser/agent; no published robots.txt token)`	ignores	no stable user-agent token; an agentic browser is identifiable only by IP/signature/behavior, not by a UA token.
notes OpenAI's Operator (released 2025-01-23) was a browsing agent powered by the Computer-Using Agent (CUA) model that performed online tasks in a browser on the user's behalf. It was deprecated after the release of ChatGPT agent and shut down on 2025-08-31. Retained here as a deprecated agentic-browser record for history/freshness.
Project Mariner · Google	agentic-browsing	`(none: agentic browser; no published robots.txt token; successor Google-Agent carries a token)`	ignores	no stable user-agent token for the standalone product; identifiable only by IP/signature/behavior. Its functionality moved into the Google-Agent fetcher, which is verifiable via user-triggered-agents.json + reverse DNS to google.com.
notes Google's Project Mariner (introduced Dec 2024 with Gemini 2.0) was an experimental web-browsing agent that navigated pages and took actions on a user's behalf via a Chrome extension. Google shut it down as a standalone product on 2026-05-04; its features moved into the Gemini API and Gemini Agent (see the Google-Agent record). Retained here as a deprecated agentic-browser record for history/freshness.
Applebot · Apple	search	`Applebot`	honors	Apple publishes Applebot IP ranges and documents reverse-DNS verification; confirm the source IP resolves to an Apple-controlled host (see documentation_url).
notes Apple's crawler for Siri and Spotlight Suggestions. The separate token Applebot-Extended is used only to opt out of Apple-Intelligence training without losing search visibility.
meta-externalfetcher · Meta	inference	`meta-externalfetcher`	honors	Meta documents its crawlers and honors robots.txt; match the UA and a Meta-owned IP (see documentation_url).
notes Fetches individual links at a user's request to support Meta AI task completion. It is user-triggered, not bulk crawling.
meta-externalads · Meta	ad-verification	`meta-externalads`	honors	Meta-documented crawler; match the UA and a Meta-owned IP (see documentation_url).
notes Crawls the web to improve Meta's advertising and other business products.
AI2Bot · Allen Institute for AI (Ai2)	training	`AI2Bot`	honors	Ai2 documents the crawler and states its UA string may be used to filter or reject traffic; there is no published IP file, so match the UA (see documentation_url).
notes Collects web content to train Ai2's open language models.
anthropic-ai · Anthropic	training	`anthropic-ai`	honors	Legacy token; Anthropic's current crawler is ClaudeBot. Anthropic honors robots.txt and documents its crawlers (see documentation_url).
notes Anthropic's earlier training user-agent, widely blocked in AI robots.txt files. Anthropic's current, documented training crawler is ClaudeBot — prefer targeting ClaudeBot in new rules.
Bravebot · Brave	search	`Bravebot`	honors	Brave documents its crawler and honors robots.txt; match the UA and a Brave-owned IP (see documentation_url).
notes Crawls to build the independent Brave Search index, which also grounds Brave's AI answers.
kagi-fetcher · Kagi	inference	`kagi-fetcher`	honors	Kagi honors robots.txt and documents its bots; match the UA (see documentation_url).
notes Fetches pages on demand for Kagi's assistant and summarizer at a user's request; not bulk crawling.
bedrockbot · Amazon	inference	`bedrockbot`	honors	Amazon honors robots.txt and documents Bedrock's web crawling; match the UA and an AWS-owned IP (see documentation_url).
notes Fetches web pages for Amazon Bedrock knowledge bases and web-data connectors at a customer's request; retrieval, not bulk training.
cohere-training-data-crawler · Cohere	training	`cohere-training-data-crawler`	honors	Cohere honors robots.txt and documents its crawlers; match the UA (see documentation_url). Companion to the cohere-ai token.
notes Cohere's crawler for gathering web content used to train and improve its models.
DuckDuckBot · DuckDuckGo	search	`DuckDuckBot`	honors	DuckDuckGo publishes DuckDuckBot IP addresses and documents the crawler; verify against the published list (see documentation_url). Companion to DuckAssistBot.
notes DuckDuckGo's traditional search crawler. The separate DuckAssistBot token serves its AI-assist features.
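
The robots.txt tokens in the table can be exercised locally before rules are deployed. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` and a hypothetical policy that disallows three training tokens (GPTBot, ClaudeBot, Google-Extended) while allowing everything else. The policy text, the `check_tokens` helper, and the example URL are illustrative and not taken from any operator's documentation. It shows how one standards-following parser reads the file; it says nothing about whether a given bot actually complies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block three training tokens, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def check_tokens(robots_txt, tokens, url="https://example.com/article"):
    """Return {token: bool} saying whether each token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


result = check_tokens(
    ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Google-Extended", "Bingbot"]
)
for token, allowed in result.items():
    print(f"{token}: {'allowed' if allowed else 'disallowed'}")
```

One caveat with this particular parser: it matches tokens by case-insensitive substring, so a group written for `Applebot` would also apply to `Applebot-Extended` here. Real crawlers may match differently, so treat the result as a sanity check rather than a guarantee.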

Why no IP addresses? Operator IP ranges change, and printing a stale list is worse than printing none. Each row links to the operator's authoritative method instead: a published range file or reverse DNS. For cryptographic proof of identity, see Web Bot Auth.
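
The two methods named above, a published range file and reverse DNS, can be sketched as follows. This is a minimal illustration, not any operator's reference implementation. The function names are hypothetical, and the CIDR list is assumed to have been extracted already from the operator's range file (file formats differ between operators and are not modeled here). The sample IPs and hostnames are invented values from documentation address ranges. The reverse-DNS check forward-confirms the PTR name, because a PTR record on its own is controlled by whoever owns the IP block.

```python
import ipaddress
import socket


def ip_in_published_ranges(ip, cidrs):
    """True if ip falls inside any CIDR block taken from a range file."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False


def _reverse_lookup(ip):
    # PTR lookup: returns the primary hostname for the address.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    # A/AAAA lookup: returns every address the hostname resolves to.
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def verify_reverse_dns(ip, allowed_suffixes,
                       reverse=_reverse_lookup, forward=_forward_lookup):
    """Forward-confirmed reverse DNS against operator-controlled domains."""
    target = ipaddress.ip_address(ip)
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    # Require an exact match or a true subdomain, never a bare substring.
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False
    try:
        return any(ipaddress.ip_address(a) == target
                   for a in forward(hostname))
    except (OSError, ValueError):
        return False


# Offline demo with stubbed lookups (invented data, no network access).
fake_ptr = {"203.0.113.7": "crawler-7.search.msn.com"}
fake_a = {"crawler-7.search.msn.com": ["203.0.113.7"]}
verified = verify_reverse_dns(
    "203.0.113.7", ["search.msn.com"],
    reverse=fake_ptr.__getitem__, forward=fake_a.__getitem__,
)
print("reverse DNS verified:", verified)
print("in range:", ip_in_published_ranges("192.0.2.10", ["192.0.2.0/24"]))
```

Passing the lookups as parameters keeps the check testable without network access. In production the defaults use the system resolver, and results are worth caching, since a DNS round trip per request adds latency.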