CCBot

Common Crawl · training

name
CCBot
operator
Common Crawl
purpose
training
ua_substring
CCBot
robots_token
CCBot
respects_robots
yes
verify
Common Crawl publishes its crawler IP ranges
notes
Builds the open Common Crawl corpus that many model trainers ingest downstream. Blocking CCBot blocks an upstream training-data source for the whole ecosystem.
canonical_name
CCBot
user_agent_token
CCBot
ua_full
CCBot/2.0 (https://commoncrawl.org/faq/)
bot_type
training
bot_type_extension
opt_out_mechanism
robots.txt disallow (User-agent: CCBot)
published_ip_range_url
https://index.commoncrawl.org/ccbot.json
asn
— verify-against-primary-at-build ↗ https://index.commoncrawl.org/ccbot.json
reverse_dns_suffix
.crawl.commoncrawl.org
supports_web_bot_auth
— verify-against-primary-at-build ↗ https://commoncrawl.org/ccbot
signature_agent_domain
— verify-against-primary-at-build ↗ https://commoncrawl.org/ccbot
jwks_url
— verify-against-primary-at-build ↗ https://commoncrawl.org/ccbot
verification_methods
published-IP-range, reverse-DNS
crawl_traffic_share
— verify-against-primary-at-build ↗ https://radar.cloudflare.com/bots
targeted_content_type
HTML, text
documentation_url
https://commoncrawl.org/ccbot
first_seen_date
— verify-against-primary-at-build ↗ https://commoncrawl.org/ccbot
last_verified_date
2026-06-15
block_vs_allow_recommendation
conditional: CCBot is an upstream open-corpus crawler. Allowing it feeds many downstream trainers (broad reach); blocking keeps you out of the Common Crawl corpus. Either way, there is no direct referral.
citation_referral_value
low (open training corpus; no direct citation or referral)
cloudflare_verified_category
— verify-against-primary-at-build ↗ https://radar.cloudflare.com/bots/directory/ccbot
status
active
triples
["CCBot","operated_by","Common Crawl"] ["CCBot","has_bot_type","training"] ["CCBot","verified_via","published-IP-range"] ["CCBot","verified_via","reverse-DNS"]
attribute_sources
{"claims":["ua_full","user_agent_token","robots_token","published_ip_range_url","reverse_dns_suffix","documentation_url"],"source":"https://commoncrawl.org/ccbot","last_verified":"2026-06-15"}
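
The fields above (ua_substring, opt_out_mechanism, verification_methods, reverse_dns_suffix) can be combined into a request check. The following is a minimal Python sketch, not an official Common Crawl tool. All function names are hypothetical, the CIDR ranges are documentation placeholders rather than real CCBot addresses, and the layout of the published ccbot.json is not assumed here: the caller is expected to load the real ranges from published_ip_range_url and pass them in as a list of CIDR strings.

```python
import ipaddress
from typing import Iterable
from urllib.robotparser import RobotFileParser

UA_SUBSTRING = "CCBot"                  # ua_substring from this record
RDNS_SUFFIX = ".crawl.commoncrawl.org"  # reverse_dns_suffix from this record

# ASSUMPTION: placeholder documentation ranges (RFC 5737 / RFC 3849),
# NOT real Common Crawl addresses. Replace with the published list.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def claims_ccbot(user_agent: str) -> bool:
    """True if the User-Agent header contains the CCBot substring."""
    return UA_SUBSTRING in user_agent


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if the client IP falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(c, strict=False) for c in cidrs)


def hostname_matches(hostname: str) -> bool:
    """True if a reverse-DNS hostname ends with the expected suffix."""
    return hostname.rstrip(".").lower().endswith(RDNS_SUFFIX)


def robots_allows(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Evaluate a robots.txt body for the given agent token and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The opt-out described under opt_out_mechanism.
BLOCK_CCBOT = "User-agent: CCBot\nDisallow: /\n"

if __name__ == "__main__":
    ua = "CCBot/2.0 (https://commoncrawl.org/faq/)"
    print("claims CCBot:", claims_ccbot(ua))
    print("IP in placeholder ranges:", ip_in_ranges("192.0.2.10", EXAMPLE_RANGES))
    print("rDNS suffix ok:", hostname_matches("node1.crawl.commoncrawl.org"))
    print("robots allows CCBot:",
          robots_allows(BLOCK_CCBOT, "https://example.com/page"))
```

A User-Agent match alone only shows that a request claims to be CCBot, so the sketch treats the IP-range and reverse-DNS checks as the actual verification. The hostname passed to `hostname_matches` would come from a reverse lookup the caller performs (for example with `socket.gethostbyaddr`), ideally confirmed by a forward lookup that resolves back to the same IP.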
