Diffbot

Diffbot · data-aggregation

name
Diffbot
operator
Diffbot
purpose
data-aggregation
ua_substring
Diffbot
robots_token
Diffbot
respects_robots
yes
verify
No operator-published, authoritative IP-range file has been confirmed, so verify by user-agent match combined with edge controls. Diffbot documents that Crawlbot adheres to robots.txt by default.
notes
Diffbot's Crawlbot extracts and structures web content into a knowledge graph that is sold to customers (market intelligence, e-commerce, AI training). It is registered here as a 'data-provider' (an Agents Welcome taxonomy extension). Diffbot documents that crawls honor robots.txt disallow and crawl-delay directives by default.
canonical_name
Diffbot
user_agent_token
Diffbot
ua_full
— verify-against-primary-at-build ↗ https://docs.diffbot.com/docs/does-crawl-respect-robotstxt
bot_type
data-provider
bot_type_extension
data-provider (Agents Welcome registry extension beyond the cited 6-type set)
opt_out_mechanism
robots.txt disallow (User-agent: Diffbot)
published_ip_range_url
— verify-against-primary-at-build ↗ https://docs.diffbot.com/docs/does-crawl-respect-robotstxt
asn
— verify-against-primary-at-build ↗ https://docs.diffbot.com/
reverse_dns_suffix
— verify-against-primary-at-build ↗ https://docs.diffbot.com/
supports_web_bot_auth
— verify-against-primary-at-build ↗ https://docs.diffbot.com/
signature_agent_domain
— verify-against-primary-at-build ↗ https://docs.diffbot.com/
jwks_url
— verify-against-primary-at-build ↗ https://docs.diffbot.com/
verification_methods
user-agent-match
crawl_traffic_share
— verify-against-primary-at-build ↗ https://radar.cloudflare.com/bots
targeted_content_type
HTML, text, structured data
documentation_url
https://docs.diffbot.com/docs/does-crawl-respect-robotstxt
first_seen_date
— verify-against-primary-at-build ↗ https://docs.diffbot.com/
last_verified_date
2026-06-15
block_vs_allow_recommendation
conditional: Diffbot is a data-provider crawler that structures content for resale (including downstream AI training). Allow it if you want representation in Diffbot's knowledge graph; block it via robots.txt to opt out. It sends no direct referral traffic.
citation_referral_value
low (data aggregation for resale; no direct citation or referral)
cloudflare_verified_category
— verify-against-primary-at-build ↗ https://radar.cloudflare.com/bots/directory/diffbot
status
active
triples
["Diffbot","operated_by","Diffbot"]
["Diffbot","has_bot_type","data-provider"]
["Diffbot","verified_via","user-agent-match"]
attribute_sources
{"claims":["user_agent_token","robots_token","respects_robots","documentation_url","opt_out_mechanism"],"source":"https://docs.diffbot.com/docs/does-crawl-respect-robotstxt","last_verified":"2026-06-15"}
{"claims":["operator","bot_type"],"source":"https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json","last_verified":"2026-06-15"}
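
The opt_out_mechanism and verification_methods fields above can be sketched in Python using only the standard library. This is a minimal illustration, not Diffbot's or the registry's own tooling. The sample robots.txt body, the `example.com` URL, the sample user-agent strings, and the helper names `build_parser` and `is_diffbot` are all hypothetical. The case-insensitive substring match is likewise an assumption, since the full user-agent string is not confirmed in this record.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of Diffbot while allowing other agents.
# The token "Diffbot" comes from the robots_token field of this record.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_diffbot(user_agent: str) -> bool:
    """User-agent match on the ua_substring field.

    Assumption: the match is case-insensitive. A user-agent header can be
    spoofed, so this check identifies claimed Diffbot traffic only.
    """
    return "diffbot" in user_agent.lower()


parser = build_parser(ROBOTS_TXT)

# The Diffbot group disallows everything; other agents fall through to "*".
diffbot_allowed = parser.can_fetch("Diffbot", "https://example.com/page")
other_allowed = parser.can_fetch("ExampleBrowser", "https://example.com/page")

# Hypothetical header values, used only to exercise the matcher.
claims_diffbot = is_diffbot("ExampleAgent/1.0 (compatible; Diffbot)")
claims_other = is_diffbot("ExampleBrowser/1.0")
```

Because no authoritative IP-range file is confirmed for this crawler, a user-agent match like `is_diffbot` is best treated as a routing hint for edge rules rather than as proof of origin.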
