ICC-Crawler
NICT (National Institute of Information and Communications Technology) · training
- name
- ICC-Crawler
- operator
- NICT (National Institute of Information and Communications Technology)
- purpose
- training
- ua_substring
- ICC-Crawler
- robots_token
- ICC-Crawler
- respects_robots
- yes
- verify
- Verify by user-agent match plus edge controls. The ai.robots.txt registry records respects-robots = Yes. No operator-published IP-range file has been confirmed.
- notes
- Crawls data to train and support AI technologies. NICT (Japan) uses the collected data for AI and may provide it to third parties, including commercial companies. The token and operator are recorded in the ai.robots.txt machine-readable registry.
- canonical_name
- ICC-Crawler
- user_agent_token
- ICC-Crawler
- ua_full
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- bot_type
- training
- bot_type_extension
- —
- opt_out_mechanism
- robots.txt disallow (User-agent: ICC-Crawler)
- published_ip_range_url
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- asn
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- reverse_dns_suffix
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- supports_web_bot_auth
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- signature_agent_domain
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- jwks_url
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- verification_methods
- user-agent-match
- crawl_traffic_share
- — verify-against-primary-at-build ↗ https://radar.cloudflare.com/bots
- targeted_content_type
- HTML, text
- documentation_url
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- first_seen_date
- — verify-against-primary-at-build ↗ https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json
- last_verified_date
- 2026-06-15
- block_vs_allow_recommendation
- conditional: research/training crawler that may share collected data with third parties, including commercial companies. Allow it if you want your content represented in the collected data; disallow it in robots.txt to opt out. It sends no direct referral traffic.
- citation_referral_value
- low (training/data collection; no direct citation or referral)
- cloudflare_verified_category
- — verify-against-primary-at-build ↗ https://radar.cloudflare.com/bots/directory/icc-crawler
- status
- active
- triples
- ["ICC-Crawler","operated_by","NICT"]
- ["ICC-Crawler","has_bot_type","training"]
- ["ICC-Crawler","verified_via","user-agent-match"]
- attribute_sources
- {"claims":["user_agent_token","robots_token","operator","respects_robots","purpose"],"source":"https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.json","last_verified":"2026-06-15"}
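
The record lists a robots.txt disallow as the opt-out mechanism and user-agent matching as the only verification method. A minimal sketch of both, using Python's standard `urllib.robotparser`, might look like the following. The robots.txt body, the `example.com` URL, the helper names (`is_icc_crawler`, `build_parser`), and the full user-agent strings in the usage note are assumptions for illustration; only the `ICC-Crawler` token comes from this record, and the full user-agent string is not confirmed here.

```python
from urllib.robotparser import RobotFileParser

# Token taken from the robots_token / ua_substring fields of this record.
ROBOTS_TOKEN = "ICC-Crawler"

# Hypothetical robots.txt that opts out of ICC-Crawler only
# and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: ICC-Crawler
Disallow: /

User-agent: *
Allow: /
"""


def is_icc_crawler(user_agent: str) -> bool:
    """Case-insensitive substring match on a User-Agent header value.

    This is the only verification method the record lists; a substring
    match can be spoofed, so treat the result as a hint, not proof.
    """
    return ROBOTS_TOKEN.lower() in user_agent.lower()


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Whether each agent may fetch a page under the hypothetical rules above.
icc_allowed = parser.can_fetch(ROBOTS_TOKEN, "https://example.com/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/page")
```

With these rules, `icc_allowed` is `False` and `other_allowed` is `True`. A call such as `is_icc_crawler("Mozilla/5.0 (compatible; ICC-Crawler/1.0)")` returns `True` for that made-up header value. Because the record confirms no published IP range, ASN, or reverse-DNS suffix, the match cannot be strengthened beyond the user-agent string in this sketch.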