Complete list of AI crawl bots and what they do

What are the types of AI crawling bots?

AI crawlers serve different functions. Sometimes each bot has a clearly distinct role, and sometimes the roles overlap. The three roles are:

Training → To build or improve models
Search / indexing → To discover and surface pages in AI search
User-action retrieval → To fetch pages in response to a live query

Take OpenAI as an example. Its bots have distinct roles:

GPTBot → Specifically a training crawler
ChatGPT-User → Fetches pages when a user asks for them
OAI-SearchBot → Indexes for ChatGPT search
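
The role split above can be sketched as a small lookup. This is a minimal illustration, not an official OpenAI mechanism: the function name classify_user_agent, the role labels, and the sample User-Agent strings are hypothetical, and matching on a substring of the User-Agent header is an assumption. User-Agent strings can be spoofed, so treat the result as a hint rather than proof of identity.

```python
from typing import Optional

# Roles copied from the OpenAI example above.
# Matching on a case-insensitive substring is an assumption for illustration.
OPENAI_BOT_ROLES = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-action",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the role of a known OpenAI bot, or None if no token matches."""
    ua = user_agent.lower()
    for token, role in OPENAI_BOT_ROLES.items():
        if token.lower() in ua:
            return role
    return None


# Simplified, illustrative User-Agent strings (not the vendors' exact headers).
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
for sample in samples:
    print(sample, "->", classify_user_agent(sample))
```

The last sample is an ordinary browser string, so it maps to None.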

For more on managing AI crawl bots by role, see the sections below the table.

AI crawl bots organized by purpose

Bot	Claimed purpose	Practical category	Notes
GPTBot	Collects content that may be used to train generative AI foundation models	Training	Separate from search and user actions; blocking affects future training use
OAI-SearchBot	Surfaces websites in ChatGPT search results	Search	Allowing it supports ChatGPT search visibility; separate from GPTBot
ChatGPT-User	Used for certain user actions in ChatGPT and Custom GPTs	User Action	Not used for automatic crawling or Search eligibility
ClaudeBot	Collects web content for model training	Training	Distinct from Claude-SearchBot and Claude-User
Claude-SearchBot	Crawls the web to improve Claude search results	Search	Separate search/indexing crawler
Claude-User	Fetches pages when a Claude user asks a question	User Action	User-initiated retrieval; not a training crawler
Meta-ExternalAgent	Used for AI indexing/training	Training	Meta’s primary AI crawler
Meta-ExternalFetcher	User-prompted fetches / direct viewing flows	User Action	Not treated as a crawler in some third-party bot analyses
Meta-ExternalAds	Used to improve advertising and products	Ads / analytics	Separate from training and search
Googlebot	General Google crawling for search and related indexing	Mixed Purpose	Best treated as search/mixed, not pure training
GoogleOther	Additional Google crawling outside the main search crawler	Mixed / Other	Distinct from Googlebot in verified-bot lists
Bingbot	Bing crawling for search and related AI surfaces	Mixed Purpose	Best treated as search/mixed, not pure training
Amazonbot	Crawls for Amazon search and related discovery	Search	Commonly treated as a search-oriented crawler
Applebot	Apple crawling for search/services indexing	Search / Mixed	Not clearly a training-only bot
Bytespider	AI-related web crawling used for training	Training	Commonly listed as training-oriented
TikTokSpider	User-facing fetch / preview behavior	User Action	More akin to user-triggered retrieval than training
PerplexityBot	Crawls for search / answer retrieval	Search	Search-oriented crawler
Perplexity-User	Fetches on behalf of a user query	User Action	User-initiated retrieval
MistralAI-User	User-requested fetches	User Action	User-triggered browsing
PetalBot	Huawei crawler used for AI/search-related purposes	AI Crawler	Often grouped as an AI crawler in verified-bot lists
AI2Bot	Research/training crawling	Training	Long-tail training bot
CCBot	Web-scale crawl used for training datasets	Training	Widely recognized dataset source
Timpibot	Training-oriented crawler	Training	Long-tail training bot
Cotoyogi	Training-oriented crawler	Training	Long-tail training bot
Diffbot	Extraction/knowledge graph style crawling	Training / extraction	Better treated as data extraction/indexing than a search bot
YouBot	Search crawler	Search	Search-oriented crawler
Sidetrade	Search/minor crawler	Search	Minor search bot
aiHitBot	Search/minor crawler	Search	Minor search bot

What do we know about training crawler access?

Training crawlers collect and process data for model development, fine-tuning, or model updates. Public documentation from AI companies describes high-level intent, but the exact weighting, retention, and reuse of content inside models is largely a black box.

As a result, we have no clear documentation suggesting that allowing training bots leads to higher visibility, more citations, or better AI answer results. What we do know is that:

Training data contributes to overall model behavior
Content use does not translate to attribution
Retrieval systems (search + user-action) are separate from training pipelines

The relationship between training access and downstream visibility is indirect and not well quantified.

Search and indexing crawlers

Search crawlers influence whether your site can appear in AI search products, answer engines, or hybrid search interfaces. They operate more like traditional search bots and are responsible for:

Discovering pages
Parsing content
Making that content available to retrieval systems

User-action retrieval crawlers

User-action crawlers are triggered in real time to fetch pages when a user asks a question or when the answer engine needs to verify or pull content from a live source. These crawlers operate on demand and determine whether your page can be quoted, summarized, or directly referenced in a live response.

Should you block AI crawl bots?

Control over these crawlers usually happens at two levels:

robots.txt directives (allow / disallow specific user agents)
network-level filtering (WAF, bot management tools, CDN rules)
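
As an illustration of the robots.txt level, the sketch below checks a sample policy with Python's standard urllib.robotparser module. The policy itself (block GPTBot, allow OAI-SearchBot and ChatGPT-User), the helper name allowed_agents, and the example.com URL are assumptions made for the example, not a recommendation. robots.txt is advisory, so this shows what a compliant crawler would be permitted to do, not what every crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow search and user-action bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report, per user agent, whether the given robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "https://example.com/blog/post",  # placeholder URL
)
print(result)
# {'GPTBot': False, 'OAI-SearchBot': True, 'ChatGPT-User': True}
```

Running a check like this against a draft robots.txt before deploying it can catch a rule that blocks more bots than intended.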

Cloudflare and other platforms include bot management and “bot fight” modes that can automatically challenge or block traffic identified as non-human. Similar controls exist across other CDNs and security layers.

Be careful when flipping a bot control toggle, because it can have unintended consequences:

Blocking a training crawler means your content is excluded from future training runs (where respected), but does not affect existing models
Blocking a search crawler reduces or removes your content from that system’s discovery layer
Blocking a user-action crawler prevents real-time fetching, which can limit how your content appears in live answers
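
The trade-offs above can be previewed before changing a rule. The sketch below is a hypothetical helper, not part of any bot-management product: preview_block and both dictionaries are made up for illustration, with categories copied from the table for a handful of bots and effects paraphrased from the three points above.

```python
# Practical categories copied from the table above for a few bots (illustrative subset).
BOT_CATEGORY = {
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user-action",
    "PerplexityBot": "search",
    "Perplexity-User": "user-action",
}

# Effects paraphrased from the list above.
EFFECT_BY_CATEGORY = {
    "training": "excluded from future training runs (where respected); existing models unaffected",
    "search": "reduced or removed from that system's discovery layer",
    "user-action": "no real-time fetching, which can limit appearance in live answers",
}


def preview_block(bots: list[str]) -> dict[str, str]:
    """Map each bot to the expected effect of blocking it."""
    report = {}
    for bot in bots:
        category = BOT_CATEGORY.get(bot)
        # Bots missing from the lookup are flagged rather than guessed at.
        report[bot] = EFFECT_BY_CATEGORY.get(category, "unknown category; review manually")
    return report


for bot, effect in preview_block(["ClaudeBot", "Perplexity-User", "SomeNewBot"]).items():
    print(f"{bot}: {effect}")
```

Bots with mixed roles, such as Googlebot or Bingbot in the table, do not fit a single category and would need a manual review.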