Overview of Google crawlers and fetchers (user agents)
Google uses crawlers and fetchers to perform actions for its products, either automatically or triggered by user request.
Crawler, robot or spider is a generic term for any program that is used to automatically discover and scan websites by following links from one web page to another. Google’s main crawler used for Google Search is called Googlebot.
Introduction:
Web crawlers, also known as spiders or bots, are automated programs that search engines use to systematically browse the web and index its pages. This guide covers Google’s crawlers, above all Googlebot, the different crawler types, and how crawling relates to Google Search Console (GSC) and backlinks.
Types of Web Crawlers:
Common crawlers
Google’s common crawlers are used to build Google’s search indices, perform other product-specific crawls, and run analysis crawls. They always obey robots.txt rules and generally crawl from the IP ranges published in the googlebot.json object.
| Crawler | Description | User agent token(s) |
| --- | --- | --- |
| Googlebot Smartphone | Google’s main crawler for Search, crawling with a smartphone user agent. | Googlebot |
| Googlebot Desktop | Google’s main crawler for Search, crawling with a desktop user agent. | Googlebot |
| Googlebot Image | Crawls image bytes for Google Images and products dependent on images. | Googlebot-Image, Googlebot |
| Googlebot News | Uses Googlebot for crawling news articles, but respects its historic user agent token Googlebot-News. | Googlebot-News, Googlebot |
| Googlebot Video | Crawls video bytes for Google Video and products dependent on videos. | Googlebot-Video, Googlebot |
| Google StoreBot | Crawls certain types of pages, including, but not limited to, product details pages, cart pages, and checkout pages. | Storebot-Google |
| Google-InspectionTool | Used by Search testing tools such as the Rich Result Test and URL inspection in Search Console. Apart from the user agent and user agent token, it mimics Googlebot. | Google-InspectionTool, Googlebot |
| GoogleOther | Generic crawler that may be used by various product teams for fetching publicly accessible content from sites, for example for one-off crawls for internal research and development. | GoogleOther (also the full user agent string) |
| Google-Extended | A standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs, including future generations of models that power those products. Google-Extended does not impact a site’s inclusion or ranking in Google Search. | Google-Extended |
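Because all of these crawlers honor robots.txt, the user agent tokens above are what you target there. As an illustrative sketch (the paths are placeholders, not recommendations), a site could keep normal Search crawling while opting out of generative AI improvement via the Google-Extended token:

```
# Block a few private paths for every crawler.
User-agent: *
Disallow: /cart/
Disallow: /checkout/

# Opt the whole site out of Gemini Apps and Vertex AI improvement.
# Per the table above, this does not affect Google Search.
User-agent: Google-Extended
Disallow: /
```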
Special-case crawlers
The special-case crawlers are used by specific products where there’s an agreement between the crawled site and the product about the crawl process. For example, AdsBot ignores the global robots.txt user agent (*) with the ad publisher’s permission. Because the special-case crawlers may ignore robots.txt rules, they operate from a different IP range than the common crawlers; those ranges are published in the special-crawlers.json object (see the verification sketch after the table below).
| Crawler | Description | User agent token |
| --- | --- | --- |
| APIs-Google | Used by Google APIs to deliver push notification messages. Ignores the global user agent (*) in robots.txt. | APIs-Google |
| AdsBot Mobile Web Android | Checks Android web page ad quality. Ignores the global user agent (*) in robots.txt. | AdsBot-Google-Mobile |
| AdsBot Mobile Web | Checks iPhone web page ad quality. Ignores the global user agent (*) in robots.txt. | AdsBot-Google-Mobile |
| AdsBot | Checks desktop web page ad quality. Ignores the global user agent (*) in robots.txt. | AdsBot-Google |
| AdSense | Visits your site to determine its content in order to provide relevant ads. Ignores the global user agent (*) in robots.txt. | Mediapartners-Google |
| Mobile AdSense | Visits your site to determine its content in order to provide relevant ads. Ignores the global user agent (*) in robots.txt. | Mediapartners-Google |
| Google-Safety | Handles abuse-specific crawling, such as malware discovery for publicly posted links on Google properties. Ignores robots.txt rules entirely. | Google-Safety (also the full user agent string) |
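Since any client can claim to be Googlebot in its user agent string, requests are best verified against the published IP ranges. Below is a minimal Python sketch; it assumes the JSON objects keep their current documented shape (a prefixes array of ipv4Prefix/ipv6Prefix entries) and the URLs where Google currently publishes them:

```python
import ipaddress
import json
import urllib.request

# Published lists of IP ranges for Google's common and special-case crawlers.
RANGE_URLS = [
    "https://developers.google.com/static/search/apis/ipranges/googlebot.json",
    "https://developers.google.com/static/search/apis/ipranges/special-crawlers.json",
]

def load_google_networks():
    """Fetch the published CIDR prefixes and parse them into network objects."""
    networks = []
    for url in RANGE_URLS:
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for entry in data.get("prefixes", []):
            prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if prefix:
                networks.append(ipaddress.ip_network(prefix))
    return networks

def is_google_crawler(ip, networks):
    """Return True if the client IP falls inside any published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

nets = load_google_networks()
print(is_google_crawler("66.249.66.1", nets))  # an address in a classic Googlebot range
```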
Crawling Process:
Googlebot begins its crawl by fetching a few web pages and then following the links on those pages to discover new content. It prioritizes pages based on factors like popularity, relevance, and freshness. Google uses complex algorithms to determine crawling frequency and depth for each website.
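To make the link-following mechanic concrete, here is a toy breadth-first crawler in Python. It is purely illustrative and nothing like Googlebot’s real scheduling; the seed URL is a placeholder, and a production crawler would also honor robots.txt:

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

def crawl(seed_url, max_pages=20):
    """Toy breadth-first crawler: fetch a page, extract its links, follow them."""
    frontier = deque([seed_url])       # URLs waiting to be fetched
    seen = {seed_url}                  # avoid refetching the same URL
    host = urlparse(seed_url).netloc   # stay on the seed's host for politeness

    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            req = Request(url, headers={"User-Agent": "toy-crawler/0.1"})
            with urlopen(req, timeout=10) as resp:
                html = resp.read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        print("fetched:", url)
        # Naive link extraction; a real crawler uses a proper HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl("https://example.com/")
```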
Google Search Console (GSC) and Crawling:
GSC provides webmasters with valuable insights into how Google crawls and indexes their websites. It allows site owners to monitor crawl errors, submit sitemaps, and analyze indexing data. By utilizing GSC, webmasters can optimize their websites for better crawling and indexing performance.
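Sitemap submission in particular can be scripted through the Search Console API (the webmasters v3 surface). In the sketch below, the property URL, sitemap URL, and credentials file are placeholders, and the property must already be verified for the service account:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder property, sitemap, and credentials; substitute your own.
SITE = "https://example.com/"
SITEMAP = "https://example.com/sitemap.xml"
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters"],
)

service = build("webmasters", "v3", credentials=creds)

# Submit (or resubmit) a sitemap for the verified property.
service.sitemaps().submit(siteUrl=SITE, feedpath=SITEMAP).execute()

# List the sitemaps Google knows about, with last-downloaded timestamps.
for sm in service.sitemaps().list(siteUrl=SITE).execute().get("sitemap", []):
    print(sm["path"], sm.get("lastDownloaded", "never"))
```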
Crawl Budget:
Crawl budget refers to the number of URLs Googlebot can and wants to crawl on a website within a given time frame. It is influenced by factors like site speed, server health, and crawl demand. Optimizing crawl budget ensures that Googlebot spends its visits on the most important pages of a website.
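A practical way to see where crawl budget actually goes is to count Googlebot requests per path in the server access logs. A rough sketch, assuming combined log format and matching on the user agent string only (pair it with the IP verification above for rigor):

```python
import re
from collections import Counter

# Combined log format: ip - - [date] "METHOD /path HTTP/x.x" status size "referer" "user-agent"
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if m and "Googlebot" in m.group("ua"):
            hits[m.group("path")] += 1

# The paths consuming the most crawl budget.
for path, count in hits.most_common(10):
    print(f"{count:6d}  {path}")
```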
Crawl Rate and Frequency:
Crawl rate determines how frequently Googlebot crawls a website. Websites with high-quality content, fast load times, and few server errors are crawled more often, while sustained 5xx or 429 responses signal Googlebot to temporarily slow down (a sketch of that signal follows below). Optimizing crawl rate therefore comes down to improving site performance and giving Googlebot a smooth crawling experience.
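Google documents this pressure valve explicitly: returning 503 or 429 during overload makes Googlebot back off for a while. A minimal standard-library sketch of the idea, with the overload check left as a placeholder:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def server_overloaded():
    """Placeholder: check load average, queue depth, upstream health, etc."""
    return False

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if server_overloaded():
            # 503 + Retry-After asks well-behaved crawlers to slow down.
            self.send_response(503)
            self.send_header("Retry-After", "3600")  # seconds
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"<html><body>ok</body></html>")

HTTPServer(("", 8000), Handler).serve_forever()
```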
Relation with Backlinks:
Backlinks play a crucial role in web crawling and indexing. They act as pathways for crawlers to discover new web pages and assess their relevance and authority. High-quality backlinks from reputable websites can improve a website’s crawlability and search engine visibility.
Best Practices for Optimizing for Web Crawlers:
Optimizing websites for web crawlers involves various strategies, including creating XML sitemaps, optimizing robots.txt files, and improving site structure and internal linking. Providing clear navigation and high-quality content also enhances crawlability and indexing.
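Of these practices, the XML sitemap is the most mechanical to automate. A minimal generator sketch in Python (the page list is a placeholder; in practice it would come from your CMS or route table):

```python
import xml.etree.ElementTree as ET
from datetime import date

# Placeholder URL list; in practice, pull these from your CMS or router.
PAGES = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog",
]

# Standard sitemaps.org namespace for <urlset>.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for page in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page
    ET.SubElement(url, "lastmod").text = date.today().isoformat()

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```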
Conclusion:
Understanding web crawlers, Googlebot, and their relationship with Google Search Console and backlinks is essential for website owners and marketers. By optimizing websites for web crawling and indexing, businesses can improve their search engine visibility, drive organic traffic, and ultimately achieve their online objectives.