What is Robots.txt?

Robots.txt is a text file that uses simple commands to tell search engine crawlers which parts of a site they may and may not crawl.

What Is Robots.txt and How Does It Work?

Robots.txt is a set of instructions for bots (especially search engine crawlers) that tells them which parts of a website they should and should not visit, so they can navigate it more effectively. The robots.txt file is a plain text file that lives at the root of a website (for example, example.com/robots.txt); it is publicly accessible, but most visitors never see it unless they navigate to that URL directly. This helps with search engine optimization (SEO) and ranking without impacting the user experience. Using robots.txt correctly is an essential element of technical optimization.

When bots crawl a site, it helps them to know where to go. Robots.txt acts like a map, guiding crawlers toward the pages that matter most. Written correctly, this text file can be highly effective. Written poorly, it can confuse or misdirect bots, or simply hide important pages and content from them.

The file is created in a plain text editor, then uploaded to the root directory of the web server, typically over File Transfer Protocol (FTP) or through a hosting provider’s file manager. Some malicious robots disregard the file entirely, so it’s important to protect websites and private information with other security measures as well.

Part of the Robots Exclusion Protocol, robots.txt was created in 1994 by Martijn Koster after crawlers overwhelmed his website. Though not technically an “internet standard” for most of its history, many web developers treat it as one, and search engines adopted it to help manage their server resources.

Importance of Robots.txt

Bots are constantly crawling and indexing websites, looking for information relevant to any given search query. Websites want their best content indexed so it can rank well. A robots.txt file can tell search engines to skip certain pages, allowing them to prioritize the ones that matter.

One example is internal search result pages, such as those on an ecommerce site. These sites let users search for items on the site itself, and those internal searches can generate thousands of pages. Websites don’t want search engines to crawl those pages: it wastes crawl time, and because each page is only relevant to the user who typed the query, the information is useless to anyone else. By following the robots.txt file, search engines bypass all those pages and focus on content that matters to users.
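
For illustration, here is a minimal sketch for a hypothetical ecommerce site whose internal search results live under a /search/ path or use a ?q= query parameter (both paths are invented for this example):

    User-agent: *
    Disallow: /search/
    Disallow: /*?q=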

What Is a Meta Robots Tag?

A meta robots tag is included in a webpage’s HTML and tells robots how to crawl and index that specific page. The tag uses directives such as the following (an example tag appears after this list):

  • Noindex: Tells a search engine not to index a page.
  • Follow: Tells a crawler to follow all the links on a page.
  • Nofollow: Tells a crawler not to follow any links on a page.
  • Noimageindex: Tells a crawler not to index any images on a page.
  • None: Like using the noindex and nofollow tags together.
  • Noarchive: Tells a search engine not to show a cached link to the page on a SERP.
  • Nosnippet: Tells a search engine not to show a snippet of this page on a SERP.
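
As a sketch, a page that should stay out of the index and whose links should not be followed could include a tag like this in its HTML head, using the directives listed above:

    <meta name="robots" content="noindex, nofollow">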

Robots.txt vs. Meta Robots Tags

While robots.txt and meta robots tags are similar in name and purpose, the main difference lies in their scope.

Meta robots tags control how bots interact with individual webpages and specific elements on them, while robots.txt files control how bots interact with file paths, affecting many pages at once across a site. Meta tags give crawlers page-level instructions and context, while robots.txt tells them whether to visit a path at all.

For example, a title tag helps crawlers understand what text represents the title and purpose of a page, whereas robots.txt tells crawlers whether to crawl the page at all. Both can be used together, but a search engine applies robots.txt first: if a page is disallowed in robots.txt, the crawler never fetches it, so it never sees that page’s meta robots tags.

How To Use Robots.txt

Robots.txt files can instruct bots to stay out of certain pages. The file is placed in the top-level directory of the web server. When a robot looks for the text file, it takes everything in the URL that comes after the domain name and replaces it with “/robots.txt”, then checks that location before crawling.
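
For instance, using a made-up domain, a crawler that wants to visit:

    https://www.example.com/shop/widgets/blue-widget

would first request:

    https://www.example.com/robots.txt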

Each file contains one or more user-agents and commands. A webmaster can specify how a user-agent, or piece of web crawling software, crawls their site, listing which pages, folders, or individual URLs it may or may not visit. To save time for the webmaster and the crawlers, a sitemap can also be referenced in the text file to tell web crawlers about the site’s available pages and their metadata.
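
A minimal sketch of such a file, with hypothetical paths and a hypothetical sitemap URL:

    User-agent: *
    Disallow: /admin/
    Disallow: /cart/

    Sitemap: https://www.example.com/sitemap.xml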

Amazon’s own internal site search creates thousands of temporary pages every day as users search on the site. To make sure none of these pages show up on search results pages, Amazon uses robots.txt to disallow search engines from crawling them, which also saves crawlers from wasting time on duplicate product pages.

How To Write Robots.txt

Writing robots.txt may look overwhelming, but once practiced, it follows a basic pattern. The file is written in blocks of rules, each created for specific user-agents and separated by line breaks. Each block can list multiple user-agents, and the same user-agent can appear in multiple blocks.
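
For instance, here is a sketch with two blocks: one shared by Googlebot and Bingbot, and one for every other crawler (the blocked paths are invented for illustration):

    User-agent: Googlebot
    User-agent: Bingbot
    Disallow: /search/

    User-agent: *
    Disallow: /search/
    Disallow: /tmp/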

Common Expressions

A few common expressions are used in robots.txt rules (a combined example follows the list):

  • Asterisk (*): A wildcard. On a User-agent line it indicates all robots; in a path it matches any sequence of characters.
  • Dollar Sign ($): Signifies the end of a URL string, so a rule only matches URLs that end with the characters before it.
  • Pound Sign or Hashtag (#): Indicates a note from the developer; everything after it on the line will not be taken into account by user-agents.
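
A short sketch that combines all three expressions (the file type is chosen only for illustration; wildcard and end-of-URL matching are supported by the major search engines but were not part of the original protocol):

    # Keep all crawlers away from URLs that end in .pdf
    User-agent: *
    Disallow: /*.pdf$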

Disallow

This command tells search engines to avoid crawling a web page or folder. Each file path that should be blocked needs its own disallow line, although a wildcard can cover several similar paths with a single rule.
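
A sketch with separate disallow lines for two hypothetical folders:

    User-agent: *
    Disallow: /checkout/
    Disallow: /private-files/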

Allow

For most user-agents, crawling everything is the default behavior, so this command often doesn’t need to be used. Allow is not part of the original protocol, but Googlebot and most other major crawlers support it. It lets a user-agent access a subfolder or page even if the parent folder has been ‘disallowed’.
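
A sketch in which one subfolder stays crawlable inside an otherwise disallowed parent folder (both paths are hypothetical):

    User-agent: *
    Disallow: /media/
    Allow: /media/logos/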

Robots.txt Best Practices

To use robots.txt effectively, there are a few best practices to keep in mind:

  • Some search engines have multiple user-agents. For instance, Google has both Googlebot and Googlebot-Image.
  • Each subdomain needs its own robots.txt file, separate from the one on the root domain.
  • The robots.txt file name and the paths in its rules are case sensitive, but the directive names themselves are not.
  • A robots.txt file cannot hide private user information, since the file is public and compliance is voluntary. This is where a meta robots tag can come in handy.
  • Add a sitemap reference to the robots.txt file to allow search engines to crawl the site more efficiently (see the sketch after this list).
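
A brief sketch pulling these practices together, with a separate block for Googlebot-Image and a hypothetical sitemap URL:

    User-agent: Googlebot-Image
    Disallow: /drafts/

    User-agent: *
    Disallow: /admin/

    Sitemap: https://www.example.com/sitemap.xml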

Creating and using robots.txt not only enhances SEO, but it also streamlines the user experience. Search engines can crawl and index better, allowing websites to appear in the SERPs in the best, most relevant ways possible.