Robots.txt is a text file that tells search engines what to crawl on a site by using certain commands.
Robots.txt is a set of instructions for bots (especially search engine crawlers) that tells them which parts of a website they may crawl, so they can navigate it more efficiently. The robots.txt exists as a plain text file in the root directory of a website, not in the HTML of its pages; it is publicly available at an address such as https://www.example.com/robots.txt, but visitors rarely see it unless they request that address directly. This helps with search engine optimization (SEO) and ranking without impacting the user experience. Using robots.txt correctly is an essential element of technical optimization.
When bots crawl a site, it’s helpful for them to know where to go. Robots.txt acts like a set of signposts, steering bots away from pages that don’t need crawling so they can spend their time on the content that matters. When done correctly, this text file can be highly effective. However, it can also be unclear, or in some cases flat out wrong, confusing bots, misdirecting them, or simply hiding important pages and content from them.
The file is created in a plain text editor, then uploaded to the root directory of the website, often over File Transfer Protocol (FTP). Robots.txt is advisory only: malicious robots can simply disregard it, so it’s important to protect websites and private information with real security measures, such as authentication.
Part of the Robots Exclusion Protocol, robots.txt was created in 1994 by Martijn Koster after crawlers overwhelmed his website. For decades it was a de facto convention rather than a formal “internet standard,” though search engines adopted it to help manage server resources; the protocol was eventually published as RFC 9309 in 2022.
Bots are constantly crawling and indexing websites for information relevant to any given search query. Websites want their best content indexed so that it ranks well. Robots.txt files can tell search engines to avoid certain pages, thereby prioritizing others.
One example is internal search result pages, like those on an ecommerce site. These sites let users search for items on the site itself, and those internal searches can generate thousands of pages. Websites don’t want search engines to crawl those pages: crawling them takes too much time, and because each page is only relevant to the user who typed the query, it is of little use to anyone else. Instead, by following the robots.txt, search engines bypass all those pages and focus on content that matters to users.
A meta robots tag is included in the HTML of an individual page, and tells robots how to index that page. This tag uses directives such as noindex and nofollow.
While robots.txt and meta robots tags are similar in name and job, the main difference lies in their scope.
Meta robots tags control how bots treat individual webpages, while robots.txt files control how bots interact with file paths, affecting many pages at once across a site. Meta tags in general help make elements of a page easier to understand and contextualize, while the robots.txt instructs the bots on whether to access a page at all.
For example, a title tag helps crawlers understand what text represents the title and purpose of a page, whereas the robots.txt tells crawlers whether or not to crawl that page in the first place. While both can be used, robots.txt takes effect first: if it blocks a page from being crawled, the crawler never fetches the page and so never sees its meta robots tag.
Robots.txt files can instruct bots to stay out of certain pages. The file is placed in the top-level directory of the web server. When a robot looks for the robots.txt file, it takes everything in the URL that comes after the first single forward slash (the path) and replaces it with “/robots.txt”.
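The lookup rule described above can be sketched in a few lines of Python using only the standard library. The function name robots_url and the example.com address are illustrative assumptions, not part of any crawler's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the whole path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/shop/index.html?page=2"))
# prints: https://www.example.com/robots.txt
```

Whatever page the bot starts from, the result always points at the same file in the top-level directory of that host.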
Each file contains at least one user-agent line and one or more commands. A webmaster can specify how a user-agent, or web crawling software, crawls their site. In the commands, they can specify which pages, folders, or individual URLs to crawl or avoid. To save time for the webmaster, the file can also point to sitemaps, which tell web crawlers about available web pages and their metadata.
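As a rough illustration of how a crawler might read these lines, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module. The file contents, the ExampleBot name, and the example.com URLs are all made up for the example; site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one user-agent line, two commands, one sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Ask whether the hypothetical ExampleBot may crawl the given path."""
    return parser.can_fetch("ExampleBot", "https://www.example.com" + path)


print(allowed("/products/shoes"))   # True: no rule matches this path
print(allowed("/search?q=shoes"))   # False: blocked by "Disallow: /search"
print(allowed("/cart/checkout"))    # False: blocked by "Disallow: /cart/"
print(parser.site_maps())           # ['https://www.example.com/sitemap.xml']
```

Here the /search rule plays the role of the internal search pages discussed earlier: a single line keeps the crawler out of every URL that begins with that path.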
Amazon’s on-site search creates thousands of temporary pages every day as users search the site. To make sure none of these pages show up on the results pages, Amazon uses robots.txt to disallow search engines from crawling such temporary and duplicate product pages, which also saves crawl time.
Writing robots.txt may look overwhelming, but with practice it follows a basic pattern. Every robots.txt file is written in blocks of rules created for specific search engine user-agents. These blocks are separated by line breaks. Each block can name multiple user-agents, and each user-agent can appear in multiple blocks.
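To make the block pattern concrete, here is a simplified sketch that splits a robots.txt file into its blocks. It is not a full parser: split_blocks is a hypothetical helper, it only recognizes user-agent, allow, and disallow lines, and the sample file is invented for illustration.

```python
def split_blocks(text: str) -> list:
    """Group robots.txt lines into (user_agents, rules) blocks."""
    blocks = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:
                # A user-agent line that follows rules starts a new block.
                blocks.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        blocks.append((agents, rules))
    return blocks


SAMPLE = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

for agents, rules in split_blocks(SAMPLE):
    print(agents, rules)
# ['Googlebot', 'Bingbot'] [('disallow', '/drafts/')]
# ['*'] [('disallow', '/')]
```

The first block shows two user-agents sharing one set of rules; the second applies to every other bot.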
A few common expressions are used to specify the robots.txt:
The Disallow command tells search engines to avoid crawling a web page or folder. There must be a separate Disallow line for each file path; listing a sitemap does not replace these lines.
The Allow command lets a user-agent access a subfolder, even if the parent folder has been ‘disallowed’. Because crawling everything is the default behavior for most user-agents, the command is only needed to carve out such exceptions. It was not part of the original 1994 protocol, but major crawlers such as Googlebot support it.
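The interaction between a disallowed parent folder and an allowed subfolder can be checked with Python's standard urllib.robotparser. The paths, the ExampleBot name, and the example.com host are assumptions made for this sketch. One caveat: Python's parser applies the first matching rule in file order, whereas Google documents a most-specific-match rule, so the allow line is listed first here to give the same answer under both approaches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /media/ but open up one subfolder inside it.
RULES = """\
User-agent: *
Allow: /media/press/
Disallow: /media/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

BASE = "https://www.example.com"
print(rp.can_fetch("ExampleBot", BASE + "/media/press/kit.pdf"))  # True
print(rp.can_fetch("ExampleBot", BASE + "/media/raw/photo.jpg"))  # False
print(rp.can_fetch("ExampleBot", BASE + "/about"))                # True
```

The last line shows the default behavior: a path that matches no rule is crawlable without any explicit allow line.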
To use robots.txt effectively, there are a few best practices to keep in mind:
Creating and using robots.txt not only enhances SEO, but it also streamlines the user experience. Search engines can crawl and index better, allowing websites to appear in the SERPs in the best, most relevant ways possible.