Robots.txt File Explanation

Learn what a robots.txt file is and how to make one to help manage bots accessing your website.

A robots.txt file is a way for website owners to manage how bots, spiders, and crawlers fetch information from their websites. It can help prevent a website from being overwhelmed by non-human traffic.

This article explains how to make a robots.txt file and details the syntax of each rule that you can put inside it. We will also show you the most commonly blocked bots, so you can decide whether to block them from your own site. If you just want to test whether your website's robots.txt file is formatted correctly, check out our Robots.txt Validator.

What a Robots.txt File Does

Any website on the internet receives two types of traffic: humans, who are looking for information or services, and bots, which gather information for a variety of purposes. By some estimates, more than 40% of the traffic on the internet comes from bots. Some bots are good. For example, Google runs a bot that fetches websites and indexes them so that they can appear in search results. Most websites want their pages to appear in search results, so they want Googlebot to access their website. Some bots are bad. These bots may be slurping up information to train AI models, or they may be looking for information that can be harvested and sold.

A robots.txt file is a way for a website to offer suggestions about how bots should access the website. Good bots may respect these suggestions, but bad bots likely will not. If you want to block the bad bots, you will need to use other mechanisms (see below).

Here are some things that you can do with a robots.txt file:

  • Block all bots from accessing a certain part of the website. This can be used to prevent unimportant information from being crawled by bots.
  • Block a specific bot from accessing all or parts of the website. This can be used to allow good bots and block some bad bots.
  • Limit the rate at which bots fetch pages from the website. Slowing bots down can reduce the load on the server.
  • Suggest that certain resources be ignored. For example, if a page includes images that bots don't need to see, you can block them. Human visitors will still see them.
  • Give bots a link to the sitemap file, which provides them with a list of important pages. This can make bots more efficient at crawling the website.

What a Robots.txt File Doesn't Do

A robots.txt file is not a way to prevent parts of a website from appearing in search engine results. If a page is linked to from anywhere else on the internet that the bot can access, the page can still appear in search engine results. If you don't want search engines linking to parts of your website, you should instead use the "noindex" meta tag or HTTP header (see below). You could also place the content in a password-protected area.
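
For reference, here is what each of those looks like. The meta tag goes in the page's HTML <head>, and the header is sent with the HTTP response (how you add it depends on your server configuration):

<meta name="robots" content="noindex">
X-Robots-Tag: noindex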

A robots.txt file is not a way to block malicious bots. Because they are malicious, they won't obey the suggestions. To enforce the rules in your robots.txt file, and block bad bots, you will need to use other technologies (see below).

Where to Place Your Robots.txt File

Give your file the name "robots.txt", all lower case, and place it at the root level of your website. It should be a plain text file with UTF-8 character encoding. Make it accessible at this location:

https://www.example.com/robots.txt

If you are using other technologies to block bots (like a firewall), make sure that the robots.txt file isn't itself being blocked. If a bot cannot access the robots.txt file, then it can't follow the rules.
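
One quick way to confirm that the file is reachable is to request it yourself, for example with a command-line tool like curl (replace the example domain with your own):

curl -I https://www.example.com/robots.txt

A 200 status code in the response means bots can fetch the file.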

Your site can have only one robots.txt file, and it should be smaller than 512KB. The rules inside apply to all content served from the same protocol and host. If the robots.txt file is found at https://www.example.com/robots.txt, then it applies only to content that starts with https://www.example.com/ and does not apply to content hosted on different subdomains, or to unencrypted content starting with "http://". Of course, you can put additional robots.txt files at those locations to manage that other content.

Bots may cache the robots.txt file for a day or longer, so when you make a change, there may be some lag time before the bots fetch the new rules and start applying them.

Syntax

A robots.txt file contains a list of rules, one per line. Each line contains a rule name, a colon, and then a value. Spaces around the colon are optional. A comment can be added by using a # sign followed by some text; the comment extends to the end of the line and is ignored by bots. The file is processed from top to bottom.
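
For example, both of the following lines follow this pattern (the path is just a placeholder):

User-agent: *        # the rule name is "User-agent" and the value is "*"
Disallow: /drafts/   # this comment is ignored by bots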

The rules can be grouped to target one or more different bots. The "User-agent" rule starts a group and indicates which bot(s) the subsequent rules apply to. If there are multiple groups for the same User-agent, then the bot will combine them into one set of rules.

Within each group, a series of "Allow" and "Disallow" rules can be listed to control what the bot can access. By default, all pages of the site are allowed, unless specifically disallowed by a rule. If there are duplicate or conflicting rules, the bot will use the rule that is the most specific (has the most characters). If two conflicting rules have the same length, the bot will likely use the least restrictive rule.
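
For example, if a group contained both of the following rules, most major crawlers would allow pages under /folder/, because when the matching rules are the same length, the less restrictive Allow rule wins:

Allow: /folder/
Disallow: /folder/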

User Agent

The User-agent rule begins a group. It indicates which bot should obey the rules that follow. The group ends when another User-agent line is encountered.

User-agent: ExampleBot

The special character * is a wildcard character. It will match all bots that do not otherwise have a group that applies to them.

User-agent: *

If there are two groups that apply to the same bot, the bot will use the one that is most specific (has the longest User-agent name), or if they are identical then they will be combined into one group. For example, if both of the groups above were included in the file, then a bot named "ExampleBot" would follow only the rules inside its named group and none of the rules in the * wildcard group. A bot named "AnotherBot" would follow only the rules in the * wildcard group.

You can list multiple User-agents for one group by including multiple lines, as long as no other rules appear in between.

User-agent: ExampleBot
User-agent: ExampleBot2
...rules for both bots...

Allow / Disallow

Within each group, a list of Allow or Disallow rules should indicate which content the bot is allowed or not allowed to access. A rule with nothing after the colon will be ignored.

Disallow: /secret/
Allow: /secret/actually_not_secret/

In the above group of rules, anything in the "secret" directory will be blocked except for pages inside the "actually_not_secret" sub-directory. This is because the second rule is more specific (it has more characters), so it takes precedence over the conflicting rule.

The * character is a wildcard that represents 0 or more instances of any other character. All rules have an implicit * character at the end. For example, the following rule will block all pages that end with ".htm" as well as all pages that end with ".html".

Disallow: /*.htm

Rules are case sensitive, so the following rule will not block "IMAGE.JPG".

Disallow: /image.jpg

To remove the implicit * at the end of every rule, use the special $ character. The $ character indicates the end of a URL. For example, the following rule would block "example.com/foo" but not "example.com/foo/bar".

Disallow: /foo$
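
The two special characters can be combined. For example, this common pattern (using PDF files purely as an illustration) blocks every URL that ends with ".pdf" and nothing else:

Disallow: /*.pdf$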

Sitemap

This rule tells bots where to find one or more sitemap files. A sitemap is an XML file that lists all of the important pages on the website, along with some extra information that can be helpful for bots. It helps bots crawl the website more efficiently. This rule does not belong to any group, so all bots will see the same sitemap. It doesn't matter where you put it in your file, but we recommend putting it at the top.

Sitemap: https://www.example.com/sitemap1.xml
Sitemap: https://www.example.com/sitemap2.xml

If your website has multiple sitemaps, list all of them on separate lines. To learn more about sitemaps, you can visit Sitemaps.org.
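
For reference, a minimal sitemap file looks something like this (the URL is a placeholder; real sitemaps usually list many pages):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
</urlset>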

Crawl-Delay

This rule is not standard, but some bots obey it. It tells the bot how quickly it is allowed to crawl the website. You can use this to tell a bot to slow down. The value indicates the number of seconds to delay between subsequent requests.

Crawl-delay: 10

Notably, Google does not obey this rule.

Putting it All Together

Here is a complete example that puts all the different pieces together.

Sitemap: https://www.example.com/sitemap.xml

# This is a comment for humans.
# The following group applies to all bots not named elsewhere in the file
User-agent: *
Disallow: /private/     # Asks bots to avoid a few areas of the website
Disallow: /unimportant/
Disallow: /*.gif        # This rule will block all gif images from being accessed

User-agent: AggressiveBot
Crawl-delay: 2          # Asks this bot to slow down
Disallow: /private/
Disallow: /unimportant/

User-agent: GoodBot
Allow: /       # While not strictly necessary, this rule gives GoodBot permission to crawl the entire site

User-agent: BadBot1
User-agent: BadBot2
Disallow: /    # These two bad bots will be blocked from the entire site


Commonly Blocked Bots

Some bots suck up all the information on your website and use it to train AI models. These AIs can then generate text and images based upon your content. If you believe that this constitutes plagiarism, you may want to prevent it. These are the most commonly blocked AI scraping bots. To block them all, list them together in a single group and end the group with a Disallow rule, as shown below.

User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Meta-ExternalAgent
User-agent: omgili
User-agent: omgilibot
User-agent: Timpibot
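Disallow: /    # Blocks every bot listed above from the entire site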

Other bots may scan your website for SEO purposes. They harvest data for keyword analysis, which may help your competitors. If you do not use these tools, you may want to block them. These are the most commonly blocked SEO bots; as above, end the group with a Disallow rule to block them.

User-agent: AhrefsBot
User-agent: MJ12bot
User-agent: SemrushBot
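Disallow: /    # Blocks the SEO bots listed above from the entire site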

Validate Your Website

We hope that this article has helped you understand the importance of a robots.txt file. Once you have added it to your website, you should test it. Enter your domain name below and run a free ValidBot test, which checks your robots.txt file and runs 100+ additional tests across a wide range of areas. If you only want to test your robots.txt file, you can use our standalone Robots.txt Validator instead.