What is robots.txt? How does a robots.txt file work?
Robots.txt, also known as the robots exclusion protocol, is a plain text file that sits in the root directory of your site. It tells the robots dispatched by search engines which pages to crawl and which to skip, so by using this file you can control which crawlers may access which parts of your website. It is a text file used in SEO that contains directives for the search engines' crawling robots, stating which specific pages can or cannot be crawled. Note that robots.txt prevents pages from being crawled; on its own, it does not reliably de-index them.
The main objective of using this file is to manage the robot's crawl budget by keeping it from crawling pages with little SEO value that nevertheless must exist for the user journey.
How can you easily find your robots.txt file?
Once you have created your robots.txt file, you have to make it live.
You can technically place a robots.txt file in any main directory of your website, but crawlers only look for it at the root of your domain. To make sure your robots.txt file is found, place it at:

https://yoursite.com/robots.txt
(Note that the robots.txt filename is case sensitive, so make sure to use a lowercase "r" in the filename.)
Why are robots.txt files important for your website?
It is important to update your robots.txt file whenever you add files, directories, or pages to your website that you don't want indexed or accessed by web users or by search engines. This helps safeguard your website and gives you the best possible results with your search engine optimization.
Usually, most websites don't need a robots.txt file, because Google will typically find and index all the important pages of your website and automatically skip pages that are unimportant or duplicates.
You can check the status of your indexed pages, and how many pages are indexed, in Google Search Console.
How to create a robots.txt file?
Being a plain text file, you can create it using Windows Notepad or any other text editor.
No matter how you ultimately create your robots.txt file, the format stays the same:
User-agent: X (When a program initiates a connection to a web server (whether it is a robot or a standard web browser), it gives basic information on its identity via an HTTP header called “user-agent”.)
Disallow: Y (Directives are rules that you want the declared user-agents to follow.)
User-agent is the specific bot you’re talking to.
Everything that comes after "Disallow:" specifies the pages or directories you want to block.
Here’s an example:
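A sketch of such a rule, assuming your images live in an /images/ folder (the folder name is illustrative):

```
User-agent: Googlebot
Disallow: /images/
```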
This rule would tell Googlebot not to crawl the images folder of your website.
You can also use an asterisk (*) to address all bots that stop by your website.
Here’s an example:
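Using the same illustrative /images/ folder, the wildcard version could be sketched as:

```
User-agent: *
Disallow: /images/
```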
The “*” tells any spiders to NOT crawl your images folder.
This is just one of many ways to use a robots.txt file. This helpful guide from Google has more info on the different rules you can use to block or allow bots from crawling different pages of your site.
Three reasons why you should use robots.txt files
- Restrict the Non-Public Pages: Sometimes there are pages on your website that you don't want indexed: pages that need to exist, but that you don't want random crawlers landing on. In this case, you can use robots.txt to block these pages from search engine crawlers and bots.
- Maximize the crawl budget: If you're having a tough time getting all of your pages indexed, you might have a crawl budget problem. By blocking unimportant pages with robots.txt, you let Googlebot spend more of your crawl budget on the pages that actually matter.
- Prevent Indexing of Resources: Meta directives (pieces of code that give crawlers instructions for how to crawl or index web page content) can work just as well as robots.txt for keeping pages out of the index, but they don't work well for multimedia resources like images and PDFs. For those, robots.txt is the better tool.
Why do you need to block some pages?
There are three main reasons why you might want to block some pages by using robots.txt.
- If a page on your website is a duplicate of another page, you don't want search engines to index it, because duplicate content can hurt your website's SEO.
- If you have a page on your website that you don't want users to be able to access unless they take a specific action.
- You'll also want to block pages or files when you need to protect the private files on your website.
So if you want to tell a bot not to crawl a page such as http://yoursite.com/page/, you disallow its path (/page/) in your robots.txt file.
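A sketch of the corresponding rule, using the /page/ path from the example URL above:

```
User-agent: *
Disallow: /page/
```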
Understanding the limitations of the robots.txt file
Before creating or editing the robots.txt file, you should know the limits of a URL blocking method.
- Robots.txt directives may not be supported by all search engines: The instructions in a robots.txt file cannot enforce crawler behavior on your website; it is up to each crawler to obey them. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not. So if you want to keep your information secure from web crawlers, it's better to use other blocking methods, such as password-protecting the private files on your server.
- Different crawlers interpret syntax differently: Although respectable web crawlers obey the directives in a robots.txt file, each crawler may interpret them differently. You need to know the proper syntax for addressing different types of web crawlers, as some may not understand certain instructions.
- A page that is disallowed in robots.txt can still be indexed if it is linked from other websites: While Google will not crawl content blocked by robots.txt, it may still find and index a disallowed URL if it is linked from other places on the web. Consequently, the URL address, and potentially other publicly available information such as anchor text in links to the page, can still appear in SERPs. To prevent this, password-protect the files on your server, use a noindex meta tag or response header, or remove the page entirely.
The robots.txt file allows you to forbid robots from accessing parts of your website, especially if an area of your site is private or if the content is not essential for search engines. Thus, robots.txt is an essential tool for controlling the crawling of your pages.
Here are a few examples of robots.txt files.
# Example 1: Block only Googlebot
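A sketch consistent with that description:

```
User-agent: Googlebot
Disallow: /
```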
# Example 2: Block Googlebot and Adsbot
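Listing both user agents in one group, this could be sketched as:

```
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
```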
# Example 3: Block all but AdsBot crawlers
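Because AdsBot crawlers must be named explicitly and ignore the general wildcard, blocking everything else can be sketched as:

```
User-agent: *
Disallow: /
```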
Useful robots.txt rules
Here are some common useful robots.txt rules:
**Disallow crawling of the entire website.** Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.
This does not match the various AdsBot crawlers, which must be named explicitly.
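A sketch of the rule described above:

```
User-agent: *
Disallow: /
```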
**Disallow crawling of a directory and its contents.** Append a forward slash to the directory name to disallow the crawling of a whole directory.
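A sketch, with illustrative directory names:

```
User-agent: *
Disallow: /calendar/
Disallow: /junk/
```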
Remember, don’t use robots.txt to block access to private content; use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.
**Allow access to a single crawler.** Only Googlebot-news may crawl the whole site.
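Sketched as two groups, one for Googlebot-news and one for everyone else:

```
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
```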
**Allow access to all but a single crawler.** Unnecessarybot may not crawl the site; all other bots may.
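A sketch of that rule, using the Unnecessarybot name from the description:

```
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /
```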
**Disallow crawling of a single web page.** For example, disallow the useless_file.html page.
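Assuming the page sits at the site root, this could be sketched as:

```
User-agent: *
Disallow: /useless_file.html
```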
**Block a specific image from Google Images.** For example, disallow the dogs.jpg image.
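A sketch addressing Google's image crawler, assuming the image lives in an /images/ folder:

```
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
```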
**Block all images on your site from Google Images.** Google can't index images and videos without crawling them.
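Sketched by disallowing the image crawler everywhere:

```
User-agent: Googlebot-Image
Disallow: /
```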
**Disallow crawling of files of a specific file type.** For example, disallow crawling of all .gif files.
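A sketch using wildcard matching, where * matches any path prefix and $ anchors the match to the end of the URL:

```
User-agent: Googlebot
Disallow: /*.gif$
```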
**Disallow crawling of an entire site, but allow Mediapartners-Google.** This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.
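Sketched as two groups, blocking everyone except the Mediapartners-Google crawler:

```
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
```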