Why do you need a robots.txt file?
The robots.txt file contains instructions that search robots follow when crawling a website. From robots.txt, crawlers learn which sections of the site, page types, or specific pages should not be crawled. With this file you keep content you do not want shown in search results out of the search engines' index, and you can also prevent the indexing of duplicate content.
If you use robots.txt incorrectly, it can cost you dearly. An erroneous crawling ban will exclude important sections, pages, or even the entire site from the index. In that case, it is difficult to count on successful website promotion.
How to work with robots.txt file
The robots.txt text file contains instructions for search engine robots. It is usually used to block crawling of service sections of the site, duplicate content, or publications that are not intended for the entire audience.
If you do not need to block any content from scanning, you can leave the robots.txt empty or in its default state. In this case, the entry in the file will look like this:
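```txt
User-agent: *
Disallow:
```

A Disallow line with no value places no restrictions, so this entry is equivalent to an empty file: everything may be crawled.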
If for some reason you are going to completely block the site for search robots, the entry in the file will look like this:
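```txt
User-agent: *
Disallow: /
```

The single slash matches every URL on the site, so all compliant crawlers are locked out.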
To use robots.txt correctly, you need to understand the levels at which directives can be applied:
- Page level. In this case, the directive looks like this: Disallow: /blog.html.
- Folder level. At this level, directives are written like this: Disallow: /example-folder/.
- Content type level. For example, if you do not want robots to index .pdf files, use the following directive: Disallow: /*.pdf.
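Put together, a file that uses all three levels might look like this (the paths here are illustrative):

```txt
User-agent: *
Disallow: /blog.html
Disallow: /example-folder/
Disallow: /*.pdf
```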
Keep in mind the most common errors encountered when compiling robots.txt:
● Complete ban on site indexing by search robots
In this case, the directive looks like this:
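```txt
User-agent: *
Disallow: /
```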
Why create a site if you do not allow search engines to crawl it? This directive is only appropriate while a resource is under development or being reworked.
● Ban on crawling indexed content
For example, a webmaster may prohibit scanning folders with videos and images:
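```txt
User-agent: *
Disallow: /images/
Disallow: /videos/
```

The folder names here are illustrative; any entry of this shape hides that content from search.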
It is difficult to imagine a situation in which a ban on crawling indexed content would be justified. Typically, such actions deprive the site of possible traffic.
● Using the Allow directive unnecessarily
This action makes no sense. Search engines crawl all available content by default. With robots.txt you can block scanning, but you do not need to explicitly allow anything to be indexed.
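A redundant entry of this kind might look like the following (illustrative example):

```txt
User-agent: *
Allow: /
```

This changes nothing: in the absence of a matching Disallow rule, crawling is already permitted by default.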
Robots.txt file validation tool
Google offers a free tool for checking and validating the robots.txt file. It is available in the webmaster dashboard at https://www.google.com/webmasters/tools/robots-testing-tool
This tool solves the following tasks:
- Displays the current version of the robots.txt file.
- Lets you edit and verify the robots.txt file directly in the webmaster panel.
- Shows old versions of the file.
- Checks for blocked URLs.
- Displays robots.txt error messages.
If Google does not index individual pages or entire sections of your site, the tool will help you check in a few seconds whether this is due to robots.txt file errors.
You can make changes to robots.txt and check its correctness. To do this, just specify the URL you are interested in and click the "Check" button.
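Outside Search Console, the same kind of check can be scripted with Python's standard library. This sketch feeds a rule set directly to `urllib.robotparser`; the rules and URLs are placeholder examples, not a real site:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Instead of fetching robots.txt over the network, parse rules directly.
parser.parse([
    "User-agent: *",
    "Disallow: /example-folder/",
])

# can_fetch() reports whether a given crawler may request a URL.
print(parser.can_fetch("*", "https://example.com/example-folder/page.html"))  # False: blocked
print(parser.can_fetch("*", "https://example.com/blog.html"))  # True: allowed
```

To check rules on a live site, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parser.parse()`.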
Google's John Mueller recommends that all site owners check their robots.txt file with this tool. According to him, a few seconds spent on the check can reveal critical errors that prevent Google's crawlers from accessing the site.
How to use it correctly?
You need to understand the practical purpose of the robots.txt file: it restricts search engines' access to the site. If you want to prevent robots from crawling a page, a section of the site, or a type of content, add the appropriate directive to robots.txt. Verify that the file is used correctly with the tool available in Google Search Console. This helps you quickly detect and fix errors and make the necessary changes to robots.txt.