Thursday, October 2, 2014

Robots.txt - The Complete Guide With Examples

Robots.txt is a text file used to stop web crawlers from accessing files and folders on the server that the webmaster does not want to appear in search engines. Without specific instructions, search engine crawlers access every folder and file they can reach on the server, so private information may become public if the proper commands are not specified in the robots.txt file. The file must be placed in the root directory of the site.

WWW robots (also called wanderers, crawlers or spiders) are programs that continuously visit pages on the World Wide Web by recursively retrieving linked pages. Robots.txt is a plain text file containing commands that direct such web robots to crawl, or to stay away from, certain files or folders. It is the first file crawlers read when they access a web server. Note, however, that some web crawlers do not abide by the instructions in the robots.txt file and may crawl private files and folders anyway. In general, though, well-behaved robots follow the commands in the file.

The Robots Exclusion Standard - A Short History

The Robots Exclusion Standard, or Robots Exclusion Protocol, is a set of rules advising web crawlers or robots to ignore certain parts of a website that are restricted from public viewing. Credit for proposing the "Robots Exclusion Protocol" goes to Martijn Koster, who suggested it while working for Nexor around 1994. The "robots.txt" file became popular shortly afterwards, when it was used to restrict the crawlers of Aliweb (one of the earliest search engines).


The Robots Exclusion Standard is not an official standard backed by a standards body, nor is it owned by any commercial organisation. Because no organization governs the protocol, nobody enforces it, and there is no guarantee that all current and future robots will honor it.

The robots.txt file uses 5 major parameters (fields or directives):

A- User-agent - This field holds the name of the robot the rules apply to. For example, if the instructions are for the Google search engine bot, this field will hold the value:

User-agent: googlebot

B- Disallow - This field specifies the names/paths of files and folders which the crawlers must ignore. For example, if the folder "passwords" needs to be disallowed then the following command will be written:

Disallow: /passwords

C- Hash # - This marks a comment. If you wish to add text to the file that crawlers should ignore, start it with the hash sign. For example, the Disallow command above can be annotated like this:

Disallow: /passwords # this line will disallow the folder named passwords

D- Allow - The opposite of Disallow. This permits crawling of the specified file or folder, and is typically used to open up part of a disallowed folder. The line below will allow crawling of the folder named "passwords":

Allow: /passwords

E- Crawl-delay - This parameter will set the number of seconds to wait between successive requests to the same server. For example, if you want the crawlers to wait for 5 seconds, the following command needs to be written:

Crawl-delay: 5

## Note that the Crawl-delay parameter is not part of the original standard and is not honored by every crawler. Google, for example, ignores it.
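
To see how these fields work together, here is a minimal sketch that feeds a small rule set to Python's standard urllib.robotparser module. The bot name "examplebot" and the example.com URLs are made up for illustration, and the standard-library parser is only one interpretation of the rules, so real crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set combining the fields described above.
RULES = """\
User-agent: *
Disallow: /passwords  # keep crawlers out of this folder
Crawl-delay: 5
"""


def load_rules(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load_rules(RULES)

# "examplebot" is an invented crawler name; it falls under "User-agent: *".
secret_ok = parser.can_fetch("examplebot", "https://example.com/passwords/list.txt")
home_ok = parser.can_fetch("examplebot", "https://example.com/index.html")
delay = parser.crawl_delay("examplebot")  # seconds, or None if not set

print(secret_ok, home_ok, delay)
```

The comment after the hash sign is dropped by the parser, so it has no effect on the Disallow rule.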

Where Should You Place the Robots.txt File?

It should be placed in the root folder of your site, so that it is reachable at the path /robots.txt directly under your domain name.

Please note that robots.txt is a publicly available file, so anyone on the web can visit its URL and see the contents of your file.
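
Because the file always sits at the root, its address can be worked out from any page URL on the same host. The sketch below shows one way to do that in Python; the function name robots_url and the example.com addresses are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URLs; each host (including a subdomain) gets its own file.
print(robots_url("https://example.com/shop/cart?id=7"))
print(robots_url("http://blog.example.com/2014/10/post.html"))
```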

When Should You Use Robots.txt?

1- Prevent indexing of an unannounced site.
2- Prevent crawling of an infinite URL space.
3- Prevent indexing of search folder.
4- Block indexing of customer account information.
5- Block all checkout and payment related information.
6- Prevent indexing of duplicate files and folders which do not serve any user purpose.
7- Block crawling of individual user reviews on site.
8- Disallow crawling of widgets and CMS related folders.
9- Disallow accessing the customer cart folder.
10- Prevent indexing of online chats happening on the site, etc.

List of Popular Robots User Agents

sogou spider
Speedy Spider
TweetedTimes Bot
Yahoo! Slurp
Yahoo! Slurp China


Robots.txt Examples

1- To restrict crawling of URLs starting with /banana/cookie/ or /tmp/, and of the file named apple.html

# robots.txt for

User-agent: *
Disallow: /banana/cookie/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /apple.html

2- To exclude all crawlers from the entire server

User-agent: *
Disallow: /

3- To allow crawling of all files and folders

User-agent: *
Disallow:

## You don't need a robots.txt file at all to allow crawling of your entire server, because by default every crawler will access the contents of your server.
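
The difference between blocking everything, allowing everything and having no rules at all can be checked with a short Python sketch. It uses the standard urllib.robotparser module; the helper name can_crawl, the bot name and the URL are invented for the example, and an empty Disallow value is used here as the "allow everything" form.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"  # an empty Disallow blocks nothing
NO_RULES = ""                             # stands in for an empty robots.txt


def can_crawl(robots_text, url, agent="examplebot"):
    """Return True if the (hypothetical) agent may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


PAGE = "https://example.com/any/page.html"  # made-up URL
for label, text in [("block all", BLOCK_ALL), ("allow all", ALLOW_ALL), ("no rules", NO_RULES)]:
    print(label, can_crawl(text, PAGE))
```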

4- To exclude all pages generated dynamically and having the parameter "reply" in them 

user-agent: *
Disallow: /www/cgi-bin/post.cgi?action=reply
# In this case, only URLs that begin with /www/cgi-bin/post.cgi?action=reply will be excluded from crawling.

URLs for the same script with a different action value will still be crawled.
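
A Disallow value without wildcards is matched as a plain prefix of the URL path and query string. The Python sketch below illustrates that idea for this rule; the function name is_blocked and the sample URLs are hypothetical and are not taken from a real site.

```python
RULE = "/www/cgi-bin/post.cgi?action=reply"


def is_blocked(url_path, disallow_prefix=RULE):
    """A rule without wildcards blocks every URL that starts with its value."""
    return url_path.startswith(disallow_prefix)


# Hypothetical URLs for illustration only.
samples = [
    "/www/cgi-bin/post.cgi?action=reply&id=1",  # starts with the rule value
    "/www/cgi-bin/post.cgi?action=edit&id=1",   # different action value
    "/www/cgi-bin/post.cgi",                    # no query string at all
]
for path in samples:
    print(path, "blocked" if is_blocked(path) else "crawled")
```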

5- To disallow the folder named "web" but allow its subfolders named "webone" and "webtwo":

user-agent: *
Disallow: /web/
Allow: /web/webone/
Allow: /web/webtwo/   
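
How overlapping Allow and Disallow lines are resolved depends on the crawler. Google documents that the most specific (longest) matching path wins, while Python's standard urllib.robotparser applies the first matching line instead. The sketch below implements the longest-match interpretation, with Allow winning a tie; the function name is_allowed and the sample paths are assumptions for illustration.

```python
def is_allowed(path, rules):
    """Resolve Allow/Disallow prefixes using the longest-match interpretation.

    rules is a list of (directive, prefix) pairs, for example
    ("disallow", "/web/"). Paths that match no rule are allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [
    ("disallow", "/web/"),
    ("allow", "/web/webone/"),
    ("allow", "/web/webtwo/"),
]

for path in ["/web/webone/page.html", "/web/other.html", "/about.html"]:
    print(path, is_allowed(path, RULES))
```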

6- To enable crawling of the site for googlebot but disallow crawling for Bingbot:

user-agent: googlebot
Allow: /

user-agent: bingbot
Disallow: /
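
Per-crawler groups like these can be tried out with Python's standard urllib.robotparser module. In this sketch the page URL and the extra bot name "examplebot" are invented; a crawler that is not named in any group, and for which no "User-agent: *" group exists, is treated as unrestricted.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: googlebot
Allow: /

User-agent: bingbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

PAGE = "https://example.com/page.html"  # hypothetical URL
google_ok = parser.can_fetch("Googlebot", PAGE)  # agent names match case-insensitively
bing_ok = parser.can_fetch("bingbot", PAGE)
other_ok = parser.can_fetch("examplebot", PAGE)  # not mentioned in the file

print(google_ok, bing_ok, other_ok)
```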

7- To block all URLs starting with the word "froogle" followed by an underscore

User-agent: *
Disallow: /froogle_

8- To block the search folder from crawling:

User-agent: *
Disallow: /search

Should You Block Duplicate Pages Using Robots.txt?

[Video: Matt Cutts on whether to block duplicate pages using robots.txt]

Use of Wildcards in Robots.txt (More Examples)

A wildcard is a special character, such as the asterisk (*), that stands in for any sequence of characters in a path. You can use wildcards to allow or exclude a large set of similar URLs with a single rule.

Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:

* designates 0 or more instances of any valid character
$ designates the end of the URL

1- To block all URLs containing a question mark

user-agent: *
Disallow: /*?

2- To block all URLs starting with /ebooks followed by ? and containing the parameter q followed by =

user-agent: *
Disallow: /ebooks?*q=*

## This will block URLs such as:

/ebooks?q=parameter

but will allow URLs such as:

/pamper?q=parameter

3- To exclude all URLs that end with .jpeg

User-agent: Googlebot
Disallow: /*.jpeg$

4- To exclude all URLs that end with .gif

User-agent: Googlebot
Disallow: /*.gif$
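
The two wildcard characters can be understood by translating a rule into a regular expression. The Python sketch below does this under the meanings given above (* for any run of characters, a trailing $ for the end of the URL); the function names are made up, and it is an illustration rather than the matcher any search engine actually uses. Note that Python's standard urllib.robotparser does not interpret these wildcards.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and a trailing $ into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(rule, url_path):
    """Return True if url_path (path plus query string) is covered by rule."""
    return rule_to_regex(rule).match(url_path) is not None


# Sample paths are hypothetical, apart from the two quoted in example 2.
print(rule_matches("/*?", "/page.html?id=1"))
print(rule_matches("/ebooks?*q=*", "/ebooks?q=parameter"))
print(rule_matches("/ebooks?*q=*", "/pamper?q=parameter"))
print(rule_matches("/*.jpeg$", "/images/photo.jpeg"))
print(rule_matches("/*.jpeg$", "/images/photo.jpeg?size=large"))
```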

How to Create Your File?

The best option is to open Notepad (or any plain text editor) and type the instructions directly into it. Then save the file under the name robots. Notepad adds the .txt extension automatically, so you don't need to type robots.txt as the name while saving. After the file is created, upload it to the root folder of your site so that it can be fetched at the path /robots.txt under your domain.

Another way is to use one of the online tools that generate robots.txt files.
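
If you prefer to generate the file with a script, a few lines of Python are enough. This is only a sketch: the function names build_robots_txt and write_robots_txt, the input format and the sample rules are assumptions made for the example.

```python
def build_robots_txt(groups):
    """Build robots.txt text from (user_agent, [(directive, path), ...]) pairs."""
    lines = []
    for user_agent, rules in groups:
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # a blank line ends each group
    return "\n".join(lines)


def write_robots_txt(text, path="robots.txt"):
    """Save the generated text; upload the file to the site root afterwards."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical rules for illustration.
text = build_robots_txt([
    ("*", [("Disallow", "/search"), ("Disallow", "/tmp/")]),
])
print(text)
```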

How to Test Your File?

The best way is to log in to your webmaster tools account and edit the robots.txt file there. You can preview the changes and see whether the file works as intended.

Test your robots.txt Using Google Webmaster Tools:

  1. From the Webmaster Tools Home page, choose the site whose robots.txt file you want to test.
  2. Select the robots.txt Tester tool from the Crawl heading.
  3. Make changes to your live robots.txt with the help of the text editor.
  4. Correct the syntax warnings and logic errors if shown.
  5. Type in an extension of the URL or path in the text box at the bottom of the page.
  6. Select the user-agent you want to test.
  7. Click the TEST button next to the user-agent dropdown list to run the simulation.
  8. Check whether the TEST button now reads ACCEPTED or BLOCKED to find out if the URL you entered is blocked from Google web crawlers.

Alternatively, several online tools can help you check your robots.txt file.
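
A draft file can also be checked locally before uploading it. The Python sketch below runs a list of URLs against a draft using the standard urllib.robotparser module and labels each one; the helper name check_urls, the bot name and the URLs are invented. The standard-library parser does not interpret the * and $ wildcards, so treat the result as a rough check rather than a substitute for the search engine's own tester.

```python
from urllib.robotparser import RobotFileParser


def check_urls(robots_text, user_agent, urls):
    """Return a dict mapping each URL to "ALLOWED" or "BLOCKED"."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {
        url: "ALLOWED" if parser.can_fetch(user_agent, url) else "BLOCKED"
        for url in urls
    }


# Hypothetical draft and test URLs.
DRAFT = "User-agent: *\nDisallow: /search\n"
report = check_urls(
    DRAFT,
    "examplebot",
    [
        "https://example.com/search/results.html",
        "https://example.com/about.html",
    ],
)
for url, verdict in report.items():
    print(verdict, url)
```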

Can You Use Robots.txt to Optimize Googlebot's Crawl?

[Video: Matt Cutts on using robots.txt to optimize Googlebot's crawl]

Limitations of Robots.txt

  • Despite clear instructions, the protocol is advisory, meaning that crawlers may or may not follow it. It is like a rule book: good robots will read and follow the instructions, while bad ones will ignore them.
  • Robots.txt is a public file, so every instruction specified in it is public. This means the names of all your secret folders are disclosed and available to hackers.
  • There is no official standards body that guides the usage of the robots.txt protocol.

Noindex Meta Tag vs Robots.txt

Webmasters can use either the robots.txt file or the noindex meta tag to keep URLs out of search results. Where robots.txt tells robots not to crawl specific URLs, the noindex meta tag tells them not to index the page it appears on. The noindex meta tag must be added individually to each page.

An example is provided below:

<meta name="robots" content="noindex" />

## This tag goes in the head section of every page that needs to be blocked.
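
To confirm that a page really carries the tag, its HTML can be scanned for a robots meta element. The Python sketch below does this with the standard html.parser module; the class and function names and the sample markup are assumptions for illustration, and it only looks at the generic "robots" name, not crawler-specific meta names.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Record whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        # The content value may list several directives separated by commas.
        directives = [part.strip() for part in content.split(",")]
        if name == "robots" and "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML text includes a robots noindex meta tag."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return finder.noindex


# Made-up page markup.
PAGE = '<html><head><meta name="robots" content="noindex" /></head><body></body></html>'
print(has_noindex(PAGE))
```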

Things to Keep in mind:

1- The noindex meta tag tells crawlers not to index the contents of the page. Crawlers will read the page and pass link juice, if any, but will not index the page.

2- With the robots.txt file, there is a chance that the URL still gets indexed (for example, because other pages link to it) and is displayed in the search results without a snippet, because the page content remains blocked from the search engines.

3- The best way to remove URLs from the Google index is to use noindex meta tags or the Google URL removal tool.

4- Each "Disallow:" line in the robots.txt file can hold only one URL path. Use a separate line for each path you want to block.

5- If your site has several subdomains, each subdomain requires its own robots.txt file.
