What is a Robot
When you submit a query to a search engine such as Google, you get a very long list of websites that provide information related to your query. Have you ever wondered how search engines collect this list? To find information on the hundreds of millions of web pages that exist, a search engine employs an automated script or program that crawls the web and collects data from websites. This automated program is called a robot, spider, or crawler.
Search engines usually try to crawl every page of a website, but some websites contain pages that do not need to be indexed. If you do not want a page or directory to be indexed, you must tell the search engines so. This can be done in two ways:
1. Robot Meta Tag
2. Robots.txt file
Robot Meta Tag
The HTML <META> tag tells robots not to index the content of a page, and/or not to scan it for links to follow.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
There are two important considerations when using the robots <META> tag:
- Robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
- The NOFOLLOW directive applies only to links on this page. It’s entirely likely that a robot will find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.
Don’t confuse this NOFOLLOW with the rel="nofollow" link attribute.
Here’s a list of the values you can specify within the “content” attribute of this tag:
|Value|Meaning|
|---|---|
|index|Allows indexing of the page.|
|noindex|Disallows indexing of the page.|
|follow|Instructs the crawler to follow the links contained within the page.|
|nofollow|Disallows following of the links on that specific page.|
|none|Disallows both indexing of the page and following of its links.|
<meta name="robots" content="noindex,nofollow" />
<meta name="robots" content="noindex,follow" />
<meta name="robots" content="index,nofollow" />
<meta name="robots" content="none">
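To make the effect of these values concrete, here is a minimal Python sketch of how a well-behaved robot might read the tag. The class and function names (`RobotsMetaParser`, `robots_policy`) are made up for illustration, and the sketch assumes that a page without a robots meta tag may be both indexed and followed. Real crawlers may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (may_index, may_follow) for a page.

    Assumption: with no robots meta tag, both are allowed.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
    print(robots_policy(page))
```

Matching is case-insensitive here, so the upper-case form of the tag shown earlier is treated the same way as the lower-case examples.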
Robots.txt file
Robots.txt is a plain text file that has special meaning to the majority of search engines on the web. By defining a few rules in this file, you can instruct robots not to crawl certain files or directories within your site, or the site as a whole. For example, you may not want Google to crawl the /images directory of your site, because doing so is both meaningless to you and a waste of your site’s bandwidth. Robots.txt lets you tell Google just that.
You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).
How to create robots.txt file
- Create a text file named exactly “robots.txt”.
- Write the rules in it.
- Save this file in the root directory of your website, not in a subdirectory.
The format of the file is very simple. Each entry uses two directives:
- User-agent: the robot the following rule applies to
- Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
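The entry structure described above can be sketched in a few lines of Python. This is only an illustration: `parse_entries` is a hypothetical helper, the sample rules and paths are placeholders, and it handles just the User-agent and Disallow lines discussed here.

```python
def parse_entries(robots_txt):
    """Group robots.txt lines into entries of (user_agents, disallowed_paths)."""
    entries = []
    agents, paths = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if paths:
                # A User-agent line that follows Disallow lines starts a new entry.
                entries.append((agents, paths))
                agents, paths = [], []
            agents.append(value)
        elif field == "disallow" and agents:
            paths.append(value)
    if agents:
        entries.append((agents, paths))
    return entries


# Placeholder rules: one entry with two user-agents and two Disallow lines,
# followed by a second entry for all other robots.
SAMPLE = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/
Disallow: /tmp/

User-agent: *
Disallow: /cgi-bin/
"""

if __name__ == "__main__":
    for user_agents, disallowed in parse_entries(SAMPLE):
        print(user_agents, disallowed)
```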
- To block the entire site:
- To block a directory and everything in it:
- To block a page:
- To remove a specific image from Google Images:
- To remove all images on your site from Google Images:
- To block files of a specific file type (for example, .gif), use the following:
- To exclude all robots from part of the server:
- To exclude all files except one:
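Several of the cases listed above can be tried out with Python's standard `urllib.robotparser` module. The rule sets below are a sketch, not the original examples: every path (`/junk-directory/`, `/private_file.html`, `/images/dogs.jpg`, and so on) is a placeholder, and `allowed` is a hypothetical helper. The file-type case is left out because it relies on wildcard patterns, which, as far as I know, the standard-library parser does not interpret; the "all files except one" case is also omitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets, one per case. All paths are placeholders.
RULES = {
    "entire site": "User-agent: *\nDisallow: /",
    "directory": "User-agent: *\nDisallow: /junk-directory/",
    "page": "User-agent: *\nDisallow: /private_file.html",
    "one image": "User-agent: Googlebot-Image\nDisallow: /images/dogs.jpg",
    "all images": "User-agent: Googlebot-Image\nDisallow: /",
    "part of the server": "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/",
}


def allowed(case, agent, path):
    """Return True if `agent` may fetch `path` under the rules for `case`."""
    parser = RobotFileParser()
    parser.parse(RULES[case].splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    checks = [
        ("entire site", "AnyBot", "/index.html"),
        ("directory", "AnyBot", "/junk-directory/page.html"),
        ("page", "AnyBot", "/private_file.html"),
        ("one image", "Googlebot-Image", "/images/dogs.jpg"),
        ("one image", "Googlebot", "/images/dogs.jpg"),
        ("part of the server", "AnyBot", "/index.html"),
    ]
    for case, agent, path in checks:
        verdict = "allowed" if allowed(case, agent, path) else "blocked"
        print(f"{case}: {agent} {path} -> {verdict}")
```

Note that the image rules name Googlebot-Image as the user-agent, so they affect only that robot; other robots are not covered by those entries.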
Save your robots.txt file in the root directory of your website.
Test a robots.txt file
The Test robots.txt tool will show you if your robots.txt file is accidentally blocking Googlebot from a file or directory on your site, or if it’s permitting Googlebot to crawl files that should not appear on the web. When you enter the text of a proposed robots.txt file, the tool reads it in the same way Googlebot does, and lists the effects of the file and any problems found.
Test a site’s robots.txt file:
- On the Google Webmaster Tools Home page, click the site you want.
- Under Health, click Blocked URLs.
- If it’s not already selected, click the Test robots.txt tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the site to test against.
- In the User-agents list, select the user-agents you want.
Any changes you make in this tool will not be saved. To save any changes, you’ll need to copy the contents and paste them into your robots.txt file.
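If you want a rough local check before pasting anything into the tool, the same idea can be approximated with Python's standard `urllib.robotparser` module. This sketch is not how Googlebot itself evaluates the file, and its results may differ from the tool's; `check_urls`, the proposed rules, and the example URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def check_urls(robots_txt, urls, user_agents):
    """Return {(user_agent, url): True if that agent may fetch that URL}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in user_agents
        for url in urls
    }


# A proposed robots.txt (placeholder rules): Googlebot is kept out of
# /private/ only, and every other robot is kept out of the whole site.
proposed = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

report = check_urls(
    proposed,
    urls=[
        "http://www.example.com/",
        "http://www.example.com/private/a.html",
    ],
    user_agents=["Googlebot", "OtherBot"],
)

if __name__ == "__main__":
    for (agent, url), ok in report.items():
        print(f"{agent:10} {'allowed' if ok else 'blocked':8} {url}")
```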
Difference between Robot Meta Tag & robots.txt
- The robots META tag serves much the same purpose as the robots.txt file, but it is not as reliable: not all robots honour the robots meta tag. Use it if your site is in a subdirectory such as www.yourdomain.com/users/mypage/ and you can’t get the server administrator to make changes to the robots.txt file.
- When you block URLs in robots.txt, Google may still show those pages in its results if they are linked from somewhere else on the web. A better way to keep a particular page out of the index completely is to use a robots noindex meta tag on a per-page basis. The crawler has to fetch the page to see the tag, so that page should not also be blocked in robots.txt.