Creating a robots.txt file
By Sumantra Roy

Some people believe that they should create different pages for different search engines, each page optimized for one keyword and for one search engine. Now, while I don't recommend that people create different pages for different search engines, if you do decide to create such pages, there is one issue that you need to be aware of.

These pages, although optimized for different search engines, often turn out to be very similar to each other. The search engines can now detect when a site has created such similar-looking pages, and they are penalizing or even banning such sites. To prevent your site from being penalized for spamming, you need to stop each search engine's spider from indexing pages that are not meant for that engine, i.e. you need to prevent AltaVista from indexing pages meant for Excite and vice versa. The best way to do that is to use a robots.txt file.

You should create the robots.txt file as plain text, using a text editor like Windows Notepad. Don't use your word processor to create such a file, because it may save formatting codes along with the text.

Here is the basic syntax of the robots.txt file:

User-Agent: [Spider Name]
Disallow: [File Name]

For instance, to tell AltaVista's spider, Scooter, not to spider the file named myfile1.html residing in the root directory of the server, you would write

User-Agent: Scooter
Disallow: /myfile1.html

To tell Excite's spider, called ArchitextSpider, not to spider the files myfile2.html and myfile3.html, you would write

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
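
If you want to see how a parser reads a record like this one, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes that Python's parser is a fair stand-in for a search engine's spider, which is only an approximation, since each engine runs its own code. The path /index.html is a hypothetical page added purely for comparison.

```python
from urllib.robotparser import RobotFileParser

# The Excite record from the article, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() returns True when the named spider may request the path.
excite_myfile2 = parser.can_fetch("ArchitextSpider", "/myfile2.html")
excite_other = parser.can_fetch("ArchitextSpider", "/index.html")  # hypothetical page
scooter_myfile2 = parser.can_fetch("Scooter", "/myfile2.html")

print(excite_myfile2, excite_other, scooter_myfile2)
```

With this record, only ArchitextSpider is kept away from myfile2.html. A spider that has no record of its own, such as Scooter here, is not restricted at all.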

You can, of course, put multiple User-Agent statements in the same robots.txt file. Hence, to tell AltaVista not to spider the file named myfile1.html, and to tell Excite not to spider the files myfile2.html and myfile3.html, you would write

User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
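
The combined file can be checked the same way. The sketch below, again assuming Python's urllib.robotparser as an approximation of a real spider, lists which of the three files each spider is blocked from. The helper name blocked_files is my own and is not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Two records, separated by a blank line, as in the article.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
"""

PATHS = ["/myfile1.html", "/myfile2.html", "/myfile3.html"]


def blocked_files(robots_txt, spider, paths):
    """Return the paths that the given spider is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(spider, path)]


scooter_blocked = blocked_files(ROBOTS_TXT, "Scooter", PATHS)
excite_blocked = blocked_files(ROBOTS_TXT, "ArchitextSpider", PATHS)

print("Scooter:", scooter_blocked)
print("ArchitextSpider:", excite_blocked)
```

Each spider obeys only its own record, so the blank line that separates the two records matters.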

If you want to prevent all robots from spidering the file named myfile4.html, you can use the * wildcard character in the User-Agent line, i.e. you would write

User-Agent: *
Disallow: /myfile4.html

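However, under the original robots.txt standard you cannot use the wildcard character in the Disallow line. Some spiders accept it there as an extension, but you should not rely on every search engine doing so.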
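
One point to watch when you combine a wildcard User-Agent record with records for named spiders: under the original standard, the * record applies only to robots that did not match any other record. The sketch below illustrates this with Python's urllib.robotparser, which I am assuming behaves like a standards-following spider. The spider name Gulliver is used here simply as an example of a robot without its own record.

```python
from urllib.robotparser import RobotFileParser

# A named record for Scooter plus a wildcard record for everyone else.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: *
Disallow: /myfile4.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Gulliver has no record of its own, so the * record applies to it.
gulliver_myfile4 = parser.can_fetch("Gulliver", "/myfile4.html")

# Scooter matches its own record, so the * record is ignored for it.
scooter_myfile4 = parser.can_fetch("Scooter", "/myfile4.html")
scooter_myfile1 = parser.can_fetch("Scooter", "/myfile1.html")

print(gulliver_myfile4, scooter_myfile4, scooter_myfile1)
```

In other words, if a file must be hidden from all robots and you also have records for individual spiders, the Disallow line for that file needs to be repeated in each of those records.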

Once you have created the robots.txt file, you should upload it to the root directory of your domain. Uploading it to any sub-directory won't work - the robots.txt file needs to be in the root directory.
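
To make the location rule concrete, the following sketch works out where a spider would look for robots.txt given the address of any page on the site. The domain www.example.com and the sub-directory /travel/ are placeholders of mine, not addresses from this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page living in a sub-directory of a hypothetical domain.
robots_url = robots_txt_url(
    "http://www.example.com/travel/tourism-in-australia-al.html"
)
print(robots_url)
```

Whatever sub-directory the page lives in, the result always points at the root of the domain, which is why a robots.txt file uploaded anywhere else is never read.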

I won't discuss the syntax and structure of the robots.txt file any further - you can get the complete specifications from http://www.robotstxt.org/wc/norobots.html

Now we come to how the robots.txt file can be used to prevent your site from being penalized for spamming if you are creating different pages for different search engines. What you need to do is to prevent each search engine from spidering pages which are not meant for it.

For simplicity, let's assume that you are targeting only two keywords: "tourism in Australia" and "travel to Australia". Also, let's assume that you are targeting only four of the major search engines: AltaVista, Excite, HotBot and Northern Light.

Now, suppose you have adopted the following convention for naming the files: each page is named by joining the individual words of the keyword for which the page is being optimized with hyphens. To this is added a hyphen and the first two letters (in lowercase) of the name of the search engine for which the page is being optimized, followed by the .html extension.

Hence, the files for AltaVista are

tourism-in-australia-al.html
travel-to-australia-al.html

The files for Excite are

tourism-in-australia-ex.html
travel-to-australia-ex.html

The files for HotBot are

tourism-in-australia-ho.html
travel-to-australia-ho.html

The files for Northern Light are

tourism-in-australia-no.html
travel-to-australia-no.html
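
The naming convention used for the eight files above can be written down as a small function. This is only a sketch of the convention as described in this article; the function name page_name is my own.

```python
KEYWORDS = ["tourism in Australia", "travel to Australia"]
ENGINES = ["AltaVista", "Excite", "HotBot", "Northern Light"]


def page_name(keyword, engine):
    """Join the keyword's words with hyphens and add a two-letter engine code."""
    words = keyword.lower().split()
    code = engine.lower()[:2]  # first two letters of the engine name
    return "-".join(words + [code]) + ".html"


# Build the list of files for each engine, in the order used in the article.
files_by_engine = {
    engine: [page_name(keyword, engine) for keyword in KEYWORDS]
    for engine in ENGINES
}

for engine, files in files_by_engine.items():
    print(engine, files)
```

Note that a two-letter code only works while no two of your target engines begin with the same two letters.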

As I noted earlier, AltaVista's spider is called Scooter and Excite's spider is called ArchitextSpider.

A list of spiders for the major search engines can be found at http://www.searchenginewatch.com/webmasters/spiderchart.html

From this list, we find that the spider for Northern Light is called Gulliver. HotBot uses Inktomi and Inktomi's spider is called Slurp. Using this knowledge, here's what the robots.txt file should contain:

User-Agent: Scooter
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: ArchitextSpider
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: Gulliver
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html

When you put the above lines in the robots.txt file, you instruct each search engine not to spider the files meant for the other search engines.
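
Typing out 24 Disallow lines by hand invites mistakes, so you may prefer to generate the file. The sketch below rebuilds the robots.txt shown above from the two keywords and the four engine and spider names given in this article. The function names page_name and build_robots_txt are my own, and the script assumes the file naming convention described earlier.

```python
KEYWORDS = ["tourism in Australia", "travel to Australia"]

# (search engine, spider name) pairs, in the order used in the article.
ENGINES = [
    ("AltaVista", "Scooter"),
    ("Excite", "ArchitextSpider"),
    ("HotBot", "Slurp"),
    ("Northern Light", "Gulliver"),
]


def page_name(keyword, engine):
    """Hyphenated keyword plus the first two letters of the engine name."""
    return "-".join(keyword.lower().split() + [engine.lower()[:2]]) + ".html"


def build_robots_txt(keywords, engines):
    """Create one record per spider that disallows every other engine's pages."""
    records = []
    for engine, spider in engines:
        lines = [f"User-Agent: {spider}"]
        for other_engine, _ in engines:
            if other_engine == engine:
                continue  # a spider may read the pages written for its own engine
            for keyword in keywords:
                lines.append(f"Disallow: /{page_name(keyword, other_engine)}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


robots_txt = build_robots_txt(KEYWORDS, ENGINES)
print(robots_txt)
```

If you later add a keyword or a search engine, you only have to extend the two lists at the top and run the script again.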

When you have finished creating the robots.txt file, double-check to ensure that you have not made any errors anywhere in it. A small error can have disastrous consequences - a search engine may spider files which are not meant for it, in which case it can penalize your site for spamming, or it may not spider any files at all, in which case you won't get top rankings in that search engine.

A useful tool to check the syntax of your robots.txt file can be found at http://www.tardis.ed.ac.uk/~sxw/robots/check/. While it will help you correct syntactical errors in the robots.txt file, it won't help you correct any logical errors, for which you will still need to go through the robots.txt file thoroughly, as mentioned above.
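
A logical check of this kind can also be partly automated. The sketch below is not the tool mentioned above; it is a rough illustration that uses Python's urllib.robotparser (assumed here to approximate how a spider reads the file) to confirm that each spider can reach its own pages and none of the others. The sample file covers only two engines and contains a deliberate mistake in its last line. The function name find_logic_errors is my own.

```python
from urllib.robotparser import RobotFileParser

# Deliberately flawed: the last line should end in -al.html, not -ex.html.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html

User-Agent: ArchitextSpider
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-ex.html
"""

# Which pages are meant for which spider.
PAGES_BY_SPIDER = {
    "Scooter": ["/tourism-in-australia-al.html", "/travel-to-australia-al.html"],
    "ArchitextSpider": ["/tourism-in-australia-ex.html", "/travel-to-australia-ex.html"],
}


def find_logic_errors(robots_txt, pages_by_spider):
    """Return (spider, page, problem) tuples for every rule that looks wrong."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    errors = []
    for spider in pages_by_spider:
        for owner, pages in pages_by_spider.items():
            for page in pages:
                allowed = parser.can_fetch(spider, page)
                if owner == spider and not allowed:
                    errors.append((spider, page, "own page is blocked"))
                elif owner != spider and allowed:
                    errors.append((spider, page, "other engine's page is allowed"))
    return errors


errors = find_logic_errors(ROBOTS_TXT, PAGES_BY_SPIDER)
for error in errors:
    print(error)
```

Here the single wrong line produces two problems at once: Excite's spider is allowed to read a page meant for AltaVista, and it is blocked from one of its own pages.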

Article by Sumantra Roy. Sumantra is one of the most respected search engine positioning specialists on the Internet. To have Sumantra's company place your site at the top of the search engines, go to http://www.1stSearchRanking.com/t.cgi?1427. For more advice on how you can take your web site to the top of the search engines, subscribe to his FREE newsletter by going to http://www.1stSearchRanking.com/t.cgi?1427&newsletter.htm







Copyright © 2000-2008, Answers 2000 Limited
All third party content and adverts are copyright of their respective owners.
