Search Engine Spiders

Search Engine Optimization Aug 23, 2005

Robots.txt Signpost Warns Search Engine Spiders From Private Property

The robots.txt file is the standard way to tell web crawlers and robots which files and directories you want them to stay out of on your site.

Not all crawlers and bots follow the exclusion standard; some will continue to crawl your site anyway. I like to call them “Bad Bots” or trespassers. We block them by IP exclusion, which is another story entirely.

This is a straightforward overview of the basics of robots.txt for web administrators. For a complete and thorough lesson, visit Robotstxt.org.

That file should sit at the root of the domain, because that is where the crawlers expect to find it, not in some secondary directory.

Below is the proper format for a somewhat standard robots.txt file —–>

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

——–> End of robots.txt file

This tiny text file is saved as a plain text document and is always named “robots.txt” in the root of your domain.

A quick review of the information in the robots.txt file above: the “User-agent: msnbot” entry is for MSN’s crawler, Slurp is Yahoo!’s, and Teoma is Ask Jeeves’. The others listed are “Bad” bots that crawl very fast and to nobody’s benefit but their own, so we ask them to stay out entirely. The asterisk (*) is a wildcard meaning “all” crawlers, spiders, and bots should stay out of the group of files or directories listed.

The bots given the instruction “Disallow: /” are being told to stay out of the site entirely.
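To make that concrete, here is a minimal sketch (not from the original article) using Python’s standard urllib.robotparser module to test how rules like those above are interpreted; the example.com URLs and the trimmed-down rule set are placeholders for illustration.

# A small test of how robots.txt rules are interpreted, using only the
# Python standard library. The rule set is an abbreviated version of the
# file above, and example.com is a placeholder domain.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: psbot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Googlebot falls under the wildcard (*) entry: blocked from the listed
# directories, allowed everywhere else.
print(parser.can_fetch("Googlebot", "https://example.com/images/logo.gif"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/articles/page.html"))  # True

# psbot was given "Disallow: /", so it is asked to stay out entirely.
print(parser.can_fetch("psbot", "https://example.com/articles/page.html"))      # False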

Those with “Crawl-delay: 10” are bots that crawled our site too quickly, causing it to bog down and overuse server resources. Google crawls more slowly than the others and doesn’t require that instruction, so it is not explicitly listed in the robots.txt file above. The crawl-delay instruction is only needed on huge sites with hundreds or thousands of pages. The wildcard asterisk (*) applies to all crawlers, bots, and spiders, including Googlebot.

Those we provided the “Crawl-delay: 10” instruction to were requesting as many as seven pages every second, so we asked them to slow down.

The number you see is seconds, and you can change it to suit your server capacity, based on their crawling rate. A ten-second delay between page requests is far more leisurely and prevents them from requesting more pages than your server can handle.
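For what it’s worth, the same standard-library parser (Python 3.6 or newer) will also report the Crawl-delay value a given user agent would see; this is just a sketch using the agent names and the ten-second value from the file above.

# Reading the Crawl-delay value for a given user agent with Python's
# urllib.robotparser. Agent names and the 10-second value come from the
# robots.txt file shown earlier.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""
User-agent: msnbot
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10
""".splitlines())

print(parser.crawl_delay("msnbot"))     # 10 -> at most one page request every ten seconds
print(parser.crawl_delay("Googlebot"))  # None -> no delay has been requested for this agent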

(You can discover how fast robots and spiders are crawling by looking at your raw server logs, which show pages requested by precise times to within a hundredth of a second; they are available from your web host, or ask your web or IT person. Your server logs can usually be found in the root directory, and if you have server access you can often download compressed server log files by calendar day directly from your server. You’ll need a utility that can expand compressed files to open and read these plain text, raw server log files.)
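As a rough illustration of that kind of log reading (not something from the original article), a short script can tally requests per user agent per second; the file name access.log and the Apache-style combined log format are assumptions about your particular hosting setup.

# Counting page requests per user agent per second from an Apache-style
# combined access log, to spot bots crawling too aggressively.
# The file name "access.log" and the log format are assumptions.
import re
from collections import Counter

LINE = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

hits = Counter()
with open("access.log") as log:
    for line in log:
        match = LINE.search(line)
        if match:
            # Bucket by (user agent, timestamp-to-the-second).
            hits[(match.group("agent"), match.group("time"))] += 1

# Show the heaviest single-second bursts, worst first.
for (agent, when), count in hits.most_common(10):
    print(f"{count:3d} requests  {when}  {agent}")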

You can also see the contents of any website’s robots.txt file by adding /robots.txt to the end of its domain name in your browser.
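A quick, hedged example of doing the same check from a script; example.com is only a placeholder domain.

# Fetching a site's robots.txt straight from the root of its domain with the
# Python standard library. Replace example.com with the site you want to check.
from urllib.request import urlopen

with urlopen("https://example.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))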

The robots.txt shown above is what we currently use at Publish101 Web Content Distributor, which was launched in May 2005. We did an extensive case study and published a series of articles on crawler behavior and indexing delays known as the Google Sandbox. The Google Sandbox Case Study is highly instructive on many levels for web administrators everywhere about the importance of this often-ignored little text file.

One thing we didn’t expect to learn from that research on indexing delays, known as the Google Sandbox, was how important robots.txt files are to quick and efficient crawling by the spiders from the major search engines. Nor did we expect the number of heavy crawls from bots that do no earthly good for the site owner, yet crawl most sites extensively and heavily, straining servers to the breaking point with requests coming as fast as seven pages per second.

We discovered during the launch of our new site that Google and Yahoo will crawl the site regardless of whether you use a robots.txt file.

However, MSN seems to require it before they will begin crawling at all. All of the search engine robots seem to request the file on a regular basis to verify that it hasn’t changed.

Then, when you do change it, they will stop crawling for brief periods and repeatedly ask for the robots.txt file during that time, without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of, and must now adjust their crawling schedule to eliminate those files from their list.)

Most webmasters instruct bots to stay out of “image” directories and the “cgi-bin” directory, as well as any directories containing private or proprietary files intended only for users of an intranet or password-protected sections of your site. Clearly, you should direct the bots to stay out of any private areas that you don’t want indexed by the search engines.

Average web admins rarely discuss the importance of robots.txt

I’ve even had some of my clients’ web admins ask me what it is and how to implement it, even after I’ve explained how important it is to both site security and efficient crawling by search engines. This should be standard knowledge among web admins at substantial companies, which illustrates how little attention is paid to robots.txt.

The search engine spiders really do want your guidance, and this tiny text file is the best way to provide it. It gives crawlers and bots a clear signpost to warn off trespassers and protect private property, and it warmly welcomes invited guests, such as the big three search engines, while asking them nicely to stay out of private areas.

Posted by Mike Banks Valentine


