
Indexing by Google & Other Search Engines

Search Engine Optimization Aug 01, 2008

Process of Website Indexing by Google & other Search Engines

There is a lot of speculation about how search engines index websites.

The exact workings of the search engine indexing process are shrouded in mystery since most search engines offer limited information to web admins about how they architect the indexing process.

Web admins get some clues about crawler visits by checking their log reports, but they cannot tell how the indexing happens or which pages of their website were really crawled.

While the speculation about the search engine indexing process may continue, here is a theory, based on experience, research, and clues, about how they may be going about indexing 8 to 10 billion web pages every so often, and why newly added pages are delayed in showing up in their index.

This discussion is centered around Google, but we believe that most popular search engines like Yahoo and MSN follow a similar pattern.

Google runs from Internet Data Centers (IDCs)

Google has over 200 (some think “over 100”) crawlers/bots scanning the web each day. These do not necessarily follow an exclusive pattern, which means different crawlers may visit the same site on the same day without knowing other crawlers have been there before.

This is what probably gives a “daily visit” record in your traffic log reports, keeping webmasters very happy about their frequent visits.

Some crawlers’ only job is to grab new URLs (let’s call them “URL Grabbers” for convenience). The URL grabbers grab links and URLs they detect on various websites (including links pointing to your site) and old/new URLs they detect on your site.

They also capture the “date stamp” of files when they visit your website so that they can identify pages with “new content” or “updated content.”
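The date stamp comparison described above can be illustrated with a minimal sketch. It assumes the date stamp is the value of an HTTP `Last-Modified` header and that the crawler keeps the date it last recorded for each URL; the function name and labels are hypothetical, not anything Google has documented.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def classify_by_date_stamp(last_modified_header: str,
                           previously_seen: Optional[datetime]) -> str:
    """Label a page by comparing its date stamp with the one recorded earlier."""
    current = parsedate_to_datetime(last_modified_header)
    if previously_seen is None:
        # No earlier record: the URL has never been seen before.
        return "new content"
    if current > previously_seen:
        return "updated content"
    return "unchanged"


# Hypothetical record: the page was last seen on 1 July 2008.
seen = datetime(2008, 7, 1, tzinfo=timezone.utc)
label = classify_by_date_stamp("Fri, 01 Aug 2008 10:00:00 GMT", seen)
print(label)
```

A page with a newer date stamp than the recorded one is labeled as updated, which is the signal the theory says moves it up the indexing queue.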

The URL grabbers read your robots.txt file and robots meta tags so that they can include or exclude the URLs you want or do not want indexed.

Note: The same URL with different session IDs is recorded as different “unique” URLs. For this reason, session IDs are best avoided. Otherwise, they can be misinterpreted as duplicate content.
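To see why session IDs multiply URLs, here is a small sketch that strips session parameters before comparing URLs. The parameter names in `SESSION_PARAMS` and the example URLs are assumptions chosen for illustration; real sites use many different names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust for your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


urls = [
    "http://www.example.com/page.php?id=7&sid=abc123",
    "http://www.example.com/page.php?id=7&sid=xyz789",
]
unique = {strip_session_id(u) for u in urls}
print(len(urls), "raw URLs ->", len(unique), "distinct page(s)")
```

Both raw URLs point at the same page, yet a crawler that compares them as plain strings would record two separate entries.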

The URL grabbers spend relatively little time and bandwidth on your website since their job is relatively simple. Keep in mind, however, that they need to scan 8 to 10 billion URLs on the web each month. This is not a petty job in itself, even for 1000 crawlers.

The URL grabbers write the captured URLs with their date stamps and other status in a “Master URL List” so that these can be deep-indexed by other special crawlers.

The master list is then processed and classified somewhat like this:
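A quick back-of-the-envelope calculation gives a feel for that workload. It uses the figures quoted above (8 to 10 billion URLs per month, 1000 crawlers) and assumes a 30-day month with the work spread evenly, which is a simplification.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def urls_per_second_per_crawler(total_urls: float, crawlers: int) -> float:
    """Average scan rate each crawler must sustain around the clock."""
    return total_urls / crawlers / SECONDS_PER_MONTH


low = urls_per_second_per_crawler(8e9, 1000)
high = urls_per_second_per_crawler(10e9, 1000)
print(f"{low:.2f} to {high:.2f} URLs per second per crawler")
```

Under these assumptions each crawler would have to check roughly 3 to 4 URLs every second, without pause, for the whole month.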

a) New URLs detected
b) Old URLs with new date stamp
c) 301 & 302 redirected URLs
d) Old URLs with old date stamp
e) 404 error URLs
f) Other URLs
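The classification above could be sketched roughly as follows. The `UrlRecord` fields and the rules in `classify` are assumptions made for illustration; the article only names the categories, not how a search engine decides between them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UrlRecord:
    """One hypothetical entry in the master URL list."""
    url: str
    status: int                      # HTTP status seen by the URL grabber
    date_stamp: Optional[datetime]   # date stamp captured on this visit
    known: bool                      # True if the URL was already in the list
    last_indexed: Optional[datetime] = None


def classify(record: UrlRecord) -> str:
    """Assign a record to one of the categories a) to f)."""
    if record.status == 404:
        return "e"  # 404 error URLs
    if record.status in (301, 302):
        return "c"  # 301 & 302 redirected URLs
    if record.status != 200 or record.date_stamp is None:
        return "f"  # other URLs that need further processing
    if not record.known:
        return "a"  # new URLs detected
    if record.last_indexed is None or record.date_stamp > record.last_indexed:
        return "b"  # old URLs with new date stamp
    return "d"      # old URLs with old date stamp


example = UrlRecord("http://www.example.com/", 200,
                    datetime(2008, 8, 1), known=True,
                    last_indexed=datetime(2008, 7, 1))
print(classify(example))
```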

The real indexing is done by what we’re calling a Deep Crawler. A deep crawler’s job is to pick up URLs from the master list, deep crawl each URL, and capture all the content: text, HTML, images, Flash, etc.

Priority is given to “Old URLs with new date stamp,” as they relate to already indexed but updated content. “301 & 302 redirected URLs” come next in priority, followed by “New URLs detected.”

High priority is given to URLs whose links appear on several other sites.

These are classified as “important” URLs. Sites and URLs whose date stamp and content change daily or hourly are stamped as “News” sites, which are indexed hourly or even minute-by-minute.

“Old URLs with old date stamps” and “404 error URLs” are skipped entirely. There is no point in wasting resources indexing “Old URLs with old date stamps,” since the search engine already has the content indexed and it has not changed since.

“404 error URLs” are URLs collected from various sites that turn out to be broken links or error pages. These URLs do not show any content.
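The priority rules described above can be expressed as a simple priority queue. This is only a sketch: the category labels are hypothetical, and using the inbound link count as a tie-breaker within a category is an assumption, since the article does not say how the two signals are combined.

```python
import heapq

# Lower rank is crawled first. Categories missing from this table
# (unchanged URLs, 404 URLs, others) never enter the deep-crawl queue.
CATEGORY_RANK = {"updated": 0, "redirected": 1, "new": 2}


def build_crawl_queue(records):
    """Build a heap of (rank, -inbound_links, url) tuples."""
    heap = []
    for url, category, inbound_links in records:
        rank = CATEGORY_RANK.get(category)
        if rank is None:
            continue  # skipped: nothing new to index
        heapq.heappush(heap, (rank, -inbound_links, url))
    return heap


def crawl_order(heap):
    """Pop URLs in priority order."""
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


records = [
    ("/new", "new", 50),
    ("/old-updated", "updated", 2),
    ("/moved", "redirected", 0),
    ("/stale", "unchanged", 10),
    ("/gone", "not_found", 0),
    ("/old-updated-popular", "updated", 40),
]
order = crawl_order(build_crawl_queue(records))
print(order)
```

In this sketch the stale and missing URLs never reach the queue, and among updated pages the one with more inbound links is crawled first.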

The “Other URLs” category may contain dynamic URLs, URLs with session IDs, or any of the following:

1. PDF documents
2. Word documents
3. PowerPoint presentations
4. Multimedia files
5. RSS
6. Video

Google needs to process these further and assess which ones are worth indexing and to what depth. It may allocate the indexing task to a “Special Crawler.”

When Google schedules the Deep Crawler to index a new URL or a 301 or 302 redirected URL, just the URLs (not the descriptions) start appearing in the search engine results pages when you run the search “site:www.domain.co” in Google.

Since Deep Crawlers need to crawl billions of web pages each month, they can take as long as 4 to 8 weeks to index even updated content. New URLs may take longer to index.

Once the Deep Crawlers index the content, it goes into their originating IDCs. Content is then processed, sorted, and replicated (synchronized) to the rest of the IDCs. A few years back, when the data size was manageable, this data synchronization used to happen once a month and last for five days. It was called the “Google Dance.”

When you hit www.google.com from your browser, you can land at any of their 10 IDCs depending on their speed and availability. Since the data at any given time is slightly different at each IDC, you may get different results at different times or on repeated searches of the same term (Google Dance).

The bottom line is that it may take 8 to 12 weeks to see full indexing in Google. One should consider this “cooking time in Google’s kitchen.”

The only practical way to speed up the indexing process is to increase the “importance” of your web pages by getting several incoming links from good sites, unless you know Sergey Brin and Larry Page and have a significant influence over them.

Dynamic URLs may take longer to index (sometimes they do not get indexed at all) since even small data sets can create unlimited URLs, which can clutter Google’s index with duplicate content.
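A short sketch shows how quickly dynamic URLs multiply. The parameter names and values here are invented for illustration; the point is only that the number of URLs is the product of the number of values each parameter can take.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical query parameters for a catalog with only four products.
PARAMS = {
    "product": [1, 2, 3, 4],
    "sort": ["price", "name", "date"],
    "view": ["list", "grid"],
    "per_page": [10, 20, 50],
}


def dynamic_urls(base: str, params: dict):
    """Yield one URL for every combination of parameter values."""
    keys = list(params)
    for combo in product(*(params[key] for key in keys)):
        yield f"{base}?{urlencode(dict(zip(keys, combo)))}"


all_urls = list(dynamic_urls("http://www.example.com/catalog.php", PARAMS))
print(len(all_urls), "URLs for", len(PARAMS["product"]), "products")
```

Four products produce 72 distinct URLs here, most of which show the same content in a different order or layout. Adding a session ID or a free-text parameter makes the count effectively unbounded.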

Summary & Advice:

Ensure that you have cleared all roadblocks for crawlers and that they can freely visit your site and capture all URLs. Help crawlers by creating good interlinking and sitemaps on your website.
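One common roadblock is a robots.txt rule that blocks more than intended. The sketch below uses Python's standard `urllib.robotparser` to check a list of URLs against a set of rules; the rules, user agent string, and URLs are made-up examples, and in practice you would load your site's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example rules only; substitute the contents of your own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def blocked_urls(parser: RobotFileParser, user_agent: str, urls):
    """Return the URLs that the given crawler is not allowed to fetch."""
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

pages = [
    "http://www.example.com/products/widget.html",
    "http://www.example.com/admin/login",
]
blocked = blocked_urls(parser, "Googlebot", pages)
print(blocked)
```

Running a check like this over the pages you want indexed is a cheap way to confirm that none of them is excluded by accident.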

Get lots of good incoming links to your pages from other websites to improve their importance. There is no special need to submit your website to search engines. Links to your website from other websites are sufficient.

Patiently wait for 4 to 12 weeks for the indexing to happen.

Disclaimer: The actual functioning and exact architecture of the search engines may vary, but in essence, this is what we believe they do.

Also see Advanced SEO Tips

Post excerpts from Atul Gupta, the founder & CEO of RedAlkemi
