W5P1: Search Engine Tools

Common Search Engine Protocols

Search engines want websites and content in certain formats to make their job easier, so they provide a wide variety of help including guidance, tools, and analytics. These free resources give webmasters (and SEOs) an opportunity to exchange information with the search engines that is not available anywhere else. Below are some of the more common elements that each of the search engines support and why they are useful.

Sitemaps

Sitemaps give webmasters (and SEOs) an opportunity to help search engines crawl and classify content they may not have found on their own by indicating where that content is located within the website. Sitemaps can also highlight different types of content, including video, images, news, and mobile versions.

READ: For a complete explanation of sitemaps and sitemap protocols, check out Sitemaps.org. They describe the XML schema in detail, offer examples, and supply answers to frequently asked questions.

READ: Building sitemaps can be very easy or quite complex depending on the complexity of your website and your needs. To assist you with the process, Reid Bandremer (lunametrics.com) wrote this article: Building the Ultimate XML Sitemap. This 10-step guide focuses on how to use sitemaps as an SEO tool that addresses the specific needs of your website and the SEO campaign.

Sitemaps come in three varieties:

XML (Extensible Markup Language) This is the most commonly accepted and, therefore, recommended format for sitemaps. XML is easy to create using one of the many sitemap generators and offers search engines an easy way to parse website information while allowing webmasters the greatest flexibility for controlling page parameters (see the sketch after this list).

RSS (Really Simple Syndication or Rich Site Summary) RSS sitemaps can be coded to automatically update when new content is added to the website allowing for easy maintenance. RSS is a dialect of XML but it’s much harder to manage due to its updating properties.

Text File The text sitemap format is one URL per line, up to 50,000 lines. A text sitemap is extremely easy to create but does not provide the ability to add metadata to pages.
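Example of an XML Sitemap

Below is a minimal sketch following the XML schema documented at Sitemaps.org; the URLs, dates, and priority values are placeholders for illustration only.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page; only <loc> is required -->
  <url>
    <loc>http://www.yourdomain.com/</loc>
    <lastmod>2014-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.yourdomain.com/about.html</loc>
    <lastmod>2014-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

The optional lastmod, changefreq, and priority elements are the page-level parameters webmasters can use to hint at how fresh and how important each URL is.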

Robots.txt File

The robots.txt file is a file stored in a website’s root directory that gives instructions to automated search engine spiders. Defined by the Robots Exclusion Protocol, robots.txt files allow webmasters to indicate which areas of a website they would like to disallow bots from crawling, as well as the locations of sitemap files and crawl-delay parameters.

READ: Aaron Wall (SEOBook) offers a detailed tutorial on robots.txt files including background information, formatting examples, how to set crawl delays, tips and tricks, and tools to analyze existing robots.txt files.

The following commands are available:

Disallow Prevents compliant robots from accessing specific pages or folders.

Sitemap Indicates the location of a website’s sitemap or sitemaps.

Crawl Delay Indicates the delay (in seconds) a robot should wait between requests to the server. Two major crawlers – Yahoo ("slurp") and Bing ("msnbot") – support the "crawl-delay" directive in robots.txt.

Example of Robots.txt

# Robots.txt www.yourdomain.com/robots.txt
User-agent: *
Disallow:

# Don't allow spambot to crawl any pages
User-agent: spambot
Disallow: /

# Set crawl delay in seconds
User-agent: *
Crawl-delay: 3

Sitemap: http://www.yourdomain.com/sitemap.xml
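Disallow can also target a single folder or page rather than the entire site. As a short standalone sketch (the /archives/ folder name is a hypothetical example):

# Keep compliant robots out of the /archives/ folder only
User-agent: *
Disallow: /archives/

Everything outside the listed path remains crawlable because only that folder is blocked.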

Caution: Not all web spiders follow robots.txt instructions. Some people with bad intentions (e.g., e-mail address scrapers, personal information scrapers) purposely build bots that don’t follow this protocol. It is recommended that the locations of administration or private sections not be included in the robots.txt file. A better practice is to use the meta robots tag to keep the major search engines from indexing high-risk content areas.

READ: Sometimes you’ll want to block a specific search spider from crawling your website. Here is a directory of search spiders, listed by name and search engine, along with their current status (active versus inactive).

Meta Robots Tag

The meta robots tag gives webmasters the ability to create page-level instructions for search engine spiders, and each search engine offers documentation on how to use the tag.

The meta robots tag should be included in the head section of an HTML document.

Example of Meta Robots

<html>
<head>
<title>This is an example title</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head>
<body>
<h1>Hello World</h1>
</body>
</html>

In the example above, “NOINDEX, NOFOLLOW” tells robots not to include the given page in their indexes, and also not to follow any of the links on the page.

Robots Meta Instructions

<meta name="ROBOT NAME" content="INSTRUCTIONS">

Robot Name can be either "robots" for all robots or the user-agent of a specific robot (see the sketch after the list below).

  • noindex: Google, Yahoo, Ask, Bing – Page not indexed
  • nofollow: Google, Yahoo, Ask, Bing – All links on page become “nofollow”
  • noarchive: Google, Yahoo, Ask, Bing – Page not cached
  • noodp: Google, Yahoo, Bing – Stops description and title tag overwrite from DMOZ (only for Homepage)
  • noydir: Yahoo – Stops description and title tag overwrite by Yahoo Directory
  • nosnippet: Google – Stops Google from generating description based on page text
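To target a single crawler rather than all robots, the name attribute can carry that crawler’s user-agent. A brief sketch, using "googlebot" as the example user-agent (the directive combination is illustrative):

<meta name="googlebot" content="noindex, noarchive">

Placed in the head section, this tells Google’s crawler not to index or cache the page while leaving instructions for other crawlers unchanged.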

Rel=”nofollow”

Do you remember how links were described as “votes” for a website’s popularity? The rel="nofollow" attribute allows you to link to a resource while withholding your “vote” for SEO purposes, and it is useful when you are linking to an untrusted source. The “nofollow” value tells search engines not to follow the link; however, some search engines still follow these links when looking for new web pages.

WATCH: In this video, Matt Cutts (Head of Search Spam at Google) offers advice on how to use the rel="nofollow" attribute to limit “votes” to other websites and its use in certain situations when linking to unknown websites (or websites with unknown link neighbors).

Example of nofollow

<a href="http://www.yourdomain.com" title="Example" rel="nofollow">Example Link</a>

In the example above, the value of the link would not be passed to yourdomain.com as the rel=nofollow attribute has been added.

Rel=”canonical”

Often, two or more copies of the exact same content appear on your website under different URLs.

For example, the following URLs can all refer to a single homepage:

http://www.example.com/
http://www.example.com/default.asp
http://example.com/
http://example.com/default.asp
http://Example.com/Default.asp

You may think all of these addresses refer to the home page (and they do), but search engines see them as five separate web pages. Unfortunately, the content is identical at each URL, and this can cause the search engines to devalue the website’s content and its potential rankings.

Using the canonical tag will solve this problem by informing search engine spiders which web page is the “authoritative” version that should be counted in the search query results.

Example of rel=”canonical”

For the URL http://yourdomain.com/default.asp

<html>
<head>
<title>This is an example title</title>
<link rel="canonical" href="http://www.yourdomain.com">
</head>
<body>
<h1>Hello World</h1>
</body>
</html>

In the example above, rel="canonical" tells robots that this page is a copy of http://www.yourdomain.com and that they should treat the latter URL as the canonical (authoritative) version.

NEXT: Search Engine Services
