Ian Ring

Be a Normalizer – a C14N Exterminator

by | Guestwhore Posts

Hi Sugarrae readers! I’ve prepared something a little less fun today. To make my guest post worthwhile (and to donate about 3,000 words of heavy SEO material to Rae’s blog), we’re going back to SEO Kindergarten to review one of the most basic tenets of SEO – canonicalization (C14N) – in excruciating detail. Examples here include excerpts from one chapter of Advanced URL Management, a work in progress currently looking for a publisher. (If anyone from O’Reilly is reading this – let’s talk)

If you’re in a hurry, skip to the good part. Otherwise, read on. If you’re not interested in SEO, go watch me put mentos up my nose and immerse my head in Diet Coke.

Rae reminded all us whores not to forget introductions… Hello. My name is Ian Ring. I met Sugarrae during a stint working in Guelph a couple of years ago. I build websites long time. My interests include random poetry, building facebook apps, and eating sandwiches. Some of you may know me by a pseudonym. Let’s move on.

Become a Canonical URL Exterminator

Canonical issues arise when more than one URL may be used to deliver the same resource. Most canonical problems are caused by forgiving agents and overly tolerant servers – given a set of rules, these machines assume that even though you asked for B, you really wanted A. So they give you the content of A, as though it were actually B. Canonical bugs are fixed by systematically removing or overriding those assumptions.

The problem of canonical URLs is fairly basic SEO 101 stuff, and anyone who has been reading Rae’s blog is probably yawningly aware of what they are, why they’re bad, and so on. Yet even some experienced SEOs, while they know a Canonical Error when they see one, don’t know where to look to find them. Or they’ll know a few common problems, but they don’t have a systematic way to sweep a site looking for every possible canonicalization vulnerability.

My opinions reflect an assertion that bots are stupid. I’m not accusing any particular bot or engine, and actually the major engines are pretty good at interpreting URLs, avoiding double-indexing, and they generally handle URL normalization intelligently, notwithstanding some documented historical fuckups. But face it – when a page is badly indexed, the fault is rarely with the engine. If your URL architecture is perfect, there’ll be nothing for them to screw up. Indexing problems are ultimately the result of a webmaster not applying diligence in URL management, and/or placing unwarranted faith in bots not to be stupid.

Let’s define this as tenet #1: Bots are stupid. Assume this to be true a priori.

Canonical problems are elusive, because you’re not searching for something on the site that can be found by following links and running a spell-check. A Canonical error is one that you can only diagnose by “linking outside the box” – trying unusual, sometimes wacky, alternative URLs to see whether your site delivers dupe content.

To be certain that there are no Canonical problems on a site, you need to tackle the task systematically. Starting with the canonical URL, there are three ways that a URL can be modified. They are:

Omissions:

something should be there but isn’t.

Inclusions:

something is there that shouldn’t be.

Modifications:

it’s there, but it’s wrong.

These aberrations may appear in any of the URL parts: protocol, port, subdomain, domain, file path, file name, and querystring. I’ll omit the user, password, and fragment portions of a URL since they rarely cause SEO or canonicalization problems.

First I’ll start with a few general tips for canonicalization.

Canonicalization Tips

Tip 1: Just use lower case

… for everything. I take a conservative approach to c14n. I prefer everything* in lower-case, even where the HTTP spec doesn’t indicate case sensitivity. I don’t care if the HTTP spec says domains are case insensitive … if your server is set up to convert all URLs to lower case, it solves a whole set of canonical issues all at once.

As a side effect of forcing the entire URL into lower case, I even force letters in escape sequences to lower case, e.g. “%3A” (a urlencoded colon) becomes “%3a”. Purists lay off – the URI spec (RFC 3986) says the hex digits in escape sequences are case-insensitive, even though it recommends upper case for consistency.

Some might argue that it doesn’t matter if you use “%3A” or “%3a” – the engines will interpret and normalize URLs properly. I reply: Bots are stupid. See tenet #1.

* exception: user-entered values in the querystring should not be converted, eg. in a search query
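A minimal sketch of Tip 1 in Python, assuming the querystring is left untouched so that user-entered values survive. The function name `lowercase_url` is made up for illustration, and a real site would need to decide separately what to do with querystring variable names.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_url(url):
    """Lower-case the scheme, host and path; leave the querystring as supplied."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),   # note: this would also lower-case any user:password part
        parts.path.lower(),     # also turns "%3A" into "%3a"
        parts.query,            # assumption: untouched, it may hold user-entered values
        parts.fragment,
    ))


def needs_redirect(url):
    """True when the requested URL differs from its lower-case form."""
    return url != lowercase_url(url)
```

If `needs_redirect` is true, the server would answer with a 301 to `lowercase_url(url)` rather than serving the page.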

Tip 2: If there’s a default value for something, omit it.

Well, you can either omit it or force it – for instance I tend to force the default “www” subdomain onto URLs if it’s missing. I could just as easily force it to be absent. Whatever – choose one, and be consistent.

Check each part of the URL for any default values that can be omitted. For example, your default HTTP port is probably 80. So when someone requests this:
http://www.example.com:80/page.php, it’s the same as requesting
http://www.example.com/page.php. If you receive a request for port 80, you should redirect to the canonical URL with the default port removed.
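As a rough illustration, the default port can be stripped like this in Python. The `DEFAULT_PORTS` table and the function name are assumptions for the sketch. It ignores the user and password parts of a URL and does not handle bracketed IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed defaults; adjust if your server listens elsewhere.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url):
    """Return the URL with the port removed when it is the scheme's default."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname comes back lower-cased
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"    # keep any non-default port
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

A request whose URL changes under `strip_default_port` is a candidate for a 301 to the returned value.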

The same default value hunting follows through the rest of the URL parts – the default file name may be “index.php”. The default protocol is http:// … try requesting your canonical URL with and without each of these bits to see if it is:

  1. Canonical, and delivered OK as expected,
  2. Non-canonical, and resolved by your browser, thus doesn’t need any redirection (e.g. most browsers will add the “http” for you, if you omit it.)
  3. Non-canonical, and redirected by your server with a 301 Permanent Redirection status response, or
  4. Erroneous, and handled with a 404 Not Found error page.

Dealing with default values is pretty simple, everywhere except in the querystring. Querystrings are very easy to use badly, and even when they’re used correctly, they can be canonically weak.

Some might argue that canonicalization in the querystring is a non-issue. They suggest that all the search engines always interpret and normalize URLs with querystrings properly, and it doesn’t matter if your URL has some weird shit after the question mark. I reply: Bots are stupid. See tenet #1.

Consider a script that shows paginated data. Page one of the data may have a URL like:

http://www.example.com/page.php?p=1

In this hypothetical situation, our friendly developers have realized that they can’t assume a pagination variable will always be supplied, so they’ve written their code to show page 1 if no “p” variable is present. In fact, they show page 1 data whenever p is not a positive integer. The intent is good, and they get marks for writing code that “fails gracefully”, but a better solution would have been to force a 301 redirection to the canonical URL, rather than show the same content without complaint.

The developers don’t tell you these things. You (as SEO) need to go in, look, test, and identify whether the omission or inclusion of a default value affects canonicalization.

First, choose which you’re going to canonicalize: the URL with the value, or the URL without. If you’re having trouble deciding, let me help you – leave it out. Brief URLs are good. And if you always choose the shorter option, you’ll never need to remember what method you used in which situations – so the rule of thumb will be: if the value is default, omit it.
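Here is one way the "if the value is default, omit it" rule might look in Python. The names `drop_default_page` and `respond` are hypothetical, and the sketch only covers the exact default value "1"; junk values such as "1.0" or "ianring" need separate validation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_default_page(url, name="p", default="1"):
    """Remove the pagination variable when it carries its default value."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if not (k == name and v == default)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


def respond(url):
    """Return (status, location): 301 to the canonical URL, or 200 to serve it."""
    canonical = drop_default_page(url)
    if canonical != url:
        return 301, canonical
    return 200, None
```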

Where the default value of querystring variable “p” is “1”, a diligent canonical exterminator will know that all of these are potential canonical dupes, and will check them:

  • http://www.example.com/page.php
  • http://www.example.com/page.php?
  • http://www.example.com/page.php?p=
  • http://www.example.com/page.php?p=1
  • http://www.example.com/page.php?p=1.0
  • http://www.example.com/page.php?p=1.00000
  • http://www.example.com/page.php?p=1.5
  • http://www.example.com/page.php?p=1&p=2
  • http://www.example.com/page.php?p=2&p=1
  • http://www.example.com/page.php?p=-1
  • http://www.example.com/page.php?p=-5
  • http://www.example.com/page.php?p=ianring
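A list like the one above is tedious to type by hand for every variable. A small Python helper can build it; this is only a sketch, and the function name and the particular junk values are assumptions taken from the list above.

```python
def default_value_variants(base_url, name, default):
    """Build the URLs an auditor might request to probe one querystring default."""
    junk_values = [
        "",                  # declared but undefined
        default,             # the default, spelled out
        f"{default}.0",      # decimal forms of the default
        f"{default}.00000",
        f"{default}.5",      # not an integer
        f"-{default}",       # negatives
        "-5",
        "ianring",           # not a number at all
    ]
    urls = [base_url, base_url + "?"]
    urls += [f"{base_url}?{name}={value}" for value in junk_values]
    # The same variable supplied twice, in both orders.
    urls.append(f"{base_url}?{name}={default}&{name}=2")
    urls.append(f"{base_url}?{name}=2&{name}={default}")
    return urls
```

Calling `default_value_variants("http://www.example.com/page.php", "p", "1")` yields the twelve URLs listed above, ready to feed into a link checker.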

Naturally, if your web application uses URL rewriting to camouflage the querystring as a file path, the same issues will need to be verified in whatever parts of the URL they occupy. I see this done badly all the time.

Tip 3: If you’re going to fluff your URL, have the cojones to validate your fluff.

Too commonly, I’ll see URLs like this one:

http://www.example.com/restaurants/46335/jimmys-lunch-hamilton-ontario.php

The “46335” is a dead giveaway. That’s a database row identifier. Everything after that slash is URL fluff, put there to make the file name keywordy. Most of the time, developers are using a loose-ended regular expression to rewrite the URL, and they’re not verifying that “46335” is actually Jimmy’s Lunch in Hamilton. Ha! I guarantee, 9 times out of 10, you could request this, and get the same thing:

http://www.example.com/restaurants/46335/i-can-type-anything-i-want-here.php

Note to scraping hobbyists: integer row IDs are lovely because they’re so predictable. When you find a loose end like this one, use a spreadsheet to construct URLs with all the numbers from 1 to a kajillion. Save it as a TXT file and feed the URL list into your favourite downloading program. Parse the HTML you get back, et voila: you have stolen their entire database of stuff.

Note to malicious competitors and hackers: load up your blog spammer with a kajillion permutations of that link, each with a different random phrase as the file name. Stir and wait. Then dance with joy as the bots discover a kajillion different URLs showing the same content on your competitor’s site. Goodbye, organic traffic!

Note to webmasters: don’t make it easy for scraping hobbyists and malicious competitors and hackers. Keep your canonicalization orifices sealed.
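Sealing this particular orifice means checking the fluff against the row it claims to describe. The Python sketch below assumes a `RESTAURANTS` dictionary standing in for a database lookup and a made-up `resolve` function; the URL pattern is modelled on the example above.

```python
import re

# Hypothetical stand-in for a database: row id -> the one true slug.
RESTAURANTS = {46335: "jimmys-lunch-hamilton-ontario"}

# Anchored at both ends, so nothing loose can trail after ".php".
PATTERN = re.compile(r"^/restaurants/(\d+)/([^/]+)\.php$")


def resolve(path):
    """Return (status, location) for a requested path."""
    match = PATTERN.match(path)
    if not match:
        return 404, None
    row_id = int(match.group(1))
    canonical_slug = RESTAURANTS.get(row_id)
    if canonical_slug is None:
        return 404, None                      # no such row
    canonical_path = f"/restaurants/{row_id}/{canonical_slug}.php"
    if path != canonical_path:
        return 301, canonical_path            # right row, wrong fluff
    return 200, None
```

With this in place, the "i-can-type-anything-i-want-here" request gets a 301 to the real URL instead of a 200 with duplicate content.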

Example from a famous website:
http://www.amazon.com/indigo+can+bite+my+anus/dp/1597491543/i+can+write+anything+i+want+here/yes+i+can/you+can+too

And another:
http://www.chapters.indigo.ca/books/amazon-sucks-donkey-balls/9780470170779-item.html

Tip 4: Be diligent with your querystrings

The querystring is where most potential canonical issues reside, and they’re the most difficult ones to fix. The problems you might find in the querystring include:

  1. variables there that aren’t needed
  2. default values being defined
  3. variables not in a canonical sequence
  4. values out of range
  5. lingering state variables

I’ll take them one by one:

Variables there that aren’t needed

Say you have a URL like http://www.example.com/page.php?a=1&b=2
If the value of “b” isn’t needed on this page, it shouldn’t be there.

Every script (I say this confidently) can enumerate what variables it needs to identify a resource. When anything appears in the querystring that isn’t expected or needed, it’s a potential canonical issue.

To fix this, have each script create an array of the variables that are valid for it. Pass that array into a function that iterates through the list, picking those variables out of the querystring to create a new array. Join that new array on “&” and compare it to the original querystring. If the strings are different, reattach the new one and redirect. This routine is called “scrubbing”.

A querystring scrub will remove garbage added by referring links, such as affiliate codes, tracking identifiers, or just garbage: http://www.example.com/page.php?foo=bar
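The scrubbing routine described above could be sketched in Python as follows. The function name is an assumption, and when a variable appears more than once this version keeps only the last value, which is a design choice rather than a rule.

```python
from urllib.parse import parse_qsl, urlencode


def scrub_querystring(query, allowed):
    """Keep only the variables the script declares, in the order it declares them."""
    supplied = dict(parse_qsl(query, keep_blank_values=True))  # last duplicate wins
    kept = [(name, supplied[name]) for name in allowed if name in supplied]
    return urlencode(kept)


# Usage: page.php only needs "a" and "b".
original = "a=1&foo=bar&b=2"
scrubbed = scrub_querystring(original, ["a", "b"])
if scrubbed != original:
    pass  # here the script would send a 301 to the URL with the scrubbed querystring
```

Because `urlencode` re-encodes every value, a querystring that was valid but encoded differently will also trigger a redirect, which suits the conservative approach taken here.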

Some will argue that this is a non-issue, and search engines won’t care if there’s an affiliate tracking code on the URL – they’ll always index things perfectly and intelligently. I reply: Bots are stupid. See tenet #1.

Default values being defined

This is discussed under Tip 2 – when your querystring contains any default value that doesn’t need to be there, remove it.

Variables not in canonical sequence

I’ve had many arguments about this one. To a programmer, there is no difference between these two URLs:

http://www.example.com/page.php?a=1&b=2

http://www.example.com/page.php?b=2&a=1

They both describe the same collection of variables, with their values equally accessible by name. But when compared as strings, the URLs are different. The common (and very reasonable) argument is “it doesn’t matter”. My response is my mantra: Bots are stupid. See tenet #1.

The simplest (and suggested) fix for this problem is to push your querystring collection into an array, sort it alphabetically by name, rejoin the array with “&”, then compare the sorted string to the original. If it changed, reattach the sorted querystring and redirect.
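In Python, that sort-and-compare step might look like the sketch below. The function name is hypothetical. Python's sort is stable, so repeated variables keep their relative order.

```python
from urllib.parse import parse_qsl, urlencode


def sort_querystring(query):
    """Return the querystring with its variables sorted alphabetically by name."""
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode(sorted(pairs, key=lambda pair: pair[0]))


def needs_reorder(query):
    """True when the querystring is not already in canonical sequence."""
    return query != sort_querystring(query)
```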

Values out of range

This one is nefarious and very common. Revisiting our hypothetical paginated results, imagine that our document has only 13 pages. What is the expected response from these URLs?
http://www.example.com/page.php?page=14 (out of range)
http://www.example.com/page.php?page=500 (way out of range)
http://www.example.com/page.php?page=-1 (negative pages – duh)
http://www.example.com/page.php?page=ianring (not even a positive integer)
http://www.example.com/page.php?page= (undefined)

… I’d expect all of these requests to return a 404 Not Found. If you’re especially kindhearted, you might give the user a 301 sending them to page one, especially with that last undefined one. If the response is anything besides those two, it’s a canonical issue.

Loose value validation is another instance where you’ve inadvertently created a potentially infinite number of URLs that show the same resource. A competitor could create a kajillion such URLs and send them out as bot food, and those bots could see them all as dupes.
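Strict value validation for the 13-page example could be sketched like this in Python. The function name and the return convention are assumptions: it returns a status and, for a 301, the page number to redirect to.

```python
import re


def page_response(raw_value, total_pages):
    """Decide how to answer a request for ?page=<raw_value>."""
    if raw_value == "":
        return 301, 1                     # the kindhearted option: send them to page one
    if not re.fullmatch(r"[0-9]+", raw_value):
        return 404, None                  # negatives, decimals, words
    page = int(raw_value)
    if page < 1 or page > total_pages:
        return 404, None                  # out of range
    if raw_value != str(page):
        return 301, page                  # e.g. "007" is page 7 written badly
    return 200, page
```

Every input maps to exactly one of 200, 301 or 404, so there is no value a competitor can invent that yields a fresh 200.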

Lingering state variables

These are really annoying.

http://www.example.com/page.php?back=/prevpage.php

The developer thought it would be nice to put the referrer in the querystring, so scripts on the page could send you back where you came from. This is somewhat tolerable on “login” pages, when you’re wrenched away to authenticate, then after signing in, flung back to the page you wanted. But generally it’s a bad idea to put state variables in the querystring. Lots of people do it. Sometimes there’s no alternative.

Or how about this gem:

http://www.example.com/page.php?referrer=Adwords

That last one – the “referrer” – is not needed to identify a resource. It’s a tracking code. Lest some silly bot think that the page with “referrer” is different from the one without, your best practice is to “scrub” the querystring – put your referrer data into a session or database, remove that variable and redirect.
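A rough Python sketch of that idea: lift the tracking variable into a session, then redirect to the clean URL. The `TRACKING_VARS` set and the plain dictionary used as a session are placeholders for whatever your application really uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of variables that track state rather than identify a resource.
TRACKING_VARS = {"referrer"}


def strip_tracking(url, session):
    """Move tracking variables into the session and return the clean URL."""
    parts = urlsplit(url)
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in TRACKING_VARS:
            session[name] = value         # remembered, but no longer in the URL
        else:
            kept.append((name, value))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))
```

If the returned URL differs from the requested one, the server sends a 301 to it, and the tracking data is still available from the session.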

Exterminating Canonical Errors, Step By Step

Now, let’s get down to the task at hand: non-canonical URL extermination.

You will need:

  • A live website to be audited
  • A spreadsheet program, like Excel
  • A downloading/link checking tool that returns a report of status headers.
  • A Taxonomy of Canonical Errors (below)

Choose any page on the site for which you’re doing a canonicalization audit. Decide upon a canonical URL, keeping in mind the tips lovingly revealed above. Then using the taxonomy below, construct variations of that URL with an assortment of aberrations, all of which you would hope return a 404 Not Found error, or a 301 redirection to your canonical URL. This list will be URLs that are incorrect, but could possibly deliver dupe content. Warning: it could be quite a long list! One that I created recently using computed permutations of 10 different potential canonicalization errors produced a list of about 1700 URLs, after many with invalid syntax were removed.
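Computing the permutations is easy to script. The Python sketch below is not the tool used for the list of 1700; it only shows the idea, and the handful of alternatives per URL part are examples you would replace with your own.

```python
from itertools import product


def url_aberrations(canonical, hosts, ports, paths, queries, scheme="http://"):
    """Combine alternatives for each URL part into a sorted list of test URLs."""
    urls = {
        scheme + host + port + path + query
        for host, port, path, query in product(hosts, ports, paths, queries)
    }
    urls.discard(canonical)               # the canonical URL itself is not an aberration
    return sorted(urls)


# Example alternatives (assumptions, pick your own from the taxonomy).
aberrations = url_aberrations(
    "http://www.example.com/page.php",
    hosts=["www.example.com", "example.com", "WWW.EXAMPLE.COM"],
    ports=["", ":80"],
    paths=["/page.php", "//page.php", "/PAGE.PHP", "/x/../page.php"],
    queries=["", "?", "?foo=bar"],
)
```

Three hosts, two ports, four paths and three querystrings give 72 combinations, one of which is the canonical URL, so even this tiny example produces 71 URLs to test.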

Put all your aberrations into a spreadsheet or txt file. Feed this list into a downloading script, and generate a report showing the status header returned from each request.

Interpret the report. None of the URLs should return a 200 OK – they must all return either a 301 to the canonical url, or a 404 if the URL is too ambiguous.
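Interpreting the report can be automated too. This Python sketch assumes your link checker can give you, for each URL, the status code and the Location header (if any); the tuple layout and the function name are made up for illustration.

```python
def audit_report(results, canonical):
    """Return the rows that fail the audit.

    results: iterable of (url, status, location) tuples from a link checker.
    A row passes only if it is a 404, or a 301 pointing at the canonical URL.
    """
    failures = []
    for url, status, location in results:
        if status == 404:
            continue
        if status == 301 and location == canonical:
            continue
        failures.append((url, status, location))
    return failures
```

Anything left in the returned list is either a 200 serving dupe content, a redirect to the wrong place, or a redirect of the wrong kind (a 302, for example).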

You’ll want to repeat this process with a variety of pages on the site. Include the home page, some interior content pages, and especially any pages that have dynamic content, or which rely on a querystring.

I have divided the most common canonical issues into a taxonomy, published for the first time below. Please excuse the bulleted lists; Rae’s CSS seems to override my styling and I can’t be bothered to fix it.

A Taxonomy of Canonicalization Errors

© Ian Ring 2008, published with permission on sugarrae.com

  • 1 extra characters in the URL
    • 1.1 in the subdomain, e.g. “www”
    • 1.2 in the port, e.g. “:80”
    • 1.3 in the file path, see URL fluff
      • 1.3.1 extra inserted slashes
      • 1.3.2 dotted modifiers, e.g. “x/y/z/../../a/b/../../../page.php”
    • 1.4 in the file name, e.g. “index.php”, also see URL fluff
    • 1.5 in the querystring
      • 1.5.1 A “?” with no query after it
      • 1.5.2 More than one “?”
      • 1.5.3 Extra “&” characters
      • 1.5.4 fictitious querystring variables
  • 2 missing characters in the URL
    • 2.1 in the file path
      • 2.1.1 missing trailing slash on path
    • 2.2 in the querystring
      • 2.2.1 undeclared and undefined variables (eg “?a=” or “?=2”)
  • 3 modified/incorrect characters in the URL
    • 3.1 in the subdomain
      • 3.1.1 see if “blog.example.com” delivers the same content as “example.com/blog”
    • 3.2 in the domain
      • 3.2.1 using IP address instead of the domain name
    • 3.3 in the file path
      • 3.3.1 extended characters, and case sensitivity
      • 3.3.2 inconsistent use of “+” and “%20”
    • 3.4 in the file name
      • 3.4.1 badly mapped extensions, e.g. *.php, *.htm, *.html all mapped to the same script
      • 3.4.2 extended characters, and case sensitivity
      • 3.4.3 inconsistent use of “+” and “%20”
    • 3.5 in the querystring
      • 3.5.1 variables not in canonical sequence
      • 3.5.2 out of range querystring values
      • 3.5.3 lingering state variables
      • 3.5.4 inconsistent use of “+” and “%20”
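Two of the file path entries in the taxonomy, extra inserted slashes and dotted modifiers, can be caught with very little code. The Python sketch below leans on the standard library's `posixpath.normpath`; the function name is an assumption, and it treats the path as plain text without decoding escape sequences.

```python
import posixpath
import re


def normalize_path(path):
    """Collapse repeated slashes and resolve "." and ".." segments in a URL path."""
    if not path:
        return "/"
    trailing = path.endswith("/")
    collapsed = re.sub(r"/{2,}", "/", path)   # normpath alone keeps a leading "//"
    normalized = posixpath.normpath(collapsed)
    if trailing and normalized != "/":
        normalized += "/"                     # keep the trailing slash, if there was one
    return normalized
```

If `normalize_path(path)` differs from the requested path, that request is a candidate for a 301 to the normalized form.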

If you use this list to find canonicalization problems, you’ve done a good audit. The next step is to bring your findings back to the developers and webmaster, and get them to fix the problems. If the webmaster is serious about canonicalization, it behooves them to build a URL normalizer – a script that runs on each page load looking for known patterns and rewriting, redirecting, scrubbing, sessionizing, and acting appropriately. Many of these issues can be solved with regular expressions in a rewrite module such as Apache’s mod_rewrite (configured in an .htaccess file) or ISAPI_Rewrite for IIS. Others require diligent refactoring in web applications so that value validation throws exceptions to a canonicalization routine. The construction of that URL normalizer is outside the scope of guestwhore duties.

If your website is properly guarded against canonicalization errors, I should be able to throw any messed-up URL at it and it will behave predictably and appropriately. And know that if your SEO expert doesn’t find these problems, your visitors (including bots) certainly will.

Happy normalizing!

Now go watch me put mentos up my nose and immerse my head in Diet Coke.


About the Author

Ian Ring

I was abandoned in the jungle in India as a baby and raised by a family of wolves. As a young boy I befriended a goofy singing bear and a sarcastic jaguar. My enemies included a huge boa constrictor with hypnotic eyes and a man-eating tiger. Eventually I left the wolf pack and traveled to the man-village, where I learned HTML and web development.

 
