Ian Ring

Be a Normalizer - a C14N Exterminator

by Ian Ring on February 22, 2008 | Guestwhore Posts


Hi Sugarrae readers! I’ve prepared something a little less fun today. To make my guest post worthwhile (and to donate about 3,000 words of heavy SEO material to Rae’s blog), we’re going back to SEO Kindergarten to review one of the most basic tenets of SEO – canonicalization (C14N) - in excruciating detail. Examples here include excerpts from one chapter of Advanced URL Management, a work in progress currently looking for a publisher. (If anyone from O’Reilly is reading this – let’s talk)

If you’re in a hurry, skip to the good part. Otherwise, read on. If you’re not interested in SEO, go watch me put mentos up my nose and immerse my head in Diet Coke.

Rae reminded all us whores not to forget introductions… Hello. My name is Ian Ring. I met Sugarrae during a stint working in Guelph a couple of years ago. I build websites long time. My interests include random poetry, building facebook apps, and eating sandwiches. Some of you may know me by a pseudonym. Let’s move on.

Become a Canonical URL Exterminator

Canonical issues arise when more than one URL may be used to deliver the same resource. Most canonical problems are caused by forgiving agents and overly tolerant servers – given a set of rules, these machines assume that even though you asked for B, you really wanted A. So they give you the content of A, as though it were actually B. Fixing canonical bugs means systematically removing or overriding those assumptions.

The problem of canonical URLs is fairly basic SEO 101 stuff, and anyone who has been reading Rae’s blog is probably yawningly aware of what they are, why they’re bad, and so on. Yet even some experienced SEOs, while they know a Canonical Error when they see one, don’t know where to look to find them. Or they’ll know a few common problems, but they don’t have a systematic way to sweep a site looking for every possible canonicalization vulnerability.

My opinions reflect an assertion that bots are stupid. I’m not accusing any particular bot or engine, and actually the major engines are pretty good at interpreting URLs and avoiding double-indexing, and they generally handle URL normalization intelligently, notwithstanding some documented historical fuckups. But face it - when a page is badly indexed, the fault is rarely with the engine. If your URL architecture is perfect, there’ll be nothing for them to screw up. Indexing problems are ultimately the result of a webmaster not applying diligence in URL management, and/or placing unwarranted faith in bots not to be stupid.

Let’s define this as tenet #1: Bots are stupid. Assume this to be true a priori.

Canonical problems are elusive, because you’re not searching for something on the site that can be found by following links and running a spell-check. A Canonical error is one that you can only diagnose by “linking outside the box” – trying unusual, sometimes wacky, alternative URLs to see whether your site delivers dupe content.

To be certain that there are no Canonical problems on a site, you need to tackle the task systematically. Starting with the canonical URL, there are three ways that a URL can be modified. They are:
Omissions: something should be there but isn’t.
Inclusions: something is there that shouldn’t be.
Modifications: it’s there, but it’s wrong.

These aberrations may appear in any of the URL parts: protocol, port, subdomain, domain, file path, file name, and querystring. I’ll omit the user, password, and fragment portions of a URL since they rarely cause SEO or canonicalization problems.

First I’ll start with a few general tips for canonicalization.

Canonicalization Tips

Tip 1: Just use lower case

… for everything. I take a conservative approach to c14n. I prefer everything* in lower-case, even where the HTTP spec doesn’t indicate case sensitivity. I don’t care if the HTTP spec says domains are case insensitive … if your server is set up to convert all URLs to lower case, it solves a whole set of canonical issues all at once.

As a side effect of forcing the entire URL into lower case, I even force letters in escape sequences to lower case, e.g. “%3A” (a urlencoded colon) becomes “%3a”. Purists lay off - the URI spec (RFC 3986) treats the hex digits in escape sequences as case-insensitive, even though it prefers upper case when normalizing.

Some might argue that it doesn’t matter if you use “%3A” or “%3a” – the engines will interpret and normalize URLs properly. I reply: Bots are stupid. See tenet #1.

* exception: user-entered values in the querystring should not be converted, e.g. in a search query
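
Here is a minimal sketch of Tip 1 in Python, assuming the simplest possible reading of the exception above: the whole querystring is left untouched, since the code can't know which values were typed by a user. The function name lowercase_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_url(url):
    """Lower-case scheme, host and path; leave the querystring as it arrived."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),   # this also turns "%3A" into "%3a"
        parts.query,          # untouched: may hold user-entered values
        parts.fragment,
    ))

print(lowercase_url("HTTP://WWW.Example.COM/Page%3A.PHP?q=Hello+World"))
```

If the result differs from the URL that was requested, that is the cue to send a 301 to the lower-case version.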

Tip 2: If there’s a default value for something, omit it.

Well, you can either omit it or force it – for instance I tend to force the default “www” subdomain onto URLs if it’s missing. I could just as easily force it to be absent. Whatever – choose one, and be consistent.

Check each part of the URL for any default values that can be omitted. For example, your default HTTP port is probably 80. So when someone requests this:
http://www.example.com:80/page.php, it’s the same as requesting
http://www.example.com/page.php. If you receive a request for port 80, you should redirect to the canonical URL with the default port removed.
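
As a rough illustration of the port case, the sketch below rebuilds a URL without its port whenever the port is the default for the protocol. The DEFAULT_PORTS table and the function name are assumptions for this example, and it ignores user/password and IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed defaults; adjust if your server listens elsewhere.
DEFAULT_PORTS = {"http": 80, "https": 443}

def strip_default_port(url):
    """Return url with the port removed when it is the scheme's default."""
    parts = urlsplit(url)
    host = parts.hostname or ""   # note: hostname is already lower-cased
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"   # a non-default port is kept
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(strip_default_port("http://www.example.com:80/page.php"))
```

The same compare-and-redirect pattern applies: if the rebuilt URL is not the one requested, answer with a 301 to the rebuilt one.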

The same default value hunting follows through the rest of the URL parts – the default file name may be “index.php”. The default protocol is http:// … try requesting your canonical URL with and without each of these bits to see if it is:

  1. Canonical, and delivered OK as expected,
  2. Non-canonical, and resolved by your browser, thus doesn’t need any redirection (e.g. most browsers will add the “http” for you, if you omit it.)
  3. Non-canonical, and redirected by your server with a 301 Permanent Redirection status response, or
  4. Erroneous, and handled with a 404 Not Found error page.

Dealing with default values is pretty simple, everywhere except in the querystring. Querystrings are very easy to use badly, and even when they’re used correctly, they can be canonically weak.

Some might argue that canonicalization in the querystring is a non-issue. They suggest that all the search engines always interpret and normalize URLs with querystrings properly, and it doesn’t matter if your URL has some weird shit after the question mark. I reply: Bots are stupid. See tenet #1.

Consider a script that shows paginated data. Page one of the data may have a URL like:
http://www.example.com/page.php?p=1
In this hypothetical situation, our friendly developers have realized that dependence on a pagination variable is assumptive, so they’ve written their code to show page 1, if no “p” variable is supplied. In fact, they show page 1 data whenever p is not a positive integer above 0. The intent is good, and they get marks for writing code that “fails gracefully”, but a better solution would have been to force a 301 redirection to the canonical URL, rather than show the same content without complaint.

The developers don’t tell you these things. You (as SEO) need to go in, look, test, and identify whether the omission or inclusion of a default value affects canonicalization.

First, choose which you’re going to canonicalize: the URL with the value, or the URL without. If you’re having trouble deciding, let me help you – leave it out. Brief URLs are good. And if you always choose the shorter option, you’ll never need to remember what method you used in which situations – so the rule of thumb will be: if the value is default, omit it.
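
The "if the value is default, omit it" rule can be sketched like this, assuming a hypothetical DEFAULTS table for page.php. It only removes exact default values; junk like "p=" or "p=ianring" is a validation problem, not a default-value problem.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical defaults for page.php: page one is implied.
DEFAULTS = {"p": "1"}

def drop_defaults(query, defaults=DEFAULTS):
    """Rebuild the querystring without variables that carry their default value."""
    pairs = parse_qsl(query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if defaults.get(name) != value]
    return urlencode(kept)

print(drop_defaults("p=1"))            # empty string: redirect to /page.php
print(drop_defaults("p=1&sort=name"))  # only the non-default variable survives
```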

Where the default value of querystring variable “p” is “1”, a diligent canonical exterminator will know that all of these are potential canonical dupes, and will check them:

  • http://www.example.com/page.php
  • http://www.example.com/page.php?
  • http://www.example.com/page.php?p=
  • http://www.example.com/page.php?p=1
  • http://www.example.com/page.php?p=1.0
  • http://www.example.com/page.php?p=1.00000
  • http://www.example.com/page.php?p=1.5
  • http://www.example.com/page.php?p=1&p=2
  • http://www.example.com/page.php?p=2&p=1
  • http://www.example.com/page.php?p=-1
  • http://www.example.com/page.php?p=-5
  • http://www.example.com/page.php?p=ianring

Naturally, if your web application uses URL rewriting to camouflage the querystring as a file path, the same issues will need to be verified in whatever parts of the URL they occupy. I see this done badly all the time.

Tip 3: If you’re going to fluff your URL, have the cojones to validate your fluff.

Too commonly, I’ll see URLs like this one:
http://www.example.com/restaurants/46335/jimmys-lunch-hamilton-ontario.php
The “46335” is a dead giveaway. That’s a database row identifier. Everything after that slash is URL fluff, put there to make the file name keywordy. Most of the time, developers are using a loose-ended regular expression to rewrite the URL, and they’re not verifying that “46335” is actually Jimmy’s Lunch in Hamilton. Ha! I guarantee, 9 times out of 10, you could request this, and get the same thing:
http://www.example.com/restaurants/46335/i-can-type-anything-i-want-here.php
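
Validating the fluff might look something like the sketch below. The RESTAURANTS dictionary stands in for a database lookup, and slugify and resolve are hypothetical names; the point is only that the fluff is regenerated from the data and compared, never trusted.

```python
import re

# Hypothetical stand-in for a database lookup by row ID.
RESTAURANTS = {46335: "Jimmy's Lunch, Hamilton, Ontario"}

def slugify(title):
    """Turn a title into lower-case, hyphenated URL fluff."""
    slug = title.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", slug).strip("-")

def resolve(path):
    """Return (status, redirect_target) for a /restaurants/<id>/<fluff>.php path."""
    match = re.fullmatch(r"/restaurants/([0-9]+)/([^/]*)\.php", path)
    if not match:
        return 404, None
    row_id = int(match.group(1))
    name = RESTAURANTS.get(row_id)
    if name is None:
        return 404, None
    canonical = f"/restaurants/{row_id}/{slugify(name)}.php"
    if path != canonical:
        return 301, canonical   # wrong fluff (or a padded ID): redirect
    return 200, None

print(resolve("/restaurants/46335/i-can-type-anything-i-want-here.php"))
```

Because the whole path is compared with the regenerated one, oddities such as a zero-padded row ID get redirected too.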

Note to scraping hobbyists: integer row IDs are lovely because they’re so predictable. When you find a loose end like this one, use a spreadsheet to construct URLs with all the numbers from 1 to a kajillion. Save it as a TXT file and feed the URL list into your favourite downloading program. Parse the HTML you get back, et voila: you have stolen their entire database of stuff.

Note to malicious competitors and hackers: load up your blog spammer with a kajillion permutations of that link, each with a different random phrase as the file name. Stir and wait. Then dance with joy as the bots discover a kajillion different URLs showing the same content on your competitor’s site. Goodbye, organic traffic!

Note to webmasters: don’t make it easy for scraping hobbyists and malicious competitors and hackers. Keep your canonicalization orifices sealed.

Example from a famous website:
http://www.amazon.com/indigo+can+bite+my+anus/dp/1597491543/i+can+write+anything+i+want+here/yes+i+can/you+can+too

And another:
http://www.chapters.indigo.ca/books/amazon-sucks-donkey-balls/9780470170779-item.html

Tip 4: Be diligent with your querystrings

The querystring is where most potential canonical issues reside, and they’re the most difficult ones to fix. The problems you might find in the querystring include:

  1. variables there that aren’t needed
  2. default values being defined
  3. variables not in a canonical sequence
  4. values out of range
  5. lingering state variables

I’ll take them one by one:

Variables there that aren’t needed

Say you have a URL like http://www.example.com/page.php?a=1&b=2
If the value of “b” isn’t needed on this page, it shouldn’t be there.

Every script (I say this confidently) can enumerate what variables it needs to identify a resource. When anything appears in the querystring that isn’t expected or needed, it’s a potential canonical issue.

To fix this, have each script create an array of the variables that are valid for it. Pass that array into a function that iterates through the list, picking those variables out of the querystring to create a new array. Join that new array on “&” and compare it to the original querystring. If the strings are different, reattach the new one and redirect. This routine is called “scrubbing”.

A querystring scrub will remove garbage added by referring links, such as affiliate codes, tracking identifiers, or just garbage: http://www.example.com/page.php?foo=bar
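
A scrubbing routine along those lines could be sketched as follows. The names scrub and scrub_redirect are invented for this example, and the choice that the last value wins when a variable is repeated is an assumption, not a rule.

```python
from urllib.parse import parse_qsl, urlencode

def scrub(query, allowed):
    """Keep only the variables named in allowed, in the order allowed lists them."""
    supplied = dict(parse_qsl(query, keep_blank_values=True))  # repeated names: last wins
    kept = [(name, supplied[name]) for name in allowed if name in supplied]
    return urlencode(kept)

def scrub_redirect(path, query, allowed):
    """Return the URL to 301 to, or None if the querystring is already clean."""
    clean = scrub(query, allowed)
    if clean == query:
        return None
    return f"{path}?{clean}" if clean else path

print(scrub_redirect("/page.php", "foo=bar", ["a", "b"]))
```

Since the clean string is re-encoded before the comparison, a querystring that merely encodes its values differently will also trigger a redirect.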

Some will argue that this is a non-issue, and search engines won’t care if there’s an affiliate tracking code on the URL – they’ll always index things perfectly and intelligently. I reply: Bots are stupid. See tenet #1.

Default values being defined

This is discussed under Tip 2 – when your querystring contains any default value that doesn’t need to be there, remove it.

Variables not in canonical sequence

I’ve had many arguments about this one. To a programmer, there is no difference between these two URLs:
http://www.example.com/page.php?a=1&b=2
http://www.example.com/page.php?b=2&a=1

They both describe the same collection of variables, with their values equally accessible by name. But when compared as strings, the URLs are different. The common (and very reasonable) argument is “it doesn’t matter”. My response is my mantra: Bots are stupid. See tenet #1.

The simplest (and suggested) fix for this problem is to push your querystring collection into an array, sort it alphabetically by name, rejoin the array with “&”, then compare the sorted string to the original. If it changed, reattach the sorted querystring and redirect.
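
In code, the sort-and-compare fix is only a few lines. This is a sketch with made-up function names; repeated variables keep their relative order because Python's sort is stable.

```python
from urllib.parse import parse_qsl, urlencode

def sort_query(query):
    """Return the querystring with its variables sorted alphabetically by name."""
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode(sorted(pairs, key=lambda pair: pair[0]))

def sequence_redirect(path, query):
    """Return the URL to 301 to, or None if the variables are already in order."""
    ordered = sort_query(query)
    return None if ordered == query else f"{path}?{ordered}"

print(sequence_redirect("/page.php", "b=2&a=1"))
```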

Values out of range

This one is nefarious and very common. Revisiting our hypothetical paginated results, imagine that our document has only 13 pages. What is the expected response from these URLs?
http://www.example.com/page.php?page=14 (out of range)
http://www.example.com/page.php?page=500 (way out of range)
http://www.example.com/page.php?page=-1 (negative pages – duh)
http://www.example.com/page.php?page=ianring (not even a positive integer)
http://www.example.com/page.php?page= (undefined)

… I’d expect all of these requests to return a 404 Not Found. If you’re especially kindhearted, you might give the user a 301 sending them to page one, especially with that last undefined one. If the response is anything besides those two, it’s a canonical issue.
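
One way to express that expectation, as a sketch only: the function below picks the "kindhearted" 301 for the undefined value and a 404 for everything else that is out of range. Rejecting zero-padded numbers such as "007" is an extra assumption of this example, and it does not deal with page 1 being a default value (see Tip 2).

```python
def page_status(raw, last_page):
    """Return the HTTP status to send for a raw ?page= value."""
    if raw == "":
        return 301   # undefined: redirect to page one
    if not (raw.isascii() and raw.isdigit()):
        return 404   # "-1", "1.5", "ianring"
    if str(int(raw)) != raw:
        return 404   # assumption: zero-padded numbers are dupes too
    return 200 if 1 <= int(raw) <= last_page else 404

for value in ["13", "14", "500", "-1", "ianring", ""]:
    print(repr(value), page_status(value, last_page=13))
```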

Loose value validation is another instance where you’ve inadvertently created a potentially infinite number of URLs that show the same resource. A competitor could create a kajillion such URLs and send them out as bot food, and those bots could see them all as dupes.

Lingering state variables

These are really annoying.
http://www.example.com/page.php?back=/prevpage.php
The developer thought it would be nice to put the referrer in the querystring, so scripts on the page could send you back where you came from. This is somewhat tolerable on “login” pages, when you’re wrenched away to authenticate, then after signing in, flung back to the page you wanted. But generally it’s a bad idea to put state variables in the querystring. Lots of people do it. Sometimes there’s no alternative.

Or how about this gem:
http://www.example.com/page.php?referrer=Adwords
That last one – the “referrer” – is not needed to identify a resource. It’s a tracking code. Lest some silly bot think that the page with “referrer” is different from the one without, your best practice is to “scrub” the querystring – put your referrer data into a session or database, remove that variable and redirect.
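
A rough sketch of that scrub-to-session step, assuming the session behaves like a dictionary and that "referrer" is the only tracking variable; both are placeholders for whatever your application actually uses.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical list of variables that track state rather than identify a resource.
TRACKING_KEYS = {"referrer"}

def sessionize(path, query, session, tracking_keys=TRACKING_KEYS):
    """Move tracking variables into session; return the URL to 301 to, or None."""
    pairs = parse_qsl(query, keep_blank_values=True)
    kept = []
    for name, value in pairs:
        if name in tracking_keys:
            session[name] = value   # remember it, but keep it out of the URL
        else:
            kept.append((name, value))
    if len(kept) == len(pairs):
        return None                 # nothing to strip
    clean = urlencode(kept)
    return f"{path}?{clean}" if clean else path

session = {}
print(sessionize("/page.php", "referrer=Adwords", session), session)
```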

Exterminating Canonical Errors, Step By Step

Now, let’s get down to the task at hand: non-canonical URL extermination.

You will need:

  • A live website to be audited
  • A spreadsheet program, like Excel
  • A downloading/link checking tool that returns a report of status headers.
  • A Taxonomy of Canonical Errors (below)

Choose any page on the site for which you’re doing a canonicalization audit. Decide upon a canonical URL, keeping in mind the tips lovingly revealed above. Then using the taxonomy below, construct variations of that URL with an assortment of aberrations, all of which you would hope return a 404 Not Found error, or a 301 redirection to your canonical URL. This list will be URLs that are incorrect, but could possibly deliver dupe content. Warning: it could be quite a long list! One that I created recently using computed permutations of 10 different potential canonicalization errors produced a list of about 1700 URLs, after many with invalid syntax were removed.
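
Computing the permutations doesn't need a spreadsheet. The sketch below combines a handful of aberrations from the taxonomy (subdomain, port, case, extra slash, empty and fictitious querystrings) for a canonical URL that has no querystring; the particular aberrations chosen and the function name are just examples, and a real audit would use many more.

```python
from itertools import product

def variants(host, path):
    """Build non-canonical variations of http://{host}{path}."""
    canonical = f"http://{host}{path}"
    other_host = host[4:] if host.startswith("www.") else "www." + host
    hosts = [host, other_host, host + ":80"]   # missing/extra "www", default port
    paths = [path, path.upper(), "/" + path]   # wrong case, extra slash
    tails = ["", "?", "?foo=bar"]              # empty query, fictitious variable
    urls = [f"http://{h}{p}{t}" for h, p, t in product(hosts, paths, tails)]
    # Drop duplicates (keeping order) and the canonical URL itself.
    return [url for url in dict.fromkeys(urls) if url != canonical]

for url in variants("www.example.com", "/page.php"):
    print(url)
```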

Put all your aberrations into a spreadsheet or txt file. Feed this list into a downloading script, and generate a report showing the status header returned from each request.

Interpret the report. None of the URLs should return a 200 OK – they must all return either a 301 to the canonical url, or a 404 if the URL is too ambiguous.
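
If you'd rather script the report than use a link checker, something like the following would do, using only Python's standard library. It is a sketch: sending HEAD requests is an assumption (some servers answer HEAD differently from GET), and the Location header is compared as an exact string. The fetch parameter lets you swap in your own downloader.

```python
import http.client
from urllib.parse import urlsplit

def fetch_status(url):
    """Request url without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_class = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    conn = conn_class(parts.netloc, timeout=10)
    try:
        # Send everything after the host exactly as written, including a bare "?".
        target = url.split(parts.netloc, 1)[1] or "/"
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def audit(urls, canonical, fetch=fetch_status):
    """Return (url, status, location) for every response that is a canonical bug."""
    failures = []
    for url in urls:
        status, location = fetch(url)
        ok = status == 404 or (status == 301 and location == canonical)
        if not ok:
            failures.append((url, status, location))
    return failures
```

Anything that audit returns is a URL that answered with a 200, a temporary redirect, or a 301 to the wrong place.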

You’ll want to repeat this process with a variety of pages on the site. Include the home page, some interior content pages, and especially any pages that have dynamic content, or which rely on a querystring.

I have divided the most common canonical issues into a taxonomy, published for the first time below. Please excuse the bulleted lists; Rae’s CSS seems to override my styling and I can’t be bothered to fix it.

A Taxonomy of Canonicalization Errors

© Ian Ring 2008, published with permission on sugarrae.com

  • 1 extra characters in the URL
    • 1.1 in the subdomain, e.g. “www”
    • 1.2 in the port, e.g. “:80”
    • 1.3 in the file path, see URL fluff
    • 1.4 in the file name, e.g. “index.php”, also see URL fluff
    • 1.5 in the querystring
    • 1.6 in the file path
      • 1.6.1 extra inserted slashes
      • 1.6.2 dotted modifiers, e.g. “x/y/z/../../a/b/../../../page.php”
    • 1.7 in the querystring
      • 1.7.1 A “?” with no query after it
      • 1.7.2 More than one “?”
      • 1.7.3 Extra “&” characters
      • 1.7.4 fictitious querystring variables
  • 2 missing characters in the URL
    • 2.1 in the file path
      • 2.1.1 missing trailing slash on path
    • 2.2 in the querystring
      • 2.2.1 undeclared and undefined variables (e.g. “?a=” or “?=2”)
  • 3 modified/incorrect characters in the URL
    • 3.1 in the subdomain
      • 3.1.1 see if “blog.example.com” delivers the same content as “example.com/blog”
    • 3.2 in the domain
      • 3.2.1 using IP address instead of the domain name
    • 3.3 in the file path
      • 3.3.1 extended characters, and case sensitivity
      • 3.3.2 inconsistent use of “+” and “%20”
    • 3.4 in the file name
      • 3.4.1 badly mapped extensions, e.g. *.php, *.htm, *.html all mapped to the same script
      • 3.4.2 extended characters, and case sensitivity
      • 3.4.3 inconsistent use of “+” and “%20”
    • 3.5 in the querystring
      • 3.5.1 variables not in canonical sequence
      • 3.5.2 out of range querystring values
      • 3.5.3 lingering state variables
      • 3.5.4 inconsistent use of “+” and “%20”

If you use this list to find canonicalization problems, you’ve done a good audit. The next step is to bring your findings back to the developers and webmaster, and get them to fix the problems. If the webmaster is serious about canonicalization, it behooves them to build a URL normalizer – a script that runs on each page load looking for known patterns and rewriting, redirecting, scrubbing, sessionizing, and acting appropriately. Many of these issues can be solved with regular expressions in a rewrite module such as Apache’s mod_rewrite (configured in an .htaccess file) or ISAPI Rewrite for IIS. Others require diligent refactoring in web applications for value validation to throw exceptions to a canonicalization routine. The construction of that URL normalizer is outside the scope of guestwhore duties.

If your website is properly guarded against canonicalization errors, I should be able to throw any messed-up URL at it and it will behave predictably and appropriately. And know that if your SEO expert doesn’t find these problems, your visitors (including bots) certainly will.

Happy normalizing!

Now go watch me put mentos up my nose and immerse my head in Diet Coke.

About the Author

Ian Ring

I was abandoned in the jungle in India as a baby and raised by a family of wolves. As a young boy I befriended a goofy singing bear and a sarcastic jaguar. My enemies included a huge boa constrictor with hypnotic eyes and a man-eating tiger. Eventually I left the wolf pack and traveled to the man-village, where I learned HTML and web development.

{ 13 comments… read them below or add one }

1 Lisa Barone February 22, 2008 at 12:49 pm

Woah. Who told Ian he could write something useful? Didn’t he get the “meaningless rant” memo?

2 Search Engine Optimization New York February 22, 2008 at 1:56 pm

Holy cannoli that’s a thorough guide! Nice work Ian.

3 Adrienne Doss February 22, 2008 at 7:28 pm

Honestly, this makes my brain hurt. But in a good way. I’m going to have to read it a few more times before it sinks in.

4 Rhea Drysdale (rdrysdale) February 23, 2008 at 3:20 pm

Alright, I finally had the time to absorb this. Nice auditing process. It takes a little while, but clearly doesn’t need to be done on a weekly basis, so it’s well worth the setup. My problem has always been understanding what’s wrong and not having the technical know-how to fix it. Slowly getting better on that front. Thanks for the referrals. We’re working with ISAPI rewrite now to counter some issues.

5 Rae Hoffman February 24, 2008 at 3:57 pm

Ian… wow… that was a lot of info…

1. Never post in dual categories on the blog :)
2. You weren’t supposed to TEACH anyone anything.

Thanks and awesome job dude. ;)

6 Ian Ring February 25, 2008 at 2:51 am

thanks Rae
a lot of material was omitted from this post… watch for more installments like this one on ianring.com and at webmasterworld in the coming year. I consider it a compliment to be the only guestwhore who got dugg. :)

7 Soeren Sprogoe February 25, 2008 at 4:46 pm

One thing that surprised me, when I originally learned about c14n, was that Google actually treats yourdomain.com and yourdomain.com/default.aspx as two completely different pages!

So page value will be divided out over the two, but one of them will (most likely) be marked as dupe content and be removed from the index!

So I can definitely confirm that you need to be consistent when doing (internal) linking:
- Always use lower case.
- Always use http://www. (or don’t).
- Always link to your frontpage with either / or default.aspx (or index.php, depending on your choice of technology).

8 httpwebwitch February 25, 2008 at 6:29 pm

Yes Soeren. I didn’t delve into the “why” of C14N in this post, since that info is widely available elsewhere, but that’s the crux: C14N prevents double-indexing and supplemental problems in the SERPs. Taken further, the same techniques you use for normalizing can catch and redirect IBLs that include common typos, odd punctuation or other abnormalities.

I also neglected the “how”, as in “how to fix it”. I only covered the “what” here, which apologetically only helps you identify the problems, it doesn’t offer solutions. Nonetheless I think this post is a helpful guide, if only for the first half of the journey.

Cheers, hww

9 dan February 25, 2008 at 6:51 pm

I’ve recently been working on a site that uses a lot of “URL Fluffing” (great term) as in Tip #3.

Fixing this I guess would just be a simple rewrite rule. Unfortunately rewrite rules are never simple for me.

Using the given example C14N URL of
http://www.example.com/restaurants/46335/jimmys-lunch-hamilton-ontario.php

would the rewrite rule just be
RewriteRule http://www.example.com/restaurants/(.*)/.* http://www.example.com/restaurants/46335/jimmys-lunch-hamilton-ontario.php [R=301,L]

I believe such an approach would require a separate rule for each URL, but I guess you have to maintain your fluff somewhere.

10 Chad Ledford February 26, 2008 at 12:14 pm

Great post on canonicalization. Digg has actually just fixed one of their issues (http://www.3tailer.com/sundry/digg-implements-the-1000000-idea-non-www-301-redirect)

11 httpwebwitch February 26, 2008 at 7:44 pm

@dan: the rewriterule for my example is something more like

RewriteRule /restaurants/([0-9]+)/.* /r.php?id=$1

Validating your fluff is usually not handled with rewriting rules; they’re accomplished in the code since you need to compare the “fluff” with some actual data associated with a RowID.

BTW in Wordpress they call that a “post slug” used to create “pretty permalinks”

I presume your fluff will be some modified and hyphenated version of a title or name in the data.
Once you’ve figured out which restaurant is #46335, parse the URL and see if your fluff is an exact match for the string you expect to be associated with that data row.

done properly, your rewriterule becomes:

RewriteRule /restaurants/([0-9]+)/(.*) /r.php?id=$1&fluff=$2

create a slug and compare it to your fluff. If they’re not identical, you’ve already got your slug so redirect to a new URL using that.

the rewriterules giveth, and the code taketh away

12 Mike Riley October 4, 2008 at 3:46 am

I usually find that the best way to handle the problem of Canonicalization, as well as a slew of others, is to use a framework that just rewrites all URLs to a single file and then routes them. Ala ExpressionEngine or CodeIgniter (or Ruby!). The real beauty of doing this is that you don’t really need to ever screw around with regular expressions, and definitely don’t need to mess with the .htaccess or apache directives in order to reconfigure your URL rules, everything is defined in the same language you’re using to actually generate the pages. It really makes for a much smoother way of handling this.

13 httpwebwitch January 19, 2009 at 3:27 pm

This post has been nominated for a SEMMY
http://www.semmys.org/2009/search-tech-all-2009-nominees/
