So, I’ve met some of the people from the Sitemaps team a few times and when I was in Boston I had a chance to grab lunch with two of the Sitemaps engineers where I shamelessly asked for an interview and they agreed.
I wanted to put a different spin on the usual Sitemaps coverage by looking at what about Sitemaps is different for, and specific to, bloggers. Of course, I asked some general questions to be nosy. ;-) Read on to find out what the differences are for bloggers, whether using Sitemaps can hurt your indexing (as some have claimed) and why Fiesta Giles is hanging out at the Seattle Google offices.
Q&A with Google Sitemap Engineer Vanessa Fox
RAE: Thanks a lot for doing this interview. I know from meeting you in person several times now that you, as well as your team, are dedicated to making Sitemaps the best product possible. Let’s start off by assuming not everyone knows what Google Sitemaps is all about. Why was the Sitemaps project created, how does using it help individual site owners as well as Google, and how do you see Sitemaps developing over the next year or so?
VF: Google Sitemaps helps us communicate better with webmasters. Overall, this two-way communication between Google and webmasters can help improve sites across the web and make them more accessible through Google, which ultimately helps Google users get the best search results.
This service was launched just over a year ago as a way for webmasters to tell us about all the pages on their sites. Webmasters can submit all their URLs to the Google index using the XML-based Sitemaps Protocol or by submitting a simple text file, RSS or Atom feed. The first step to getting your site’s pages into the Google index is to make sure that the Googlebot can crawl your pages. Sitemaps is an easy way to make sure Google knows about all your pages right away and that those pages are crawler-friendly. This is especially useful for new sites that don’t have a lot of external links pointing to them or for sites with pages that are difficult to find through regular crawling (for instance, pages that are dynamically generated). Without a Sitemap, we wouldn’t know about these pages until we crawled links to them.
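For readers who haven’t seen one, a minimal Sitemap in the XML protocol looks roughly like this. The URLs and dates are made up for illustration, and the namespace shown is the Google 0.84 schema in use around the time of this interview:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/blog/</loc>
    <lastmod>2006-06-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/blog/first-post.html</loc>
    <lastmod>2006-05-28</lastmod>
  </url>
</urlset>
```

Only the `<loc>` element is required for each entry; the other tags are optional hints. A plain text file with one URL per line, or an existing RSS or Atom feed, works as a Sitemap as well.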
Since our launch last year, we’ve expanded the service to 17 languages. We’ve also added detailed and helpful site statistics (such as the top Google search queries driving traffic to your site) and troubleshooting tools to help webmasters make their pages more crawler-friendly. We are regularly adding new features. Some examples of recent features include:
- Robots.txt analysis: Quickly see if you’ve accidentally blocked your home page with your robots.txt file and test your robots.txt file against Google’s user agents.
- Crawl errors: See what pages Googlebot had difficulty crawling. You can see at a glance if you’ve accidentally blocked your home page or if you have pages that redirect to themselves.
- Common words report: See what words we most commonly find on your site and what words external sites most commonly use to link to you.
We spend a lot of time looking into what information we can provide that would help webmasters, and we listen to the feedback we get. In the next year, you can expect us to continue to do that and launch features in response to user needs.
RAE: Specifically, how does Sitemaps help a blog owner in general as well as a blog owner vs. a “regular” webmaster?
VF: Blogs change regularly. Bloggers can submit their RSS or Atom feeds as Sitemaps and as that feed changes, we’ll pick it up to learn about the latest pages. A blogger doesn’t need to resubmit the feed when it changes; we regularly rescan the feeds automatically.
Our information on search queries may be particularly interesting to bloggers. You may not realize, for instance, that some obscure post you wrote is bringing you a lot of traffic or that a particular post comes up in search results a lot, but no one clicks on it. You could use those opportunities to write more about what people are interested in.
The crawl errors we report are also very useful for bloggers and other webmasters. Many times, pages from a site aren’t in the index because we weren’t able to crawl them. With Google Sitemaps, we report the specific URL we couldn’t crawl and the reason (including the HTTP status code we got from the server, if applicable). Some reasons we can’t crawl pages include:
- There are too many redirects in the redirect chain or the page redirects back to itself (this can sometimes happen with a chain of redirects)
- We receive a network error or server timeout
- The server doesn’t allow us to access the page (because the page is blocked or password-protected)
- The server returned a 404 (page not found)
- The page is blocked by the robots.txt file
The robots.txt analysis tool is handy. You can see the status of the file, test what is blocked and allowed, and test changes to the file. A lot of people are hesitant to use a robots.txt file, and this tool takes the guesswork out of it.
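As a sketch of the kind of file the analysis tool helps you sanity-check (the paths here are hypothetical), a blogger might block a drafts folder while leaving everything else crawlable:

```text
# Rules for Google's crawler specifically
User-agent: Googlebot
Disallow: /drafts/

# Rules for all other crawlers
User-agent: *
Disallow: /drafts/
Disallow: /cgi-bin/
```

A single stray rule like `Disallow: /` under `User-agent: *` would block the whole site, which is exactly the kind of accident the tool is designed to catch before it costs you indexing.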
RAE: When we talked at Boston PubCon, I expressed some difficulties I was having with verification on a site or two. You had mentioned at that time that you were coming out with meta tag verification to help blog owners using public subdomains (like http://Sitemaps.blogspot.com) as well as those having verification problems for other reasons. Can you give a bit of detail on that for those who may not have heard the news?
VF: For your privacy, you have to verify that you own a site before we show you most of the information available in Google Sitemaps. Our original verification method asked you to upload a file with a unique name to your site. However, not all webmasters can upload a file. Many bloggers, for instance, use web-based blogging software that doesn’t allow this. We now provide a second option for site verification that asks you to place a unique meta tag in the index page of your site. We chose this method because most webmasters who can’t upload a file can edit their index page, and we were able to implement it in a secure way. We have many checks in place to ensure that no one can claim ownership of your site. For instance, you can’t post a comment that contains the meta tag on someone’s blog to verify site ownership.
The meta tag we ask for is unique for each webmaster and must be placed in the &lt;head&gt; section of the page. The page can include only one &lt;head&gt; section and it must be above the &lt;body&gt; section.
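Concretely, placement looks something like this. The meta tag name and token value below are illustrative only; Sitemaps generates a unique value for each account:

```html
<html>
  <head>
    <title>My Blog</title>
    <!-- the unique verification tag goes inside the head, before the body -->
    <meta name="verify-v1" content="UNIQUE-TOKEN-FROM-SITEMAPS" />
  </head>
  <body>
    <!-- blog content -->
  </body>
</html>
```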
RAE: Are there any special precautions or things to be done if a blog owner has their blog residing on a subdomain or in a folder off the root directory (like http://mattcutts.com/blog/) as far as Sitemaps goes?
VF: You can see the greatest variety of information for sites at the root of the domain, so I would suggest that blog owners with blogs in a folder add both the root domain and the subfolder as sites in their accounts. Once you verify the root domain, the subfolder is verified automatically, so you don’t need to verify it separately. Adding both lets you see all the information that’s only available for root-level domains, but also lets you see specific stats and errors for your blog folder.
If you submit a Sitemap, keep in mind that it can contain URLs only at the Sitemap location and lower. So, for instance, if you have a Sitemap in your /blog/ folder, it can’t list URLs in the root directory. If you submit a syndication feed, this won’t be a problem, since the feed won’t reference URLs that aren’t at the feed location or below, but if you want to submit a Sitemap that uses the protocol and you want it to list all the URLs in your site (not just your blog), you should place that Sitemap in the root directory.
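To illustrate the scoping rule with hypothetical paths:

```text
http://example.com/blog/sitemap.xml
  can list:    http://example.com/blog/archive/post.html
  cannot list: http://example.com/about.html

http://example.com/sitemap.xml
  can list any URL on example.com, including everything under /blog/
```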
RAE: So, you’ve submitted to Sitemaps and have verified – what can a blogger expect at that point? What will Sitemaps help them learn about their blog? What statistics are the most useful and do you have plans to add more with a specific thought of bloggers in mind?
VF: Once you’ve verified, you’ll see the Diagnostic summary page, which gives you a quick snapshot of your site. This page tells you if pages from your site are indexed, when we last successfully crawled your home page, if your site has violated the webmaster guidelines, if we aren’t able to crawl your home page, and if we experienced a large number of errors crawling your site.
From the Diagnostic tab, you can also view specific crawl errors and try out the robots.txt analysis tool. The Statistics tab shows you the query stats I was talking about earlier. You can see search queries that returned your site in the search results, and what the average top position was for your site for each query. The crawl stats show you a quick view of the percentage of your site crawled successfully and the PageRank distribution of your pages. It also lists the page on your site with the highest PageRank, by month. Page analysis shows you how Googlebot views your site: the content type, the encodings, the common words on your site, and the words commonly used to link to your site. Index stats are handy links to advanced operator queries that many webmasters use regularly.
RAE: In your late April update, you introduced a feature that would notify webmasters of violations in Google’s guidelines (most of them anyway). As I’m sure you know, it was recently reported that you must admit to wrongdoing to submit a re-inclusion request. I’m assuming there is a reason behind it and wanted to know if you’d care to share why Google Sitemaps is doing this (if they are) and what you’d suggest to webmasters and bloggers who are not willing to admit to wrongdoing, but have corrected any “issues” and are ready to request re-inclusion.
VF: Webmasters and bloggers should only use this form if their sites have violated the webmaster guidelines and they’ve fixed the issues. We say that as the first sentence of the form: “Complete this form if you reviewed your site, found that it violated our webmaster guidelines, and you have made changes to your site so that it adheres to the guidelines.” The rest of the form simply confirms that. Webmasters with sites that haven’t violated the guidelines shouldn’t use the form.
In the past, webmasters whose sites were blocked from the index for violating the guidelines had to know to send Google an email and guess at what information would be most helpful for those evaluating the site for reinclusion. Webmasters who read Matt Cutts’ blog knew that, but not everyone did. This form makes filing a reinclusion request easy. It explains exactly what information is needed for a reinclusion request.
Sometimes, webmasters think their sites are blocked when they haven’t violated the guidelines. They think they need to request reinclusion, but then hesitate because they see that it means admitting to violating the guidelines (and they haven’t). There’s no need for these webmasters to fill out this form. Instead, look at other issues the site may be having. Check the crawl errors to see if we’re having trouble crawling the pages. Use the robots.txt analysis tool to make sure you’re allowing access to your pages. If the trouble is that the site isn’t ranking for particular keywords, check the common words on the site and in external links to the site. Are they the words you want to rank for?
We’re always looking at ways we can improve communication with webmasters, and as we’ve received a lot of feedback on this issue, this is one area we are looking at.
RAE: Several of the more recent posts on the Sitemaps blog have been aimed at answering questions relating to affiliate site content, site queries and canonical issues – how involved is the Sitemaps team with the algorithmic side of things and spam prevention issues?
VF: The Sitemaps team works specifically on the Sitemaps project, but we work closely with all the teams involved with organic search for every feature we launch.
RAE: There have been several rumblings on the Sitemaps Google group about submitting a Sitemap having some correlation to site pages dropping from the index. I know from years in this industry that speculation isn’t always fact. Is there any risk to a website or blog who submits or verifies with Sitemaps?
VF: No, participating in the Sitemaps program (either by submitting a Sitemap or verifying ownership of a site) can’t harm your site in any way. One of the goals of Sitemaps is to help webmasters improve their sites’ coverage in the index. While sites may see fluctuations as their sites change, as new sites are added to the web, and as our algorithms change, we hope the tools we offer help webmasters through these changes.
RAE: Stories of the Google Campus are passed around a lot by those who have attended the Google Dance at the campus each year during San Jose SES as well as those who have been there for business reasons during a regular workday. (It certainly has the best “cafeteria” I’ve ever seen by a longshot.) Sitemaps is based in the Seattle office – are they keeping you guys comfortable there?
VF: Google works very hard to maintain its culture throughout all the offices, so while I definitely take advantage of the cafes when I’m in Mountain View, Seattle Googlers are very well taken care of, with massages, free food, foosball tables, and many of the other great benefits that all Googlers receive.
RAE: Ok, you knew I was getting to this one. ;-) I request a picture from each interviewee and rumor has it that you’re possibly a bigger Buffy fan than I am and that your desk proves it. Care to share?
VF: I don’t have all my Buffy stuff here, but Fiesta Giles with a Chainsaw is my favorite, so I have to keep him around. I think that’s season one Buffy. She has a heart pattern on her pants. I don’t know what she was thinking.
(Thanks to Vanessa for the pic of Fiesta Giles and Buffy)
RAE: Thanks a lot for taking the time to answer some questions for bloggers of the world who haven’t had time to keep up with the latest and greatest happenings with Google Sitemaps. It’s a busy world in the blogosphere these days and it’s nice to have the advantages specifically for bloggers (be they personal or professional) summed up.
Note: Since these interview questions were initially answered, Sitemaps has released some new features that may be of particular interest to bloggers. Among them: query stats are now shown at the subdirectory level, so if your blog is at mattcutts.com/blog, you can add that as a site and see query stats for that directory. Hopefully I can get Vanessa to comment on that a little more in the comments here. ;-)