Rae Hoffman

Googlebot Gone Wild on WordPress Backend Files

by Rae Hoffman on September 17, 2007 | General Marketing Babble

Lately I’ve been noticing a problem with how Google is indexing several of the blogs I run. Granted, I’m not a big one for running WordPress, as most of my sites run on a custom-built CMS, but I’ve been running a few for quite a while and what I’m seeing lately in their indexing seems odd to me.

The problem is that Googlebot has begun indexing the backend files of the blog. Some example urls I have begun finding indexed (and yes, I checked to make sure they had no inbound links and according to Yahoo, they don’t) are:

/wp-content/plugins/webprof-delicious/webprof-delicious2.php
/wp-content/cache/wp-cache-cdecddb6bcc0b95b96cdcb347.meta
/wp-content/cache/wp-cache-cdecddb6bcc0b9b4dcdcb3427.html
/wp-content/themes/default/

And all those urls have some great title and description tags like:

Fatal error: Call to undefined function: get_header() in /home …
Fatal error: Call to undefined function: get_header() in /home/username/html/wp-content/themes/default/index.php on line 1.

Warning: main(ABSPATHWPINC/rss-functions.php) [function.main ...
Warning: main(ABSPATHWPINC/rss-functions.php) [function.main]: failed to open stream: No such file or directory in …

The indexed cache pages of course have the same titles, descriptions *and* content as their non-cached counterparts. But, hey, there is no duplicate content penalty, right? [sarcasm]if my pages can’t be found as a result of it, you can call it a penalty or a “spoonful of love” – it has the same meaning to me[/sarcasm]

1. What the hell is Google doing crawling these pages and files?

2. Anyone else seeing this (there are millions of pages indexed, so someone else must be) and if so, for how long have you been seeing it?

3. If by some chance, this was coming from toolbar data, wouldn’t they think to set up an exclusion for one of the largest blog platforms in existence to not have these pages indexed and wasting space?

4. I’m not seeing the toolbar as an explanation in all honesty because I’ve never visited some of these urls in the backend – they’re not even “navigatable” … so how the hell is Google “finding them” to begin with, considering the whole point of “crawling” for urls?


20 comments

1 graywolf September 17, 2007 at 3:10 pm

when you leave crap like this open is it any wonder it gets crawled

http://www.sugarrae.com/wp-content/plugins/

(I am so catching a beating for that)

Add this to your htaccess file

Options -Indexes

2 Rhea Drysdale September 17, 2007 at 3:18 pm

This probably isn’t very helpful, but I’m not seeing the same thing. When did you first notice this? I doubt my sites are getting crawled as often or as deeply as yours. It might just be a matter of time. Very odd though. Google should definitely have a way to block those standard WordPress folders.

3 Rhea September 17, 2007 at 3:31 pm

Good advice from Michael… as I run off copying it.

4 Rae Hoffman September 17, 2007 at 3:40 pm

I get what you’re saying MG… but the whole point of “crawling” is to follow links… why would I block every subdirectory on my site when, in theory, Google shouldn’t be crawling anything I haven’t linked to? And what about those regular people running websites without professional help? You’re supposed to know to block a file you may not even know exists? (such as a cgi-bin file, as an example)… I know why they crawl a URL if left open – my question is how they’re deciding to crawl random URLs that aren’t linked to.

5 Rae Hoffman September 17, 2007 at 3:50 pm

Oh, and thanks for linking to it on THIS site, which did NOT have a problem… you ARE getting kicked MG :P

6 Vanessa Fox September 17, 2007 at 5:13 pm

Any chance the site has a wordpress sitemap plugin installed that’s adding those urls to the sitemap file?

7 graywolf September 17, 2007 at 5:29 pm

de link it, it was purely for example purposes. Was it on the site that had the problem of feeds being crawled as well?

we both know they don’t get url’s to crawl just from on page links anymore :-P

8 Rae Hoffman September 18, 2007 at 9:15 am

Vanessa, no on the sitemap plugin… Gray, it was on the site that I’m calling you in five for debugging help on… so you should know which one it is now ;-)

9 Elixir Blogger September 19, 2007 at 7:41 pm

Since I’m not involved in the tech side here, I’ll definitely pass this information on. Thanks for the heads up.

10 Teli September 20, 2007 at 12:01 pm

Rae, you raise a damn fine point, and I couldn’t tell you why G-bot would crawl the directories because they’re not even linked in the head meta — only direct files.

One solution I use across my blogs is simply to install my WordPress files (i.e. the admin stuff) in its own folder and block access via robots.txt — and in some cases, password protection.

The suggestion about denying directory browsing would also work.

~ Teli

11 Jeff O'Hara September 21, 2007 at 12:52 pm

Just an FYI: I am not an SEO person, nor do I ever aspire to be one. I am a technologist. I hate to burst your bubble, but Google indexing the wp-content directory is perfectly acceptable: it is an unprotected “content” directory for uploaded user files. It is not password protected, and if you look at what these directories contain, it could be PDFs, txt files, images, plugins, cache, etc. I would be worried when they start crawling your wp-admin directory and caching it. If you do not want Google to cache this directory, put it in your robots.txt file.

-Jeff O’Hara
http://blog.zemote.com
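
As a rough illustration of the robots.txt suggestion in the comments above, here is a minimal Python sketch that checks which paths a given set of rules would block. The rules are an assumption built from the example URLs in the post (nobody in the thread posted their actual file), and `is_crawl_allowed` and the sample post path are hypothetical names used only for illustration. It relies on the standard library's `urllib.robotparser`, which does simple prefix matching and may not mirror Googlebot's behaviour in every detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules. The directory list is an assumption
# based on the example URLs in the post, not anyone's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
"""


def is_crawl_allowed(path, user_agent="Googlebot"):
    """Return True if the rules above let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in (
        "/wp-content/plugins/webprof-delicious/webprof-delicious2.php",
        "/wp-content/themes/default/",
        "/2007/09/some-post/",  # hypothetical public post URL
    ):
        print(path, "allowed" if is_crawl_allowed(path) else "blocked")
```

Keep in mind that a rule like this only asks crawlers not to fetch the paths. It does not, on its own, take already-indexed URLs out of the index.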

12 Rae Hoffman September 21, 2007 at 1:17 pm

I know I didn’t have the file blocked… my point was more about how since there were no links to these sections, in theory, google should not be “crawling” them – since there was no branch to “crawl” to reach them. :)

13 deep.thought September 26, 2007 at 9:36 am

Wouldn’t the links be in the header, where the Javascript is called? Surely once mr. bot is in the folder he would be compelled to look around.

That would explain the calls to undefined functions listed next to the index entries.

Just a thought.

14 Joost de Valk October 1, 2007 at 5:20 am

Didn’t you link to an uploaded file from a post for instance? I’ve seen Google “try” directories before, where they had found a link to example.com/files/file1.pdf, and started crawling example.com/files/ too as a result of it…

15 Mr. Gunn October 2, 2007 at 1:35 pm

I just recently noticed this myself. In my case, I was generating a sitemap from my server logs, so the URLs got included that way.

I wrote a list of google sitemap filter expressions to clean that stuff out, which you’re welcome to steal.
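
The filter list itself is not reproduced here, so the following is only a sketch of the general idea, assuming the sitemap is built from a list of URL paths pulled from server logs. The patterns and the `filter_sitemap_urls` name are hypothetical, not the commenter's actual expressions.

```python
import re

# Hypothetical exclusion patterns for WordPress backend paths.
# These are illustrative and are NOT the filter list mentioned above.
EXCLUDE_PATTERNS = [
    re.compile(r"^/wp-content/(plugins|cache|themes)/"),
    re.compile(r"^/wp-(admin|includes)/"),
]


def filter_sitemap_urls(paths):
    """Drop any path that matches one of the exclusion patterns."""
    return [
        path
        for path in paths
        if not any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)
    ]


if __name__ == "__main__":
    logged_paths = [
        "/",
        "/2007/09/some-post/",  # hypothetical public post URL
        "/wp-content/themes/default/",
        "/wp-content/cache/wp-cache-cdecddb6bcc0b95b96cdcb347.meta",
        "/wp-admin/index.php",
    ]
    for path in filter_sitemap_urls(logged_paths):
        print(path)
```

Running the filter before writing the sitemap keeps backend paths that merely showed up in the logs from being submitted as if they were real content.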

16 Lars Bachmann October 15, 2007 at 5:28 pm

WordPress does have a serious problem with directories that are not linked to getting indexed. I don’t know why.
But the easiest thing must be to fix it with a robots.txt or .htaccess file.
Great blog, by the way.

17 Matt November 9, 2007 at 7:17 pm

I have seen something similar with backlinks. I posted a comment on one of the blogs with a link to a website. A few days later, when I checked the backlinks, I discovered that the page had links from very strange, non-existent files originating at that blog. It was all gone after a couple of days. ??

18 Swaminathan Viramani March 14, 2008 at 4:03 pm

Rae

You should block googlebot from visiting your site. I have experienced they play dirty, and seem to gather a lot of stuff from your sites, even things that are not supposed to be gathered.

(by the way, did you know that when you see no pages indexed using the site:domain.com search, it does not mean pages are not indexed. I found this by accident when my stats showed hits on a particular page. When I checked out with the keyword for that page, the page was showing in Google. How is that for the sneaky tricks SEs play?)

Live on other means of traffic, girl. Google is only part of it.

Good luck

Swaminathan

19 mavi September 16, 2009 at 8:14 am

I’ve the same problem.

So how do we remove the files Google has already indexed? I have added the plugin folders to robots.txt, but it is not possible to add a noindex meta tag to all these plugins. robots.txt does not remove the already-indexed files, however.

What else can we do to remove them quickly? Any advice?
Thanks

20 Rae Hoffman September 16, 2009 at 9:54 am

Hey mavi – get a Webmaster Central account and request removal of the blocked directories through there… should get it cleaned up within a few days.
