I’ve been noticing a problem lately with how Google indexes several of the blogs I run. Granted, I’m not a big one for running WordPress, as most of my sites run on a custom-built CMS, but I’ve been running a few WordPress blogs for quite a while, and what I’m seeing lately in their indexing seems odd to me.
The problem is that Googlebot has begun indexing the backend files of the blog. Some example urls I have begun finding indexed (and yes, I checked to make sure they had no inbound links and according to Yahoo, they don’t) are:
/wp-content/plugins/webprof-delicious/webprof-delicious2.php
/wp-content/cache/wp-cache-cdecddb6bcc0b95b96cdcb347.meta
/wp-content/cache/wp-cache-cdecddb6bcc0b9b4dcdcb3427.html
/wp-content/themes/default/
And all those urls have some great title and description tags like:
Fatal error: Call to undefined function: get_header() in /home …
Fatal error: Call to undefined function: get_header() in /home/username/html/wp-content/themes/default/index.php on line 1.
Warning: main(ABSPATHWPINC/rss-functions.php) [function.main ...
Warning: main(ABSPATHWPINC/rss-functions.php) [function.main]: failed to open stream: No such file or directory in …
The indexed cache pages, of course, have the same titles, descriptions *and* content as their non-cached counterparts. But, hey, there is no duplicate content penalty, right? [sarcasm]If my pages can’t be found as a result of it, you can call it a penalty or a “spoonful of love” - it has the same meaning to me[/sarcasm]
1. What the hell is Google doing crawling these pages and files?
2. Anyone else seeing this (there are millions of pages indexed, so someone else must be) and if so, for how long have you been seeing it?
3. If by some chance, this was coming from toolbar data, wouldn’t they think to set up an exclusion for one of the largest blog platforms in existence to not have these pages indexed and wasting space?
4. I’m not seeing the toolbar as an explanation in all honesty because I’ve never visited some of these urls in the backend - they’re not even “navigatable” … so how the hell is Google “finding them” to begin with, considering the whole point of “crawling” for urls?
{ 18 comments… read them below or add one }
When you leave crap like this open, is it any wonder it gets crawled?
http://www.sugarrae.com/wp-content/plugins/
(I am so catching a beating for that)
Add this to your .htaccess file:
Options -Indexes
This probably isn’t very helpful, but I’m not seeing the same thing. When did you first notice this? I doubt my sites are getting crawled as often or as deeply as yours. It might just be a matter of time. Very odd though. Google should definitely have a way to block those standard WordPress folders.
Good advice from Michael… as I run off copying it.
I get what you’re saying MG… but the whole point of “crawling” is to follow links… why would I block every subdirectory on my site when, in theory, Google shouldn’t be crawling anything I haven’t linked to? And what about those regular people running websites without professional help? You’re supposed to know to block a file you may not even know exists (such as a cgi-bin file, as an example)? I know why they crawl a URL if it's left open - my question is how they’re deciding to crawl random URLs that aren't linked to.
Oh, and thanks for linking to it on THIS site, which did NOT have a problem… you ARE getting kicked MG :P
Any chance the site has a wordpress sitemap plugin installed that’s adding those urls to the sitemap file?
De-link it, it was purely for example purposes. Was it on the site that had the problem of feeds being crawled as well?
We both know they don’t get URLs to crawl just from on-page links anymore :-P
Vanessa, no on the sitemap plugin… Gray, it was on the site that I’m calling you in five for debugging help on… so you should know which one it is now ;-)
Since I’m not involved in the tech side here, I’ll definitely pass this information on. Thanks for the heads up.
Rae, you raise a damn fine point, and I couldn’t tell you why G-bot would crawl the directories, because they’re not even linked in the head meta, only direct files are.
One solution I use across my blogs is simply to install my WordPress files (i.e. the admin stuff) in its own folder and block access via robots.txt — and in some cases, password protection.
The suggestion about denying directory browsing would also work.
~ Teli
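For anyone going the robots.txt route suggested above, here is a minimal sketch of how you might sanity-check a set of rules before uploading them. It uses Python's standard `urllib.robotparser`; the rules themselves, the `is_blocked` helper and the `example.com` host are assumptions for illustration, not anything taken from the sites in this post. It only tells you what a well-behaved crawler that honors robots.txt should skip.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules. The directory names are the WordPress
# defaults that show up in this post; adjust them to your own install.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
"""


def is_blocked(path, user_agent="Googlebot", rules=ROBOTS_TXT):
    """Return True if the rules forbid user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return not parser.can_fetch(user_agent, "http://example.com" + path)


if __name__ == "__main__":
    for path in (
        "/wp-content/plugins/webprof-delicious/webprof-delicious2.php",
        "/wp-content/cache/wp-cache-cdecddb6bcc0b95b96cdcb347.meta",
        "/some-post/",  # hypothetical public URL that should stay crawlable
    ):
        print(path, "blocked" if is_blocked(path) else "allowed")
```

The theme directory is left out of the rules on purpose in this sketch, since stylesheets usually live there; whether to block it is a judgment call for your own site.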
Just an FYI: I am not an SEO person, nor do I ever aspire to be one. I am a technologist. I hate to burst your bubble, but Google indexing the wp-content directory is perfectly acceptable; it is an unprotected “content” directory for uploaded user files. It is not password protected, and if you look at what is contained in these directories, it could be PDFs, txt files, images, plugins, cache, etc. I would be worried when they start crawling your wp-admin directory and caching it. If you do not want Google to cache this directory, put it in your robots.txt file.
-Jeff O’Hara
http://blog.zemote.com
I know I didn’t have the file blocked… my point was more that, since there were no links to these sections, in theory Google should not be “crawling” them - there was no branch to “crawl” to reach them. :)
Wouldn’t the links be in the header, where the Javascript is called? Surely once mr. bot is in the folder he would be compelled to look around.
That would explain the calls to undefined functions listed next to the index entries.
Just a thought.
Didn’t you link to an uploaded file from a post for instance? I’ve seen Google “try” directories before, where they had found a link to example.com/files/file1.pdf, and started crawling example.com/files/ too as a result of it…
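To make the "trying directories" idea concrete, here is a small sketch of how any crawler *could* derive parent directories from a single known file URL. This is purely an illustration of the hypothesis in the comment above, not a description of how Googlebot actually works; the function name `parent_directories` is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_directories(url):
    """List every ancestor directory URL of url, deepest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Walk upward: drop the last segment, then the one before it, and so on
    # until only the site root is left.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        )
    return candidates


if __name__ == "__main__":
    for candidate in parent_directories("http://example.com/files/file1.pdf"):
        print(candidate)
```

If a crawler behaved this way, one link to a single uploaded file or plugin script would be enough to expose every directory above it, which is where turning off directory listings would matter.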
I just recently noticed this myself. In my case, I was generating a sitemap from my server logs, so the URLs got included that way.
I wrote a list of google sitemap filter expressions to clean that stuff out, which you’re welcome to steal.
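As a rough idea of what that kind of filtering might look like (this is not the commenter's actual list, just a sketch with made-up patterns), you could drop backend paths from a log-derived URL list before writing the sitemap. `EXCLUDE_PATTERNS` and `filter_sitemap_paths` are hypothetical names.

```python
import re

# Hypothetical exclusion patterns for default WordPress backend paths.
# These are assumptions for illustration; tune them to your own site.
EXCLUDE_PATTERNS = [
    re.compile(r"^/wp-admin/"),
    re.compile(r"^/wp-includes/"),
    re.compile(r"^/wp-content/(plugins|cache|themes)/"),
]


def filter_sitemap_paths(paths):
    """Keep only the paths that match none of the exclusion patterns."""
    return [
        path
        for path in paths
        if not any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)
    ]


if __name__ == "__main__":
    logged = [
        "/",
        "/wp-content/cache/wp-cache-cdecddb6bcc0b95b96cdcb347.meta",
        "/wp-content/themes/default/",
        "/some-post/",  # hypothetical public URL
    ]
    print(filter_sitemap_paths(logged))
```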
WordPress does have a serious problem with directories that are not linked to getting indexed. I don’t know why.
But the easiest thing must be to fix it with a robots.txt or .htaccess file.
Great blog by the way.
I have seen something similar with backlinks. I posted a comment on one of the blogs with a link to a website. A few days later, when I checked the backlinks, I discovered that the page had links from very strange, nonexistent files on that blog. It was all gone after a couple of days. ??
Rae
You should block Googlebot from visiting your site. In my experience they play dirty and seem to gather a lot of stuff from your sites, even things that are not supposed to be gathered.
(By the way, did you know that when you see no pages indexed using the site:domain.com search, it does not mean pages are not indexed? I found this by accident when my stats showed hits on a particular page. When I searched on the keyword for that page, the page was showing in Google. How is that for the sneaky tricks SEs play?)
Live on other means of traffic, girl. Google is only part of it.
Good luck
Swaminathan