Google has long driven the bulk of visitors to my flagship website, BigBlueBall.com. Recently, one of my moderators noticed BigBlueBall falling in rank on Google search results. Nothing alarming, but curious nonetheless. I did some digging around and found some interesting information.
First, I did a little keyword analysis and found BigBlueBall had dropped out of sight on some search strings, like "msn messenger," where we come up as the 318th listing. We might as well not be listed at all at that point.
Next I ran across a great post by skyhawk133 on The Admin Zone describing suspected penalties by Google for what it perceives as duplicate content. Supposedly, if Googlebot finds two pages with identical content on the same domain, both pages get dropped from the index. This makes sense if you want to discourage search engine tweakers from gaming the system to earn higher rankings for their sites. Unfortunately, skyhawk133 had inadvertently created an environment where two different URLs could lead to the same content. The culprit? mod_rewrite.
mod_rewrite is an Apache module that lets you reformat an otherwise confusing URL, turning it into something mere mortals can comprehend. I used it for the forums at BigBlueBall with two goals in mind: user-friendly URLs that tell the visitor something about the link, and keyword-rich URLs that would improve the rank of those discussions in search results.
Here’s an example of how this works in practice. Every discussion in the forum is assigned an ID number, and normally the URL retrieves the discussion you want by passing that ID in the query string, like this:
http://www.bigblueball.com/forums/showthread.php?t=31754
This is functional, but it provides no keywords and gives the visitor who sees the link no clue about the nature of the topic. So I used mod_rewrite to create an alternate version that looks like this:
http://www.bigblueball.com/forums/t31754-pc-world-recommends-bigblueball.html
Granted, it’s not the prettiest URL you’ve seen, but it does provide both keywords and increased usability. I can look at the URL and get a pretty good idea what to expect when I click it.
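For the curious, the rule behind that address looks something like this. Treat it as a simplified sketch rather than the exact directive running on BigBlueBall, and adjust the pattern to your own forum’s layout:

RewriteEngine On
# Pull the numeric thread ID out of the friendly URL and hand it to showthread.php
RewriteRule ^t([0-9]+)-.*\.html$ showthread.php?t=$1 [L]

Dropped into an .htaccess file inside /forums/, this captures the ID from the friendly address and internally serves the same showthread.php page, so the visitor never sees the query-string version unless it’s linked somewhere directly.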
The problem is that if both links are visible somewhere on the website, the Googlebot spider will find them and treat the two URLs as duplicate content. I did some checking, running queries on the big three search engines to see how many pages each has indexed at BigBlueBall.com. Interesting results:
As you can see, Yahoo is good to us, Google not quite so. The real surprise was MSN Search, where we really lag behind.
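For anyone who wants to run the same comparison on their own domain, each of the big three supports some form of the site: operator, so the query looks like this:

site:bigblueball.com

The counts returned are only estimates, but they’re close enough to spot a gap like ours.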
The solution? I’m not 100% certain, but I’m testing some exclusion directives in the robots.txt file. This file, if it exists in your web root, tells well-behaved spiders like Googlebot what it may and may not index. By excluding URLs containing “showthread” I effectively eliminate the (perceived) duplicate URLs. With some luck, the next Google dance will prove this theory correct.
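For reference, the directives I’m testing look roughly like this. It’s a sketch, so adjust the path to wherever your own scripts live:

User-agent: *
# Keep spiders away from the query-string version of each thread
Disallow: /forums/showthread.php

Disallow works as a prefix match on the URL path, so that one line covers showthread.php with any thread ID in the query string, while the rewritten .html addresses remain fair game for the spiders.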
Marcus says
That is very curious. You would think that a search engine as advanced as Google would have made provisions for such a popular mishap. Hopefully the robots file will fix the problem; I’ll have to remember that little trick for future sites!
Jeff says
Marcus, it’s not really a “mishap” that Google hasn’t accounted for. Although I’m not actually duplicating content, that is a “technique” some unscrupulous websites employ to artificially boost their rank. I believe Google is correct to penalize such abuse.
Marcus says
No, I agree that purposeful duplication should be penalized, but your content isn’t actually duplicated; it just has two addresses that lead to the same content. I can see how that can be confused with duplicated content, but how many forums and other content management systems employ the same mod_rewrite scheme for friendlier links? I’d say a heck of a lot. Google should be advanced enough to tell the difference between a mod_rewrite alias and true duplication.
Though I suppose as long as your linking scheme consistently uses one set of links before being indexed, Googlebot won’t ever find the duplicate links, right? So perhaps the problem isn’t quite as “popular” as I initially thought. Worth keeping an eye on anyway, though!