June 19th, 2006

Duplicate Content in Blogs: Some Solutions

This comes a day late: my previous post on the duplicate content problem in blogs went up on the 17th, so the “tomorrow” I promised was actually yesterday. But that’s what you get in World Cup season, eh?

Anyway, back to the main issue at hand. I said earlier that while many (if not all) of the most popular blog CMSes (Content Management Systems) automatically create duplicate content through multiple archive pages and RSS feeds, there are things you can do about the problem, if you are bothered enough to take action.

There are some very obvious ways of tackling the problem, e.g. removing multiple archive pages entirely, but they tend to hurt your blog’s accessibility too much. So here are a couple of more measured steps you might want to take (in no particular order) to prevent search engines from “seeing” substantial amounts of duplicate content on your blog(s):

  1. No-Index Flagging
    One of the most direct ways of going about this is to keep search engines away from the duplicate pages, either by disallowing them in robots.txt or by adding a noindex robots meta tag to the head of each page. However, this method is questionable in the wake of heavy discussion that certain search engines (namely Google) may not be completely obeying the noindex and nofollow directives (though many agree that this could very well be a simple bug).

    After my previous post, one reader pointed out by e-mail that duplicate content also creates multiple pages which others could link to, causing PageRank to be split between them. Indeed, a noindex tag doesn’t solve this problem, but I feel it is less relevant on blogs, since external links tend to come from people who know the importance of linking to the individual post page (and who thus earn a trackback).

  2. Excerpts
    One way of differentiating the content between your “main” page (i.e. the one you want search engines to target) and your duplicate content pages is to use excerpts. Since, for most blogs, the page you’d want search engines to target is the individual post page, you could easily set your archive pages to display only excerpts of content. Many popular WordPress blog themes (e.g. K2) already do so (though both WordPress and Movable Type show full content on archive pages by default), and it isn’t difficult to set up yourself in WordPress. If you want, you can even carry this step over to your feeds and set up partial feeds. However, since I’m a firm advocate of full feeds, I wouldn’t recommend this step unless you feel it is strictly necessary (or intend to do so anyway for other reasons, e.g. content scraping).
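
To make the two steps above concrete, here is a rough sketch in Python. It is not how WordPress or Movable Type actually do it: the page-type names, the function names and the 55-word limit are all made up for illustration, and treating the home page as indexable is my own assumption. It also uses “noindex, follow” rather than “noindex, nofollow”, on the reasoning that crawlers should still be able to reach individual posts through the archive links.

```python
import re

# Hypothetical page types; real blog software names these differently.
# Assumption: only individual posts and the home page should be indexed.
INDEXABLE_PAGE_TYPES = {"post", "home"}


def robots_meta_tag(page_type: str) -> str:
    """Return a robots meta tag for the <head> of a page of the given type."""
    if page_type in INDEXABLE_PAGE_TYPES:
        return '<meta name="robots" content="index, follow">'
    # Archive-style pages: ask search engines not to index the page itself,
    # while still letting them follow its links to the individual posts.
    return '<meta name="robots" content="noindex, follow">'


def make_excerpt(html_body: str, max_words: int = 55) -> str:
    """Strip HTML tags (crudely) and keep only the first max_words words."""
    text = re.sub(r"<[^>]+>", " ", html_body)
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so readers know there is more on the post page.
    return " ".join(words[:max_words]) + " [...]"


if __name__ == "__main__":
    print(robots_meta_tag("monthly_archive"))
    print(make_excerpt("<p>Full post content goes here.</p>", max_words=3))
```

An archive template would call something like robots_meta_tag() once per page and make_excerpt() once per listed post, while the individual post page keeps the full content. The tag stripping here is deliberately naive; a real theme would lean on whatever excerpt feature its blog software already provides.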

So, are there any other steps that could be taken to solve the duplicate content problem in blogs?

Edit: Late last year, there was some discussion among the more influential people in RSS circles about setting up a standard no-index flag for feeds. As far as I know, though, it has remained mere discussion.


