SEO Myth: There is No Duplicate Content Penalty
This is probably old news to black hats, but I often hear people say “there’s no duplicate content penalty.” Newbies worry they’ll incur some kind of penalty for having identical copyright text across 100 pages or something, and other people like me jump in to alleviate their fears: “Google doesn’t penalize duplicate content; it filters it out.”
However, a few months ago, back when I still believed supplemental results were largely due to duplicate content, I ran a test to try to figure out exactly what % similarity I had to hit for pages to squeeze into the main index. I created a directory with several pages: one original page, plus several variants ranging from 90% similar to the original down to 20%. Initially, all the pages got into the main index. Then after a few months, the entire directory poofed. If Google were only filtering duplicate content, I’d expect the original page, at least, to remain indexed. I also expected a page that was only 20% similar to stay in the index. But no. Every page in that directory disappeared from both the main and supplemental indexes. At that point, I suspected that Googlebot was refusing to index any page in that directory.
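For anyone who wants to run a similar test, here’s a minimal sketch of one way to put a number on “% similar” between two pages. It assumes Jaccard similarity over 3-word shingles, which is just one common measure. The function names and sample sentences are made up for illustration, and nothing here is claimed to be how Google judges duplication.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical page texts; in a real test these would be the visible
# body text of the original page and one of its variants.
original = "the quick brown fox jumps over the lazy dog"
variant = "the quick brown fox jumps over the sleepy cat"

score = jaccard_similarity(original, variant)
print(f"{score:.0%} similar")
```

Changing only two words out of nine already drops the score well below what a word-for-word count would suggest, because every shingle that touches a changed word stops matching. Different measures give different percentages for the same pair of pages.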
Here’s the logic. Say I have a site with 245,230,991 pages and at least 60% of those pages are very similar. Does Googlebot really want to spend the time and effort to crawl all those pages? Keep in mind, since Big Daddy, Google’s been very picky about which pages to crawl and index. PageRank became an anti-spammer weapon built to protect Googlebot from crawling thousands of low-value spam pages with nothing but guest book links pointing at them. So if Googlebot thinks that a good number of pages within a directory are too similar, then it would make sense not only to filter those pages out but also to stop crawling any pages in that directory in the future.
Caveman says something similar in this thread, started way back on Sep. 29, 2005 (several months before Big Daddy):
The fact that even within a single site, when pages are deemed too similar, G is not throwing out the dups - they’re throwing out ALL the similar pages…if they find four pages on the same site about a certain kind of bee, and the four pages are similarly structured, and one is a main page for that bee, and the other three are subpages about the same bee, each reflecting a variation of that bee, the site owner now seems to run the risk that they will find all of the pages too similar, and filter them all, not just the three subpages.
Anyway, today I was re-reading a Webmasterworld thread regarding Adam Lasnik’s Duplicate Content post, and happened on a few interesting comments Adam wrote:
Some guy asked: Why not build into your webmaster toolkit something like a “Duplicate Content” threshold meter?
The fact that duplicate content isn’t very cut and dry for us either (e.g., it’s not “if more than [x]% of words on page A match page B…”) makes this a complicated prospect.
Todd wrote about this 6 months ago:
If it was as easy as saying that any page with more than 42% duplicate content will be filtered from the search results, then all site owners and SEO’s would probably grab 40% duplicate content for every page filler. It IS NOT a percentage.
With regard to the duplicate content penalty (emphasis mine), Adam says:
As I noted in the original post, penalties in the context of duplicate content are rare. Ignoring duplicate content or just picking a canonical version is MUCH more typical…Again, this very, very rarely triggers a penalty. I can only recall seeing penalties when a site is perceived to be particularly “empty” + redundant; e.g., a reasonable person looking at it would cry “krikey! it’s all basically the same junk on every page!”
So if I take Adam’s word for it, Google does penalize sites for duplicate content, though it’s a once on a DVD night kinda thing (I cancelled my Netflix like a year ago).