| Duplicate Content |
|
Duplicate Content in a Post-Panda World
One SEO issue that has been ignored for too long is duplicate content. The problem has been around for years, and the way Google handles it has evolved dramatically and seems to get more complicated with every update.
I. What Is Duplicate Content?
Duplicate content exists when two or more pages share the same content.
This simple concept causes difficulty because people often make the mistake of thinking that a “page” is a file or document sitting on their web server. To a crawler (like Googlebot), a page is any unique URL it happens to find, usually through internal or external links. Especially on large, dynamic sites, creating two URLs that land on the same content is surprisingly easy (and often unintentional).
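To make the “page equals URL” point concrete, here is a minimal Python sketch that groups URL variants a crawler would treat as separate pages. The normalization rules (dropping `www.`, folding `/index.html` into `/`, treating `http` and `https` as the same, and ignoring the parameters in `IGNORED_PARAMS`) are illustrative assumptions, not rules Google is known to apply; whether two URLs really serve the same content depends on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: query parameters assumed NOT to change the page content.
IGNORED_PARAMS = {"sort", "sessionid", "utm_source"}


def normalize(url: str) -> str:
    """Collapse common URL variants into one comparable form (assumed rules)."""
    parts = urlsplit(url)

    # Host names are case-insensitive; treating www and non-www as equal
    # is an assumption that only holds if both serve the same site.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]

    # Treat /index.html and an empty path as the directory root.
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    if not path:
        path = "/"

    # Drop ignored parameters and sort the rest so order does not matter.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS)

    # Scheme is forced to http so that https copies compare equal (assumption).
    return urlunsplit(("http", host, path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalized URL to the list of raw URLs that collapse into it."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return groups
```

Feeding a crawl export through `group_duplicates` and looking at any group with more than one entry gives a quick list of candidate true duplicates to review by hand.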
II. Why Do Duplicates Matter?
The problems caused by duplicate content have taken many forms as the algorithm has changed. Here is a brief look at those problems over the years.
The Supplemental Index
In the early days, indexing the web was a massive computational challenge. To deal with it, pages seen as duplicates were stored in a secondary index called the “supplemental” index, where they became second-class citizens and lost any competitive ranking ability.
The Crawl “Budget”
Google has no absolute crawl budget, meaning no fixed number of pages it will crawl on a site. However, there is a point at which Google may give up crawling the site for a while, especially if the spiders keep going down winding paths. Although the budget isn’t absolute, you can get a sense of it from the crawl stats in Google Webmaster Tools. If Google encounters a lot of duplicate content, the pages you want indexed may not be crawled at all, or may be crawled less often.
The Indexation “Cap”
There’s no set “cap” on how many pages of a site Google will index, but there seems to be a dynamic limit that is relative to the authority of the site. If you fill the index with useless, duplicate pages, you may push out more important, deeper pages. For example, if you load up on thousands of internal search results, Google may not index all of your product pages. Many people make the mistake of thinking that more indexed pages is better. I’ve seen too many situations where the opposite was true. All else being equal, bloated indexes dilute your ranking ability.
The Penalty Debate
Long before Panda, a debate would erupt every few months over whether or not there was a duplicate content penalty. While these debates raised valid points, they often focused on semantics – whether or not duplicate content caused a Capital-P Penalty. While I think the conceptual difference between penalties and filters is important, the upshot for a site owner is often the same. If a page isn’t ranking (or even indexed) because of duplicate content, then you’ve got a problem, no matter what you call it.
The Panda Update
Since Panda (starting in February 2011), the impact of duplicate content has become much more severe in some cases. It used to be that duplicate content could only harm that content itself. If you had a duplicate, it might go supplemental or get filtered out. Usually, that was OK. In extreme cases, a large number of duplicates could bloat your index or cause crawl problems and start impacting other pages.
Panda made duplicate content part of a broader quality equation – now, a duplicate content problem can impact your entire site. If you’re hit by Panda, non-duplicate pages may lose ranking power, stop ranking altogether, or even fall out of the index. Duplicate content is no longer an isolated problem.
III. Three Kinds of Duplicates
1. True Duplicates – Pages that are 100% identical in content and differ only by URL.
2. Near Duplicates – Pages that differ only by a small amount, such as a block of text or an image.
3. Cross-domain Duplicates – Two websites that share the same content.
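The difference between true and near duplicates can be illustrated with a rough text comparison. The Python sketch below uses word shingles and Jaccard similarity, which is one common way to measure overlap and not necessarily what any search engine uses. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a), shingles(text_b)
    return len(set_a & set_b) / len(set_a | set_b)


def classify(text_a: str, text_b: str, near_threshold: float = 0.8) -> str:
    """Label a pair of page texts; the threshold is an arbitrary assumption."""
    if text_a == text_b:
        return "true duplicate"
    if similarity(text_a, text_b) >= near_threshold:
        return "near duplicate"
    return "distinct"
```

Cross-domain duplicates are the same comparison applied to pages on different hosts, so the sketch only needs the extracted page text from each site.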
IV. Tools for Fixing Duplicates
1. 404 (Not Found) – The simplest option is to remove the page and return a 404.
2. 301 Redirect – Another way to remove a page is to redirect it to another location.
3. Robots.txt – Block search crawlers with a robots.txt file, which can block an entire folder or a single URL. It is not great for removing content that is already in the index, though, and some search engines frown on its overuse and don’t recommend it for duplicate content.
4. Meta Robots – You can control search bots at the page level with a “Meta Robots” directive tag:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
This tells search bots not to index the page or follow the links on it. The other option is a content value of “NOINDEX, FOLLOW”, which lets bots crawl the paths on the page without indexing the page itself.
5. Rel=Canonical – This is a tag that goes in the page header, like Meta Robots:
<link rel="canonical" href="http://www.website.com/" />
When search engines arrive on a page with a canonical tag, they attribute the page to the canonical URL, regardless of the URL they used to reach the page. So, for example, if a bot reached the above page using the URL “www.example.com/index.html”, the search engine would not index the additional, non-canonical URL. Typically, it seems that inbound link-juice is also passed through the canonical tag.
It’s important to note that you need to clearly understand what the proper canonical page is for any given website template. Canonicalizing your entire site to just one page or the wrong pages can be catastrophic.
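Because a wrong canonical can be so damaging, it can help to check what a template actually emits. The following Python sketch uses the standard-library `html.parser` module to read the canonical URL and Meta Robots directives out of a page’s HTML. The class and function names are hypothetical, and the sketch assumes you already have the HTML as a string.

```python
from html.parser import HTMLParser


class HeadTagAudit(HTMLParser):
    """Collect the canonical URL and robots directives found in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attr = dict(attrs)
        if tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")
        elif tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.robots = [d.strip().lower() for d in content.split(",")]


def audit(html: str):
    """Return (canonical_url, robots_directives) for the given HTML."""
    parser = HeadTagAudit()
    parser.feed(html)
    return parser.canonical, parser.robots
```

Running `audit` over a sample of URLs from each template and comparing the canonical value with the URL you intended is a quick way to catch a site that canonicalizes everything to one page.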
6. Google & Bing URL Removal – You can request that an individual page be manually removed from the index.
7. Parameter Blocking – You can also use Google Webmaster Tools (GWT) to specify URL parameters that you want Google to ignore (which essentially blocks indexation of pages with those parameters). In the same section of Bing Webmaster Center (“Index”), there’s an option called “URL Normalization”. The name implies Bing treats this more like canonicalization, but there’s only one option – “ignore”.
8. Rel=Prev & Rel=Next – Just this year (September 2011), Google gave us a new tool for fighting a particular form of near-duplicate content – paginated search results. I’ll describe the problem in more detail in the next section, but essentially paginated results are any searches where the results are broken up into chunks, with each chunk (say, 10 results) having its own page/URL.
You can now tell Google how paginated content connects by using a pair of tags much like Rel-Canonical. They’re called Rel-Prev and Rel-Next. Implementation is a bit tricky, but here’s a simple example:
<link rel="prev" href="http://www.example.com/search/2" />
<link rel="next" href="http://www.example.com/search/4" />
In this example, the search bot has landed on page 3 of search results, so you need two tags: (1) a Rel-Prev pointing to page 2, and (2) a Rel-Next pointing to page 4. Where it gets tricky is that you’re almost always going to have to generate these tags dynamically, as your search results are probably driven by one template.
While initial results suggest these tags do work, they’re not currently honored by Bing, and we really don’t have much data on their effectiveness. I’ll briefly discuss other methods for dealing with paginated content in the next section.
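Generating the Rel-Prev and Rel-Next tags dynamically, as described above, might look like the following Python sketch. It assumes page URLs follow the `base_url/N` pattern from the example, that the first page gets no Rel-Prev and the last page gets no Rel-Next, and that the template knows the total page count. The function names are hypothetical.

```python
from html import escape


def _link(rel: str, href: str) -> str:
    """Render one <link> tag, escaping the URL for use in an attribute."""
    return f'<link rel="{rel}" href="{escape(href, quote=True)}" />'


def pagination_links(base_url: str, page: int, last_page: int) -> list:
    """Return the rel=prev/next tags for one page of a paginated series."""
    if not 1 <= page <= last_page:
        raise ValueError("page must be between 1 and last_page")

    tags = []
    if page > 1:  # the first page has nothing before it
        tags.append(_link("prev", f"{base_url}/{page - 1}"))
    if page < last_page:  # the last page has nothing after it
        tags.append(_link("next", f"{base_url}/{page + 1}"))
    return tags
```

Calling `pagination_links("http://www.example.com/search", 3, 10)` from the shared search template would produce the two tags shown in the example for page 3.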
9. Internal Linking – It’s important to remember that your best tool for dealing with duplicate content is to not create it in the first place. Granted, that’s not always possible, but if you find yourself having to patch dozens of problems, you may need to re-examine your internal linking structure and site architecture.
When you do correct a duplication problem, such as with a 301-redirect or the canonical tag, it’s also important to make your other site cues reflect that change. It’s amazing how often I see someone set a 301 or canonical to one version of a page, and then continue to link internally to the non-canonical version and fill their XML sitemap with non-canonical URLs. Internal links are strong signals, and sending mixed signals will only cause you problems.
10. Don’t Do Anything – Finally, you can let the search engines sort it out. This is what Google recommended you do for years, actually. Unfortunately, in my experience, especially for large sites, this is almost always a bad idea. It’s important to note, though, that not all duplicate content is a disaster, and Google certainly can filter some of it out without huge consequences. If you only have a few isolated duplicates floating around, leaving them alone is a perfectly valid option.
V. Tools for Diagnosing Duplicates
1. Google Webmaster Tools – In Google Webmaster Tools, you can see a list of crawled pages that have duplicate title tags and meta descriptions. Just go to “Diagnostics” > “HTML Suggestions”.
2. Google’s Site: Command – To find out if Google has indexed any copies of your home page, you can combine the “site:” command with the “intitle:” operator. The same approach works with “inurl:” and with a quoted block of content:
site:example.com intitle:"Home Page title"
site:example.com inurl:sort=
site:example.com inurl:https
site:example.com "this is a block of content"
3. SEOmoz Campaign Manager – If you’re an SEOmoz PRO member, you have additional tools to help spot duplicates on your site.
4. Your Own Brain – Last, and most important, don’t forget to use your own brain to find duplicate content. Tools aren’t perfect and almost always leave some gaps in what you can find. A critical step is to navigate your site yourself and look for duplicate pages.
That’s the Brief Review
This is only a brief overview of “Duplicate Content in a Post-Panda World” by Dr. Pete, from seomoz.org. For the full, in-depth guide, please download the document or visit the original article on seomoz.org.
|



