Prevent PDF Articles from Becoming Duplicate Content

If you have PDF versions of articles on your site, there is always a chance they will be indexed and flagged as duplicates, harming the ability of either the parent page or the PDF to rank.

Additionally, PDFs generally make for poor landing pages: should a user arrive at a PDF on your site there is no branding, no navigation and no way for them to consume more of your content, or do much of anything other than click back to the search results. Not a good thing.

Further still, links could end up pointing at the PDF, lessening the ability of the main page to rank for its chosen terms, serving up the PDF in its place, or simply wasting link equity that should rightfully be allocated to the HTML version of the article.

All in all, these problems can cost your hard-built, spectacular content the traffic it deserves. Fortunately, it is a problem with a simple solution.

1. Prevent Indexation of Your PDFs

The first, and totally cross-platform, solution is simply to block access to your PDF files in robots.txt. This quick fix ensures the PDFs will not be crawled and makes them far less likely to be indexed, so folks should always land on the parent page with the download link.

We have two options here: block access to the PDF file directly, or block access to a folder that contains the PDFs. Both work, but I prefer organising PDF files into a folder and blocking access there, as it is pretty much a fire-and-forget solution: as you add new PDFs you can simply link to them, safe in the knowledge they will not be crawled.

Block a folder in robots.txt

User-agent: *
Disallow: /pdfs/

Block a file in robots.txt

User-agent: *
Disallow: /pdfs/yourfile.pdf
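As a quick sanity check, Python's standard `urllib.robotparser` can confirm that rules like the ones above behave as intended. This is a minimal sketch using the hypothetical site URL from this post:

```python
from urllib.robotparser import RobotFileParser

# The folder-blocking rules from the example above
rules = [
    "User-agent: *",
    "Disallow: /pdfs/",
]

parser = RobotFileParser()
parser.parse(rules)

# Compliant crawlers may not fetch anything under /pdfs/ ...
blocked = parser.can_fetch("*", "http://www.yoursite.co.uk/pdfs/pdf-article.pdf")
# ... but the parent HTML article remains crawlable
allowed = parser.can_fetch("*", "http://www.yoursite.co.uk/article.html")
print(blocked, allowed)  # False True
```

Remember that this only tells you what well-behaved crawlers will do; as discussed below, it does not guarantee the URL will never appear in the index.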

That deals with any potential duplication, makes it much less likely the PDFs are indexed, and means users should arrive at the correct page.

2. Allocate All Link Equity to the Parent Page

Whilst the above solution makes the PDFs less likely to be indexed, they will still be available at a given URL, so we may still end up with inbound links pointing at them. If those pages are not indexed we lose the benefit of those links, so we want to allocate their link equity to the correct page.

The best way to do this is to set the canonical URL for the PDF to the parent article via an HTTP header in your .htaccess file (Apache on Linux & *nix hosting), which is easily achieved by adding the following.

Assuming we have:

  • www.yoursite.co.uk/article.html
  • www.yoursite.co.uk/pdfs/pdf-article.pdf

Here pdf-article.pdf is a duplicate of article.html, and article.html is the page we want returned in the search results.

<Files "pdf-article.pdf">
 # Requires mod_headers; <Files> matches the filename only, not the path
 Header add Link '<http://www.yoursite.co.uk/article.html>; rel="canonical"'
</Files>
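To confirm the header is actually being sent, request the PDF and inspect the response headers. Here is a minimal, self-contained sketch of the same idea using only Python's standard library and the hypothetical URLs above: a tiny local server attaches the canonical Link header to the PDF path, and a HEAD request reads it back.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical parent article from the example above
CANONICAL = "http://www.yoursite.co.uk/article.html"

class PdfHandler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        if self.path == "/pdfs/pdf-article.pdf":
            # Equivalent of the Apache Header directive above
            self.send_header("Link", '<%s>; rel="canonical"' % CANONICAL)
        self.send_header("Content-Type", "application/pdf")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral local port and serve in the background
server = HTTPServer(("127.0.0.1", 0), PdfHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/pdfs/pdf-article.pdf" % server.server_address[1]
response = urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
link_header = response.headers["Link"]
print(link_header)  # <http://www.yoursite.co.uk/article.html>; rel="canonical"
server.shutdown()
```

On a live site the equivalent check is simply a HEAD request for the PDF (e.g. `curl -I`) while looking for the Link header in the response.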

Which solution should you use?

Well, thanks to some feedback in the comments below from the ever-knowledgeable Jim Hodson of www.canonicalSEO.com, it turns out these solutions do not work well in tandem, and option 2 is the best approach here. With the PDF left crawlable, Google can fetch it, pick up the canonical directive and use that to assign all value to the parent page.


The key point, from Jim Hodson's comment below:

“If the PDF is blocked in robots.txt, Googlebot will never request that PDF even though they may know it exists because of external “followed” links. If they never request the PDF URL because it’s blocked by robots.txt then they will never see your rel=”canonical” in the HTTP header.”

That is a very smart observation that I overlooked when I wrote the original post, so do pop over to Jim's site; there is certainly much to learn over there.


It is worth noting that I have seen instances where the second approach just won't play nicely on some servers, so robots.txt may be all you are left with; but if you can use the second approach, it is certainly the one to go for.

Windows IIS

I am no Windows guy and have only managed some basic rewrites with a web.config file, but again we have some great insight from Jim with regards to issuing header directives from IIS:

The analogy to an Apache “module” on IIS is an “ISAPI filter”. Using .NET, you can write an ISAPI filter which becomes part of IIS. Similar to Mod_Rewrite, your ISAPI filter can intercept page requests and do all sorts of things… URL rewriting, URL redirection, modify HTTP headers (which is what you need in this case), etc.

Another solution if you manage your own Windows/IIS servers is that you can purchase and install ISAPI Rewrite, a utility from HeliconTech.com. It’s a Mod_Rewrite compatible utility (written as an ISAPI Filter) that reads .htaccess files and processes them just like Mod_Rewrite (with a few tiny exceptions due to operating system differences between Linux and Windows). But I’d say it’s 99+% compatible… and very inexpensive. The syntax for ISAPI Rewrite’s .htaccess files for IIS is identical to that of Mod_Rewrite’s .htaccess files on Apache.
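For completeness, IIS 7 and later can also attach the header without custom code or third-party tools: a static custom header can be scoped to the PDF in web.config. This is a sketch assuming the hypothetical URLs above; note the angle brackets and quotes must be XML-escaped.

```xml
<configuration>
  <!-- Scope the header to the one PDF; paths are the hypothetical examples above -->
  <location path="pdfs/pdf-article.pdf">
    <system.webServer>
      <httpProtocol>
        <customHeaders>
          <!-- Sends: Link: <http://www.yoursite.co.uk/article.html>; rel="canonical" -->
          <add name="Link"
               value="&lt;http://www.yoursite.co.uk/article.html&gt;; rel=&quot;canonical&quot;" />
        </customHeaders>
      </httpProtocol>
    </system.webServer>
  </location>
</configuration>
```

The limitation is that each PDF needs its own `<location>` block, so for large numbers of PDFs the rewrite-based approaches Jim describes are more practical.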

No duplication and correct link allocation

So, as it turns out, this is a fairly neat solution: there are no problems with duplication or with two versions of the same content competing with each other, users are served the best version of the content, and any links are allocated to the HTML version, ensuring it ranks as well as it deserves.

Win a Free SEO Audit

Starting in March 2013 we are going to give one of our lucky followers a free, comprehensive SEO audit, so to be in with a chance to win please be sure to follow me on Twitter, Facebook and Google+. Oh, and please share the article if you found it helpful! 🙂


2 Responses

  1. Hello Marcus. Great post. Just wanted to clarify a couple of things and add an IIS solution for those in the Windows world.

    The first thing that I wanted to clarify is that blocking the PDFs (or any URL for that matter) with robots.txt doesn’t guarantee they won’t be indexed and/or shown in the SERPs. It simply means they will not be crawled in the future.

    If a URL (PDF or otherwise) is already indexed and then you block it with robots.txt (especially if they have external links), the URL will likely remain indexed unless you use the URL removal tool in Webmaster Tools to have it removed (remember to block them first THEN use the URL removal tool). But even this won’t guarantee the PDF won’t be shown in the SERPs.

    Even if they are not indexed and you block them with robots.txt, if enough external sites link to the PDF and Google thinks your URL might be relevant to the query (based on the link text of those inbound links), they might still show the PDF URL in the SERPs. When this happens you’ll see a constructed title in the SERPs (typically based on the search phrase and/or link text of the PDF’s external links) because Google can’t crawl the page to get the real title element. You’ll also notice that there is no snippet for the SERP item… again this is because Google can’t crawl the page to get your meta description or to read the content of the page to construct a snippet.

    It should also be noted that if you go with the rel=”canonical” in the HTTP headers solution (approach #2 above) to solve the problem then you definitely do NOT want to also block the PDFs in your robots.txt.

    If the PDF is blocked in robots.txt, Googlebot will never request that PDF even though they may know it exists because of external “followed” links. If they never request the PDF URL because it’s blocked by robots.txt then they will never see your rel=”canonical” in the HTTP header.

    So essentially… these two approaches are mutually exclusive.

    As for IIS solutions…

    As you know, in Linux environments… Mod_Rewrite is a standard Apache “module”. It’s a utility module whose code is maintained separately from Apache which essentially gets linked into and becomes part of an Apache build.

    The analogy to an Apache “module” on IIS is an “ISAPI filter”. Using .NET, you can write an ISAPI filter which becomes part of IIS. Similar to Mod_Rewrite, your ISAPI filter can intercept page requests and do all sorts of things… URL rewriting, URL redirection, modify HTTP headers (which is what you need in this case), etc.

    Another solution if you manage your own Windows/IIS servers is that you can purchase and install ISAPI Rewrite, a utility from HeliconTech.com. It’s a Mod_Rewrite compatible utility (written as an ISAPI Filter) that reads .htaccess files and processes them just like Mod_Rewrite (with a few tiny exceptions due to operating system differences between Linux and Windows). But I’d say it’s 99+% compatible… and very inexpensive. The syntax for ISAPI Rewrite’s .htaccess files for IIS is identical to that of Mod_Rewrite’s .htaccess files on Apache.

    Hope that helps!

