If you have PDF versions of articles on your site, there is always a chance they will get indexed and flagged as duplicates, hurting the ability of either the parent page or the PDF to rank.
Additionally, PDFs generally make poor landing pages: should a user arrive at a PDF on your site, there is no branding, no navigation, and no way for them to consume any more of your content or do much of anything other than click back to the search results – not a good thing.
Further still, links could end up pointing at the PDF, lessening the main page's ability to rank for its chosen terms, serving up the PDF in its place, or simply wasting link equity that should rightfully be allocated to the HTML version of the article.
All in all, these problems can cost your hard-built, spectacular content the traffic it deserves. Fortunately, it is a problem with a simple solution.
1. Prevent Indexation of Your PDFs
The first, and fully cross-platform, solution is to block access to your PDF files in robots.txt. This quick fix ensures the PDFs will not be crawled and are far less likely to be indexed, so searchers should always land on the parent page with the download link.
We have two options here: block access to a PDF file directly, or block access to a folder that contains your PDFs. Both will work, but I tend to prefer organising PDF files into a folder and blocking access there, as it is pretty much a fire-and-forget solution – as you add new PDFs you can simply link to them, safe in the knowledge they will not be indexed.
Block a folder in robots.txt
User-agent: *
Disallow: /pdfs/
Block a file in robots.txt
User-agent: *
Disallow: /pdfs/yourfile.pdf
That at least deals with any potential duplication, makes it far less likely these PDFs will be indexed, and means users should only arrive at the correct page.
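Before deploying, it is worth sanity-checking that the rules do what you expect. Here is a quick sketch using Python's standard-library robots.txt parser; the domain and file names are the illustrative ones from this article, not real URLs.

```python
# Sanity-check the robots.txt rules above with Python's stdlib parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /pdfs/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The PDF folder is blocked, but the parent HTML article stays crawlable.
print(parser.can_fetch("*", "http://www.yoursite.co.uk/pdfs/pdf-article.pdf"))  # False
print(parser.can_fetch("*", "http://www.yoursite.co.uk/article.html"))          # True
```

The same check works against a live site by calling `set_url()` and `read()` instead of `parse()`.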
2. Allocate All Link Equity to the Parent Page
Whilst the above solution makes the PDFs less likely to be indexed, they will still be available at a given URL, so we may still end up with inbound links pointing at these PDFs. If those pages are never indexed, we lose the benefit of these links, so we want to allocate the link equity to the correct page.
The best way to do this is to set the canonical URL for the PDF to the parent article via an HTTP header. On Apache (Linux & *nix hosting) this is achieved fairly easily by adding the following to your .htaccess file.
Assuming we have:

www.yoursite.co.uk/pdfs/pdf-article.pdf
www.yoursite.co.uk/article.html

where pdf-article.pdf is a duplicate of article.html, and article.html is the page we want returned in the search results pages.
# <Files> matches on the filename only, so no directory prefix is needed.
# The Header directive requires mod_headers to be enabled.
<Files "pdf-article.pdf">
  Header add Link '<http://www.yoursite.co.uk/article.html>; rel="canonical"'
</Files>
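To see why this works, it helps to look at what a crawler receives: an HTTP `Link` header on the PDF response pointing at the HTML page. The sketch below shows one way such a header could be parsed; the helper function and regex are my own illustration, not Google's actual parser.

```python
# Illustrative sketch: extracting the canonical URL a crawler would see
# in the HTTP Link header attached by the .htaccess rule above.
import re

def canonical_from_link_header(link_header):
    """Return the URL from a Link header with rel="canonical", or None."""
    match = re.search(r'<([^>]+)>\s*;\s*rel="canonical"', link_header)
    return match.group(1) if match else None

header = '<http://www.yoursite.co.uk/article.html>; rel="canonical"'
print(canonical_from_link_header(header))
# http://www.yoursite.co.uk/article.html
```

You can confirm your own server is sending the header with `curl -I http://www.yoursite.co.uk/pdfs/pdf-article.pdf` and looking for the `Link:` line in the response.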
Which solution should you use?
Well, thanks to some feedback in the comments below from the ever-knowledgeable Jim Hodson of www.canonicalSEO.com, it would seem these solutions don't work well in tandem, and that option 2 is the best approach here. It lets Google crawl both versions, pick up the canonical directive from the PDF, and use that to assign all value to the parent page.
The following is Jim's comment:
“If the PDF is blocked in robots.txt, Googlebot will never request that PDF even though they may know it exists because of external “followed” links. If they never request the PDF URL because it’s blocked by robots.txt then they will never see your rel=”canonical” in the HTTP header.”
I think that’s a very smart observation that I overlooked when I wrote the original post, so folks, pop over to Jim’s site – there is certainly much to learn over there.
It’s worth noting that I have seen instances where the second approach just won’t play nicely on some servers, so robots.txt may be all you are left with; but if you can use the second approach, it is certainly the one to go for.
I am no Windows guy and have only managed some basic rewrites with a web.config file, but again we have some great insight from Jim with regards to issuing header directives from IIS:
The analogy to an Apache “module” on IIS is an “ISAPI filter”. Using .NET, you can write an ISAPI filter which becomes part of IIS. Similar to mod_rewrite, your ISAPI filter can intercept page requests and do all sorts of things: URL rewriting, URL redirection, modifying HTTP headers (which is what you need in this case), etc.
Another solution if you manage your own Windows/IIS servers is that you can purchase and install ISAPI Rewrite, a utility from HeliconTech.com. It’s a Mod_Rewrite compatible utility (written as an ISAPI Filter) that reads .htaccess files and processes them just like Mod_Rewrite (with a few tiny exceptions due to operating system differences between Linux and Windows). But I’d say it’s 99+% compatible… and very inexpensive. The syntax for ISAPI Rewrite’s .htaccess files for IIS is identical to that of Mod_Rewrite’s .htaccess files on Apache.
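For those on IIS 7 or later who would rather not write or buy an ISAPI filter, IIS can also attach custom response headers per path directly in web.config. The following is a sketch only – I have not tested it, and the path and URL are the illustrative ones from this article – but it mirrors what the .htaccess rule above does on Apache:

```xml
<configuration>
  <!-- Attach the canonical Link header only to the PDF's responses -->
  <location path="pdfs/pdf-article.pdf">
    <system.webServer>
      <httpProtocol>
        <customHeaders>
          <add name="Link"
               value="&lt;http://www.yoursite.co.uk/article.html&gt;; rel=&quot;canonical&quot;" />
        </customHeaders>
      </httpProtocol>
    </system.webServer>
  </location>
</configuration>
```

Note the angle brackets and quotes in the header value must be XML-escaped inside the attribute.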
No duplication and correct link allocation
So, as it turns out, this is a fairly neat solution: it ensures you have no problems with duplication or with two versions of the same content competing with each other, users are served the best version of the content, and any links are allocated to the HTML version, ensuring it ranks as well as it deserves.
Win a Free SEO Audit
Starting in March 2013 we are going to give one of our lucky followers a free, comprehensive SEO audit. To be in with a chance to win, please be sure to follow me on Twitter, Facebook and Google+. Oh, and please share the article if you found it helpful!