Google has recently stated its preference for sites to run over the secure HTTPS protocol and has gone so far as to promise a small ranking boost to those that do. Evidence to date suggests this really is a very small signal, so there is no need to go into a non-HTTPS tailspin just yet, but looking to the future we probably want to heed the search gods' advice and move in that direction. What we want to ensure, though, is that whether you stay on HTTP, implement HTTPS as an option or move fully over to HTTPS, you only have the one version indexed.
The problem with having two distinct protocols over which your website can be viewed is that, from an external perspective, we have two different sites. If we end up with both versions indexed we can see duplication problems. We can also see a split of equity, where some value goes to HTTP and some goes to HTTPS. This can impact visibility in search, hinder SEO efforts and generally hit you in the pocket.
When good websites go bad
In practice, a single issue like this should not be a big problem, but if all sites had only one simply fixable issue I would be a happy man. Or out of work. One or the other. The fact is we see more issues now than ever before, so this tends to compound other problems.
The following is an example from a site we worked on recently that had lots of duplication issues. The names have been changed to protect the innocent but the facts are the facts.
The site example.com is available on four URL variations:
- http://example.com
- http://www.example.com
- https://example.com
- https://www.example.com
Internal navigation is dynamic and uses relative links, so whichever version of the site we land on, the navigation reinforces that link structure. The site has a canonical URL, but it is dynamically applied too, so if you are on http://example.com then that is the canonical, but if you are on https://www.example.com then that is the canonical instead.
We also have lots of URL filtering variables, and a broken category that generates an endless loop that gets ever deeper: www.example.com/category/ becomes www.example.com/category/category/ becomes www.example.com/category/category/category/ and on forever and a day.
We have lots of problems here:
- four URL variations the site can be accessed on
- this is compounded by the dynamic canonical and the internal navigation
- lots of filtering variables creating almost endless URL variations
- a never ending loop creating ever deeper categories
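When auditing a crawl for the loop problem above, a simple heuristic is to flag any URL whose path repeats a segment back to back. A minimal sketch, assuming URLs have already been collected by your crawler (the function name is illustrative, not from any particular tool):

```python
# Hypothetical sketch: flag crawled URLs whose paths repeat a segment
# consecutively (e.g. /category/category/), the telltale sign of the
# endless category loop described above.
from urllib.parse import urlparse

def has_segment_loop(url: str) -> bool:
    """Return True if any path segment immediately repeats itself."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(a == b for a, b in zip(segments, segments[1:]))

print(has_segment_loop("http://www.example.com/category/category/"))  # True
print(has_segment_loop("http://www.example.com/category/widgets/"))   # False
```

Running a check like this over a crawl export surfaces looping URLs long before they balloon into hundreds of thousands of junk pages.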
It’s quite the mess. A site that should have around 2000 canonical pages actually has over a million crawlable URLs and over 150,000 pages of indexable content (I use the word content very loosely here).
The real world implications of this were around 150,000 indexed pages (most of which were junk) and none of the actual products being returned in search results, which was the main commercial reason for the site to exist. We performed a comprehensive SEO audit on this site, and whilst we found other small optimisation issues, it was the duplication and accessibility/crawl issues that were bringing this once strong online business to its digital knees.
When we consider the site had HTTP and HTTPS versions of both the naked and www versions of the URL, the problem could quadruple.
A single, simple fix for all scenarios
Fortunately, big issues don't always need big solutions, and taking care of the HTTP and HTTPS problem was easy. We simply implemented a fixed canonical URL across all versions of the site. It was decided that the https://www.example.com/ version would be used, and instead of a dynamic canonical (which is a bonkers idea that utterly defeats the purpose) a static canonical was used for the protocol, subdomain and domain portion of each link. Suddenly we have one version of the site that can be indexed rather than four, potentially solving 75% of the problem.
Dealing with the duplication
<link rel="canonical" href="https://www.example.com" />
<link rel="canonical" href="https://www.example.com/page/" />
One protocol. One subdomain.
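One way to make that static canonical impossible to get wrong is to pin the origin in one place and only ever vary the path. A minimal sketch, assuming the preferred version is https://www.example.com (the function and constant names are illustrative, not from any particular CMS):

```python
# Hypothetical sketch: build the canonical tag from a FIXED origin so it
# never varies with the protocol or host the visitor arrived on.
from urllib.parse import urlparse

CANONICAL_ORIGIN = "https://www.example.com"  # illustrative preferred version

def canonical_tag(requested_url: str) -> str:
    """Keep the page's path; pin the protocol and subdomain."""
    path = urlparse(requested_url).path or "/"
    return f'<link rel="canonical" href="{CANONICAL_ORIGIN}{path}" />'

# All four variants of the same page now declare one canonical:
print(canonical_tag("http://example.com/page/"))
print(canonical_tag("https://www.example.com/page/"))
```

However the template is built, the key design point is the same: the protocol and host in the canonical come from configuration, never from the incoming request.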
A few quick pointers on handling this as well as possible across all protocol/domain variations:
- Always use an absolute URL
- Always show the preferred version of the URL in the canonical
- Ensure your internal navigation matches the preferred URL
- Ensure your sitemap matches the preferred URL
- Ensure you use a consistent protocol (HTTP or HTTPS)
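The pointers above all boil down to one rule: every URL you expose, in canonicals, navigation and the sitemap, should sit on the same origin. A minimal sketch of an audit check under that assumption (names and the preferred origin are illustrative):

```python
# Hypothetical sketch: given URLs collected from canonical tags, internal
# navigation and the XML sitemap, flag any that stray from the preferred
# protocol + subdomain.
from urllib.parse import urlparse

PREFERRED = ("https", "www.example.com")  # (protocol, host) - illustrative

def off_origin(urls):
    """Return the URLs whose scheme or host differ from the preferred origin."""
    bad = []
    for url in urls:
        parts = urlparse(url)
        if (parts.scheme, parts.netloc) != PREFERRED:
            bad.append(url)
    return bad

print(off_origin([
    "https://www.example.com/page/",
    "http://example.com/page/",   # wrong protocol AND wrong subdomain
]))
```

Anything this flags is a page quietly voting for the wrong version of your site.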
We see a lot of poorly implemented canonical URLs that can make problems worse, so be sure to understand exactly how the search engines want these implemented so they help, rather than hinder, the accurate indexation of your content.
This is not the only way to do this but, in my experience, it is the simplest and best, and it is after all what the canonical URL exists for. This collates any equity in the four versions of each page and deals with the indexation and duplication issues. Groovy!
A few other small fixes
This site’s problems did not stop there, so we also had to fix the broken category behind the never-ending loop and tell Google Search Console (formerly Webmaster Tools) to ignore a bunch of the URL variables. All easy enough. When we then crawl the site and mimic what we have set up in Search Console, we find only 2,000 or so URLs. Great for crawl efficiency: the site is now simple to crawl, understand and index.
Just to be sure
In our experience, this is most problematic when it is one of many problems on a site, so ensuring you have only a single sub-domain with a simple 301 redirect is always sensible, as is adding all versions of your site to Google Search Console (formerly Webmaster Tools) and setting the preferred version there.
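For the 301 redirect side of this, a common approach on Apache servers is a couple of mod_rewrite rules in .htaccess. A minimal sketch, assuming https://www.example.com is the preferred version and mod_rewrite is enabled (always test redirect rules on staging first):

```apache
# Illustrative .htaccess sketch: 301 every request to the single
# preferred https://www. origin.
RewriteEngine On

# Force HTTPS
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

# Force the www subdomain
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```

With the redirect and the static canonical pointing at the same version, the two mechanisms reinforce each other rather than fighting.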
One protocol to rule them all!
Hopefully, this gives you all you need to remove any daft duplication issues across your protocols and subdomains. This is such a simple fix it is something of a gift, but if you have any questions or just need a friendly eye to look over your canonical implementation then drop me a comment below.