Block Most of a Directory in Robots.txt

In a world of content management systems (CMS) and plugins that can do nearly everything you want, instantly, often for free or for a nominal cost, it is easy to put together a website with a raft of functionality without blowing your budget to pieces.

That’s great, it really is, but as anyone who has worked with search engines or websites for a while knows, there is no such thing as a free lunch, and there are trade-offs for this free functionality.

One problem we often see with sites that feature a lot of plugins, and even with sites that have been built by hand, is that we end up with lots of automatically generated pages that often have little content (calendar plugins are a usual suspect here). The risks here are several, but the most obvious is falling foul of the quality updates that target thin content, such as Google Panda. If all the pages on your site show adverts as well, and you are generating practically empty advert pages, then that could hurt even more. We also have to consider crawl budget: web crawlers get awfully bored crawling infinite, empty, dynamic web pages.

 

The Ideal Solution

The ideal solution is to remove these pages and tweak your system so that they are not generated at all.

 

Robots.txt

Now, whilst we strive for perfection, sometimes budgets don’t allow for it and we need simple but hopefully effective hacks to resolve issues, and robots.txt gives us a simple option here. With a few simple lines we can tell search engines that these pages should not be crawled (strictly speaking, robots.txt controls crawling rather than indexing, but blocking the crawl is usually enough to keep thin pages out of harm’s way).

Now, where all of these thin-content or dynamically generated pages sit under a single directory, we can block them all in one fell swoop.

User-agent: *
Disallow: /directory/

Now, we are sending a message to Google and co. that we don't want any of the files in that directory crawled - simple, eh?
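If you want to sanity check a rule like this before deploying it, Python's standard urllib.robotparser module can be fed the rules directly. This is just a minimal sketch: the domain and page paths below are placeholders, not real URLs.

from urllib.robotparser import RobotFileParser

# The rules from above, parsed straight from a string rather than
# fetched from a live robots.txt file.
rules = """
User-agent: *
Disallow: /directory/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URLs: anything under /directory/ should come back blocked.
for url in ("https://www.example.com/directory/some-page/",
            "https://www.example.com/contact/"):
    print(url, "->", "crawlable" if parser.can_fetch("*", url) else "blocked")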

 

What if you want to rank some of the pages?

So, there is always one: always one page you need to keep that would scupper this solution. Fortunately, we have a fix for that as well. Robots.txt rules are grouped by user agent, and within a group Google applies the most specific (longest) matching rule, with Allow winning when the lengths tie, so we can disallow a whole directory but still allow the few pages we want to keep.

The following set of rules will still block the directory but will allow a single HTML file, plus a single sub-directory and all of its children, to be crawled.

User-agent: *
Disallow: /directory/
Allow: /directory/keep.html
Allow: /directory/subdir/
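A quick way to see why the Allow lines win is to apply the precedence Google documents: the rule with the longest matching path wins, and Allow beats Disallow on a tie. (The standard library urllib.robotparser applies rules in file order rather than by path length, so it is not a like-for-like test bed for Allow overrides.) The sketch below is a simplified, wildcard-free illustration of that longest-match logic, using the rules above.

# Simplified sketch of Google-style precedence: the longest matching path
# wins, and Allow wins a tie. Wildcards and other robots.txt features
# are deliberately ignored here.
RULES = [
    ("Disallow", "/directory/"),
    ("Allow", "/directory/keep.html"),
    ("Allow", "/directory/subdir/"),
]

def is_crawlable(path):
    # Collect every rule whose path is a prefix of the requested path.
    matches = [(len(rule_path), directive == "Allow")
               for directive, rule_path in RULES
               if path.startswith(rule_path)]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    # max() picks the longest path; on equal length, Allow (True) beats Disallow (False).
    return max(matches)[1]

for path in ("/directory/keep.html",
             "/directory/subdir/page.html",
             "/directory/junk.html",
             "/somewhere-else/"):
    print(path, "->", "crawlable" if is_crawlable(path) else "blocked")

Run against these rules, keep.html and anything under /directory/subdir/ come back crawlable, while the rest of /directory/ stays blocked.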

 

A quick and easy one for a Friday!

So, this is not the only option, and possibly not the best, but if you find yourself at the mercy of an SEO audit, discover that you have lots of junk pages indexed and need a quick and easy way to deal with them, then this should do the job.

Any questions, please drop a comment, and if the article helped you, please thank me with a like or a share!

 

 
