How to use .htaccess to prevent duplicate content - Heart Internet Blog - Focusing on all aspects of the web

All site owners want their audience to find them easily online and ranking highly on Google is definitely a good way to do it. While there are many factors that search engines take into account when ranking a site, having unique, relevant, and useful content is right at the top.

But there is a common problem that site owners need to deal with, especially when their sites have hundreds of pages, and that’s duplicate content. Ignore it, and you risk being penalised by Google, making you almost invisible to potential customers who are searching for your products or services online.

Find out what qualifies as duplicate content, why it can lead to problems, and some smart ways to avoid it.

What is duplicate content and how does it affect your website?

Duplicate content refers to two or more pages on a website that have identical content, with the only difference being the URL, even if the URL differs by only one character.

Google indexes information about each site and each page, and it sees every URL as a separate page. For example, take these two URLs:

  • genericscreenshotdomain.com/generic.html
  • www.genericscreenshotdomain.com/generic.html

So what is the problem? While these are just two different ways to access the same page, Google sees them as two pages (different URLs, same content) and marks them as duplicate content.

With duplicate content, search engines have a difficult time deciding which page is the most relevant for a search query. And if there are enough of these duplicate pages, they can penalise the site for duplicate content issues, and that decreases the search engine ranking for that site.

How to use .htaccess to solve this problem

There is an easy fix, however. If your site is running on an Apache server (and most Linux packages will be), you can create an .htaccess file to avoid duplicate content issues. This file informs the server how to serve files to the browser. With your own .htaccess file, you can redirect users and search engines to specific URLs.

Why redirect them? Since the search engine is seeing your content show up on what it thinks are two different pages, you need to point them to the right location. You need all the variations possible on that URL to redirect to your preferred URL.

For example, if you prefer having people go to www.genericscreenshotdomain.com/generic.html you can redirect genericscreenshotdomain.com/generic.html to go to the www version. Google and other search engines will only index your preferred version, and that will be the only URL for that page.

This is a 301 redirect: a permanent redirect that tells browsers and search engines your content has moved for good to the preferred URL.

How to fix duplicate content issues

There are some standard problems that site owners face with duplicate content, and, luckily, these are all easy to fix with an .htaccess file. Open up your favourite text editor (Notepad++, TextEdit, Leafpad, or Metapad, to name a few) and save a file as .htaccess.

Here are just some of the problems:

www.genericscreenshotdomain.com/generic.html is also accessible under genericscreenshotdomain.com/generic.html

Both the www and non-www URLs point at the same page, but since the URLs look different, Google sees them as different pages. To fix this, add this to your .htaccess file:

# Force www
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301,NC]

This is a permanent 301 redirect, which means that any non-www link, whether genericscreenshotdomain.com/generic.html or genericscreenshotdomain.com/blue.html or just genericscreenshotdomain.com will point to the www version of that URL.
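
If you would rather use the non-www version as your preferred URL, the same approach works in reverse. This is only a minimal sketch, assuming your domain is example.com (a placeholder, as in the snippet above) and that HTTPS is already set up for it:

```apache
# Force non-www (the reverse of the rule above)
# Assumption: example.com is a placeholder for your own domain
RewriteEngine On
# Only act when the request came in on the www host
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
# Send the same path to the non-www host with a permanent redirect
RewriteRule ^(.*)$ https://example.com/$1 [L,R=301]
```

Pick one version and stick with it. Using both the www and the non-www rule at the same time would send visitors back and forth in a loop.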

http://www.genericscreenshotdomain.com/generic.html is also accessible under https://www.genericscreenshotdomain.com/generic.html

The second URL uses SSL encryption and is the preferable one. To make certain that all your pages are only accessed over SSL, add this to your .htaccess file:

# All calls go to SSL
RewriteEngine On
RewriteCond %{HTTPS} !on
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

This way, even if someone types in http://www.genericscreenshotdomain.com/generic.html, they’ll be automatically redirected to the secure version of your site.

EDIT: Please note – if you’re on a Heart Internet server, you will need to use:

# All calls go to SSL
RewriteEngine On
RewriteCond %{ENV:HTTPS} !=on
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

www.genericscreenshotdomain.com/generic is also accessible under www.genericscreenshotdomain.com/generic/

Both point to the same page, but because the URLs differ by a trailing slash, search engines can treat them as duplicates. Add this to your .htaccess file to take care of the problem:

# All URLs have trailing slashes
RewriteEngine on
RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]
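
If you prefer URLs without trailing slashes, a rule along these lines does the opposite. It is only a sketch, and it assumes your .htaccess file sits in your website’s main folder:

```apache
# No URLs have trailing slashes (the reverse of the rule above)
# Assumption: this .htaccess file is in the site's main folder
RewriteEngine On
# Leave real directories alone, since Apache adds their slash itself
RewriteCond %{REQUEST_FILENAME} !-d
# Strip the final slash and redirect permanently
RewriteRule ^(.+)/$ /$1 [L,R=301]
```

Use one trailing-slash rule or the other, never both, or the two will keep redirecting to each other.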

If you want to stop all three of these problems from happening, you can use this code:

# Start redirects
RewriteEngine On
# Force www
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301,NC]
# Force SSL
RewriteCond %{HTTPS} !on
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
# Force trailing slashes
RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

Once you’ve added them to your file, save it and upload it to your website’s main folder via FTP. If your text editor saved the file as .htaccess.txt, rename it to .htaccess, and the rules will start working.

Further advice on duplicate content

It can be daunting to get deep into your .htaccess file. But here are some tips to help you:

Test all variations

Check all the possible variations of your website’s URLs to ensure that your redirects display the correct version. We’ve given you three possible examples, but there could be more.
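
One further variation you may come across is a home page that answers on both www.example.com/ and www.example.com/index.html. A rule along these lines is one way to handle it. Treat it as a sketch: example.com is a placeholder, and it assumes index.html is your directory index file.

```apache
# Redirect direct requests for index.html to the folder URL
# Assumptions: example.com is a placeholder, index.html is the directory index
RewriteEngine On
# Only match when the visitor actually asked for index.html,
# not when Apache serves it internally for a folder request
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/(.*/)?index\.html[\s?] [NC]
# $1 holds the folder part of the path, if there is one
RewriteRule ^(.*/)?index\.html$ https://www.example.com/$1 [L,R=301]
```

The condition on THE_REQUEST is there to avoid a redirect loop, since it looks at the original request line rather than the file Apache ends up serving.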

Test your redirects

There are browser plug-ins that can test the status of the redirect. A good one for Chrome is Redirect Path.

Check how Google indexes your site

You can use Google Search Console (formerly Google Webmaster Tools) to see how Google indexes your site and whether there are any issues you need to fix.

Clear your browser cache

Make sure to always clear your browser cache after you’ve made changes to your site. Browsers remember 301 redirects, so an old rule can keep firing until the cache is cleared. This way, you always see the latest version.

Limit the number of consecutive redirects

If you’ve added plenty of redirects, you can have problems when search engine bots look through your site. In this YouTube video, Google’s Matt Cutts recommends not doing more than three consecutive redirects.
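
One way to keep the chain short is to send every variation straight to the final URL in a single hop. The snippet below is a sketch of that idea rather than a drop-in replacement. It assumes example.com is your domain, that you prefer www, HTTPS and trailing slashes, and that the .htaccess file is in your website’s main folder.

```apache
# Single-hop redirects to the preferred URL
# Assumptions: example.com is a placeholder; www, SSL and trailing slashes are preferred
RewriteEngine On
# 1. Missing trailing slash: fix the slash, host and protocol in one redirect
RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ https://www.example.com/$1/ [L,R=301]
# 2. Everything else: redirect if the request is not SSL or not on www
RewriteCond %{HTTPS} !=on [OR]
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```

With these rules, a request for http://example.com/generic should go directly to https://www.example.com/generic/ instead of passing through an intermediate URL first. On a Heart Internet server, the SSL check would use %{ENV:HTTPS} instead of %{HTTPS}.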

Update your internal links

Once you’ve decided which version of your domain you’re going to use, check all your internal links to make certain they point directly at that version and don’t rely on a redirect.

Set your redirect strategy right from the start

We’d recommend setting these redirects straight from the start when you’re creating your website. There’s less risk of breaking anything, and it means you’re avoiding confusing search engines.

Get more out of your .htaccess

If you want to do more with your .htaccess file, you can find more snippets and redirect codes in this great repository on GitHub.

With duplicate content, prevention is better than a cure. Make certain search engines know what they’re pointing to with your .htaccess file.

What other ways would you use to avoid duplicate content issues?

Icons from Linearicons by Perxis.

Comments

  • oli

    15/01/2016

    You are wrong, you need to read your own knowledge base.

    RewriteCond %{ENV:HTTPS} !=on

    is the correct evaluation due to your different server setup.

     
    • Kate Bolin

      15/01/2016

      Hi Oli!

      You’re right, and we’re sorry about that. I’ve changed the blog post to include our server information.

      Thank you!

       
