Using Spider Traps to Discourage Site Scraping


Author Rob Sullivan
Published November 29, 2005
Word count 841

Sometimes your competitors will do almost anything to compete with you, including stealing your content.

To do this, they sometimes use automated software, much like a search engine crawler, which makes the process quicker and easier than copying your site by hand. This can cause many problems for you.

In this article we look at ways to stop this from happening.

Stealing on the web is rampant. I don’t mean stealing people’s user IDs and passwords. I mean stealing from websites.

Webmasters and designers see an image or a cool piece of JavaScript they like, so they take it.

But what really causes problems is when your competitors steal your content.

As we all know, content is king on the web. Whoever has the most content wins. So if a competitor of yours needs to grow quickly, one of the easiest ways to do it is through the use of a website harvester.

A website harvester works much like a search engine crawler. It requests every URL it can find and then downloads all the content associated with those URLs.

So how do you protect yourself from malicious scrapers?

Simple really. You build a spider trap.

As the name implies, you create a section of your site devoted to luring in the spiders that are not friendly, and then you proceed to either trap them or ban them from accessing your site.

What’s involved in making a spider trap?

Usually a bit of PHP code combined with a database and a URL rewriter.

The first thing you need to do is create the space on the site dedicated to capturing those bad bots. You then use robots.txt to exclude that section from crawling.

You do this because you want to ensure Googlebot, Yahoo! Slurp, MSNbot and the others don’t also get trapped. Since most good spiders will follow the robots.txt exclusion protocol you are going to politely deny them access to this location.
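As a rough illustration, here is a small Python sketch that checks whether a robots.txt rule really does keep polite crawlers out of the trap. The directory name `/bot-trap/` is a made-up placeholder, not something named in this article, and the check uses Python's standard `urllib.robotparser` rather than any particular search engine's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical trap location; pick your own directory name.
TRAP_PATH = "/bot-trap/"

# The exclusion rule that well-behaved spiders are expected to honor.
ROBOTS_TXT = f"""\
User-agent: *
Disallow: {TRAP_PATH}
"""


def allowed_for_polite_bots(robots_txt: str, path: str,
                            user_agent: str = "Googlebot") -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed_for_polite_bots(ROBOTS_TXT, "/bot-trap/index.php"))
    print(allowed_for_polite_bots(ROBOTS_TXT, "/articles/"))
```

Running a check like this before you link to the trap is a cheap way to catch a typo in robots.txt that would otherwise send good spiders straight into it.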

From here there are various options. One of my favorites involves logging to a database or text file and then dynamically denying access to the bad bot.

How does it work?

Let me give you a practical example.

I once had a client that was getting harvested many times per day by many different bad spiders. It was so bad at one point that the bad bots were doubling his bandwidth usage.

So we devised a plan: we would build the trap described above, and as soon as it captured a bot’s user agent and IP address, we would ban that bot from the site.

This is how it worked:

The bad bot would come to the site and find a link on an image. The link would point to the trap directory.

Normally, a regular spider would first check the robots.txt file to make sure it was allowed to index the content in that directory. Since the file excluded this directory, the “good” spiders wouldn’t go in.

However the bad spiders ignored robots.txt and went into the directory.

From here there was a PHP script which would run and capture the user agent and IP address.
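The capture step might look something like the following. This is a Python sketch rather than the PHP script used on the client's site, and the function name, the log format (tab-separated timestamp, IP, user agent) and the file location are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_trap_hit(log_path: Path, ip: str, user_agent: str) -> str:
    """Append one tab-separated line (timestamp, IP, user agent) to the log."""
    # Collapse tabs and newlines so a crafted user agent cannot forge extra entries.
    clean_agent = " ".join(user_agent.split())
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp}\t{ip}\t{clean_agent}\n"
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(line)
    return line


if __name__ == "__main__":
    # In a real handler the IP and user agent would come from the web request.
    print(record_trap_hit(Path("trap.log"), "203.0.113.9", "BadBot/1.0"), end="")
```

The user agent is supplied by the visitor, so it is cleaned before being written; the IP address is the more reliable of the two values to ban on.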

Another script would take that information and rewrite the .htaccess file with the bad spider information as soon as it was received.
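A sketch of that rewriting step in Python is shown below. It assumes the older Apache `Order` / `Allow from` / `Deny from` directives and keeps the bans between two marker comments so the rest of the file is left alone. The marker text and function name are invented for this example, and the sketch assumes no other `Order` or `Allow` rules elsewhere in the file conflict with the managed block.

```python
import ipaddress

# Hypothetical markers that fence off the part of .htaccess this script manages.
BEGIN = "# BEGIN bot-trap bans"
END = "# END bot-trap bans"


def add_ban(htaccess_text: str, ip: str) -> str:
    """Return .htaccess text with the IP added to a managed block of Deny rules."""
    # Raises ValueError for anything that is not a plain IP address, which
    # keeps stray text from being injected into the server configuration.
    ip = str(ipaddress.ip_address(ip))
    lines = htaccess_text.splitlines()
    if BEGIN in lines and END in lines:
        start, end = lines.index(BEGIN), lines.index(END)
        before, block, after = lines[:start], lines[start + 1:end], lines[end + 1:]
    else:
        before, block, after = lines, [], []
    # Keep the bans already present and add the new one only once.
    banned = [row.split()[-1] for row in block if row.startswith("Deny from ")]
    if ip not in banned:
        banned.append(ip)
    managed = [BEGIN, "Order Allow,Deny", "Allow from all"]
    managed += [f"Deny from {address}" for address in banned]
    managed.append(END)
    return "\n".join(before + managed + after) + "\n"


if __name__ == "__main__":
    print(add_ban("RewriteEngine On\n", "203.0.113.9"), end="")
```

Working on the text and returning a new string makes it easy to write the result to a temporary file first and then swap it into place, so the server never sees a half-written .htaccess.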

Because the server re-reads the .htaccess file on each request, the new rule took effect right away and that spider was no longer allowed to visit the site.

In another incarnation, this system would write to a database or text file that is then referenced by the site pages through a small PHP script that allows or denies access based on that list.
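The list-based variant can be sketched in a few lines. Again this is Python rather than PHP, and the file format (one IP address per line, with `#` comments) and both function names are assumptions for the sake of the example.

```python
from pathlib import Path


def load_blacklist(path: Path) -> set[str]:
    """Read one IP address per line, ignoring blank lines and # comments."""
    if not path.exists():
        return set()
    entries = set()
    for raw in path.read_text(encoding="utf-8").splitlines():
        entry = raw.strip()
        if entry and not entry.startswith("#"):
            entries.add(entry)
    return entries


def response_status(ip: str, blacklist: set[str]) -> int:
    """Return 403 for a banned visitor and 200 for everyone else."""
    return 403 if ip in blacklist else 200


if __name__ == "__main__":
    banned = load_blacklist(Path("blacklist.txt"))
    print(response_status("203.0.113.9", banned))
```

Each page would call a check like this before sending any content. Compared with rewriting .htaccess, it costs a lookup on every request, but a bad entry can be removed by editing one line in the list.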

Keep in mind that this is very advanced stuff. You don’t want to take it on lightly. Taking it on lightly can (and likely will) get your site removed from the search indexes.

Not because you are doing something you shouldn’t but because there’s always the chance that somehow a good spider like Googlebot would end up on your blacklist.
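One possible safeguard, which is not part of the setup described in this article, is to check whether an IP address really belongs to a known search engine before banning it. The Python sketch below does a reverse DNS lookup, checks the hostname suffix, and then confirms the hostname resolves back to the same IP. The suffix list is an assumption and should be checked against each search engine's own documentation, and the forward lookup used here only handles IPv4.

```python
import socket

# Assumed hostname suffixes; confirm them with each search engine's documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        # The hostname must resolve back to the same address, otherwise
        # anyone who controls their own reverse DNS could fake the name.
        return ip in forward(hostname)[2]
    except OSError:
        # Lookup failures are treated as "not verified".
        return False


if __name__ == "__main__":
    print(is_verified_crawler("203.0.113.9"))
```

The two lookup functions are passed in as parameters so the check can be tried out with stand-in functions instead of live DNS. A trap script could skip the ban, and alert you instead, whenever this returns True.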

Therefore, before you get into such advanced things, I’d make sure you are intimately familiar with what they are and how they work.

A good place to start is to read this page. I wouldn’t advise using this code just yet, but take a look at it to see what it can do.

Also, do some searches on the engines for “bot trap” and “spider trap” to see what other options there are out there. Then, pick the one that works best for you.

In the end, the best bot trap is the one that does what it is supposed to – block harvesters from scraping your site while allowing legitimate search engines to effectively and efficiently index your site.

And if you are at all concerned with this tactic, don’t use it. It’s better to use the manual approach – scour your server logs looking for high activity from unknown user agents, then manually ban them using .htaccess.
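To make the manual approach a little less tedious, a short script can count requests per visitor for you. The Python sketch below assumes the common "combined" access log layout, where each line starts with the client IP and ends with the quoted user agent; lines that do not fit are skipped, and user agents containing escaped quotes are not handled. The function name is made up for this example.

```python
import re
from collections import Counter

# Captures the leading client IP and the final quoted field (the user agent).
LINE_RE = re.compile(r'^(\S+) .* "([^"]*)"$')


def busiest_clients(log_lines, top=5):
    """Return the most active (IP, user agent) pairs with their request counts."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    sample = [
        '203.0.113.9 - - [29/Nov/2005:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "BadBot/1.0"',
        '203.0.113.9 - - [29/Nov/2005:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "BadBot/1.0"',
        '198.51.100.7 - - [29/Nov/2005:10:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    ]
    for (ip, agent), hits in busiest_clients(sample):
        print(hits, ip, agent)
```

The output is only a list of candidates. You still decide by hand which of the busiest visitors are harvesters and which are search engines or real readers.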

Rob Sullivan - SEO Specialist and Internet Marketing Consultant. Reproduction of this article is allowed with an html link pointing to http://www.textlinkbrokers.com
