Published in Brighton, UK

Clagnut

I blame Tonga

You may have noticed that Clagnut was down for most of last week. The culprits were referrer spam robots maxing out my database connections, resulting in my ISP putting my account on hold. Unfortunately, my ISP also blocked email and FTP access, which didn’t exactly help the situation.

The main culprits were pharmaceutical pedlars using sub-domains such as buy-fioricet.drop.to. Fortune City owns the drop.to domain (and others like it), and flogs the sub-domains to spammers. No doubt Fortune City would claim they don’t knowingly sell to spammers, but they are well aware these lowlifes buy their services as the offending web sites have been shut down. Unfortunately shutting the sites down doesn’t stop the robots effectively inflicting a denial of service attack on sites like mine.

Clagnut suffers from the robots because it has dynamic pages – each blog entry is pulled from a database when you view the page. The queries and database tables are well optimised so it’s not normally a problem, but if an army of robots turns up, as happened last week, then they use up all the available database connections – in my case maxing out at 200 simultaneous connections. The irony is that all referrals listed here carry rel='nofollow' attributes, so the spammers won’t even gain any benefit from being shown.

As of this weekend I’m successfully fending off referrer spam robots with a blacklist of referrers which is checked before a database connection is made. In addition, a 403 Forbidden is given to all user agents claiming to have come from a .to domain, by using an .htaccess rule like this:

RewriteCond %{HTTP_REFERER} ^https?://[^/]+\.to(/|$) [NC]
RewriteRule .* - [F]
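
For illustration only, here is a minimal Python sketch of the same two checks (a referrer blacklist plus a blanket block on .to hosts) running before any database connection is opened. It is not the code Clagnut runs: the function names, the `open_db_connection` callback and the single blacklist entry are assumptions made up for the example.

```python
from urllib.parse import urlparse

# Illustrative blacklist entry only; a real list would be much longer.
BLACKLISTED_TERMS = ("buy-fioricet",)
# Top-level domains whose referrers are rejected outright.
BLOCKED_TLDS = (".to",)


def is_spam_referrer(referrer):
    """Return True if the Referer header value looks like referrer spam."""
    if not referrer:
        return False  # no referrer at all: a direct visit, let it through
    host = (urlparse(referrer).hostname or "").lower()
    if host.endswith(BLOCKED_TLDS):
        return True
    lowered = referrer.lower()
    return any(term in lowered for term in BLACKLISTED_TERMS)


def handle_request(referrer, open_db_connection):
    """Check the referrer first; only connect to the database if it passes.

    open_db_connection is a hypothetical callback standing in for whatever
    actually opens the connection. Returns an HTTP status code.
    """
    if is_spam_referrer(referrer):
        return 403  # Forbidden, and no database connection was used
    open_db_connection()
    return 200
```

Matching on the parsed hostname rather than the raw string means a referrer such as http://photo.example.com/ is not mistaken for a .to domain.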

Clearly the blacklist approach is not scalable, but what else can I do? The robots usually identify themselves as IE6 so I can’t filter that way, and I wouldn’t want to keep out legitimate robots such as search engines, so I’m not really sure what my next steps can be. Is there something I should be getting my ISP to do as well? Any help gratefully received…

17 October 2005

§ Clagnut news · PHP/MySQL

14 comments

Comments

  1. 1

    Maybe have a word with Shaun Inman. For Mint, he was able to block out spam robots in the visits stats.

    Jon Hicks
    17 Oct 2005
    14:22 GMT
  2. 2

    Or we could hire some bloodhounds and hunt them down… ;)

    No seriously, there should be a blacklist that one could subscribe to to block half-wits like this.

    Mats Lindblad
    17 Oct 2005
    14:44 GMT
  3. 3

    My .htaccess file contains a blacklist of about 150 referrers that are redirected back to wherever they came from. It works well and I haven’t had any performance problems. Yet ;-).

    Btw, my referrals aren’t even made public, so spamming me is completely pointless. But the robots don’t know that.

    Roger Johansson
    17 Oct 2005
    16:20 GMT
  4. 4

    http://www.modsecurity.org/

    Only a handful of hosts will have it installed but you could always suggest they give it a whirl.

    Chris Winfield
    17 Oct 2005
    16:48 GMT
  5. 5

    If I am not mistaken, Mint “blocks” referer spam by using JavaScript to track visitors. Spambots don’t support JavaScript, so they aren’t counted as visitors (neither are Lynx/w3m/etc users, people with JavaScript disabled, etc). This usefully blocks referer spam from logs (if almost all users have JavaScript enabled), but it doesn’t do anything to reduce server load from spambot traffic.

    If the bots are coming from a few IP addresses, you could implement surge protection.

    bpt
    17 Oct 2005
    20:25 GMT
  6. 6

    Just call the mob and have ‘em off’d.

    Dustin Diaz
    17 Oct 2005
    23:33 GMT
  7. 7

    Huh, wth. And then a nice provider – I’d go bananas.

    Roger, I’m curious about your .htaccess ;)

    Jens Meiert
    18 Oct 2005
    11:56 GMT
  8. 8

    > Roger, I’m curious about your .htaccess ;)

    Me too Roger. How exactly does one redirect a spammer whence they came?

    Rich
    18 Oct 2005
    13:04 GMT
  9. 9

    Could you not create a funky cache, so the pages don’t have to be pulled from the database every time?

    Tom
    18 Oct 2005
    13:06 GMT
  10. 10

    The last few lines from my block of referrer spammers:

    RewriteCond %{HTTP_REFERER} zindagi [OR]
    RewriteCond %{HTTP_REFERER} zoker9 [OR]
    RewriteCond %{HTTP_REFERER} zone-b51
    RewriteRule ^.* h-t-t-p://%{REMOTE_ADDR}/ [L]

    (http replaced with h-t-t-p to avoid auto-linking in this comment)

    I think I found the technique at Caveat lector: http://cavlec.yarinareth.net/archives/2005/01/11/killing-referrer-spam/ .

    More links to ways of dealing with referrer spam:
    http://www.456bereastreet.com/movabletype/mt-search.cgi?IncludeBlogs=1&search=referrer+spam

    Roger Johansson
    18 Oct 2005
    21:22 GMT
  11. 11

    > The robots usually identify themselves as IE6 so I can’t filter that way

    Are you sure about that? How many visitors do you get that use Internet Explorer?

    In any case, I concur with Tom; hitting the database each time when you don’t need to seems very wasteful.

    Jim
    19 Oct 2005
    14:19 GMT
  12. 12

    > The robots usually identify themselves as IE6 so I can’t filter that way
    > Are you sure about that?

    Yes, because I have been tracking the user agent strings.

    > In any case, I concur with Tom; hitting the database each time when you don’t need to seems very wasteful.

    That’s entirely my point. There’s nothing wrong with database-driven dynamic pages – plenty of sites work that way including hugely popular ones like Multimap and everything is hunky dory for normal amounts of traffic. The attack I was under from robots effectively increased my traffic by 1000 times!

    Even if I had static pages I would have shot through my bandwidth limit in about an hour, so that’s why I was asking for advice for keeping out the robots at a higher level.

    Rich
    19 Oct 2005
    16:36 GMT
  13. 13

    The real problem as someone said is that you hit the db each time. There is no such thing as ‘optimized’ code that relies on redundant db querying.

    Caching would be one solution; another would be to write your posts out as flat files after X days. You could dump them out as XML so you still have the data separate.

    You could even just dump them out as XML from the very start, when someone posts a comment you republish ‘1596_comments.xml’ from the data in the db. That would mean the majority of visitors won’t even touch your database.

    Ben
    25 Oct 2005
    23:15 GMT
  14. 14

    Ben – thanks for your input, but using flat files instead of dynamic files would just push the problem somewhere else. Sure it means the site would be less likely to go down due to database failure, but it would mean I get hammered by bandwidth fees. This robot attack meant that my normal monthly bandwidth usage would be consumed in an hour – that’s nearly 1000 times the expected amount of traffic.

    And like I said huge sites like Multimap query the database for each of their 8 million daily page views (although admittedly there won’t be a new db connection opened for each of those).

    The root cause is therefore not that I hit the db each time, but the robots. Keep the robots out and the site behaves. Don’t get me wrong – I do understand that flat files are more server-efficient – that’s why my RSS files are static pages – but changing that won’t solve the problem.

    Rich
    26 Oct 2005
    08:04 GMT
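
The "surge protection" suggested in comment 5 was not spelled out in the thread. One way to read it is a per-IP request limit over a sliding time window, checked before any page is built. The Python sketch below assumes that reading; the class name, the limits and the injectable clock are all hypothetical choices for the example, not anything used on this site.

```python
import time
from collections import defaultdict, deque


class SurgeProtector:
    """Sliding-window request limiter keyed by client IP address.

    A sketch only: state lives in memory in a single process, and the
    default limits are arbitrary placeholders.
    """

    def __init__(self, max_requests=20, window_seconds=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        """Return True if this request from ip is within the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Forget accepted requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller would answer 403 or 503
        hits.append(now)
        return True
```

This only helps if the robots arrive from a small number of addresses, as the comment itself notes. A flood spread across many IPs would pass straight through a limiter like this.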
