towards a better comments spam filter

July 15, 2005

Lately I've been cleaning up in preparation for transferring to a new webhost. (Hi FoSO!) It's generally pretty satisfying; one goal has kind of been simplifying what's in the "root" directory of my site.

But there was one other task that I brought on myself...cleaning up the tens of thousands of spam links that were clogging up the comments for much of 2003 and 2004. The dirty rotten spammers like to have lots of links to their sites so they gain prestige with naive search engines that follow a bastardized version of the Google-esque ideal that incoming links = good sites, and they seem to prefer filling up old comments pages that search engines see just fine but some site owners won't notice for a while. They'll take a day, then, and have their scripts fill it with like 30 or so entries of 20-30 links each. I found out the scale of the problem when I made that view delayed comments feature a few months ago, and spent like 5 or 6 hours last night cleaning...and using a lot of custom scripts and macros to help with that.

For now I'm using a simple filter...if you don't have a link in your comment, no filtering. If you DO have "http://" in your post, then the following is a list of forbidden words:
high low
blue deer
purple bike
slot machine
black jack
health insurance
Pretty brutal, but maybe a temporary measure (more on that below). A message repeating this list will come up if you trigger the block. Cleaning out the old comments, I found only like 4 or 5 "false positives": non-spam comments that would have triggered a block.
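Here's roughly what that filter amounts to, as a minimal Python sketch. The phrases are the ones from the list above; the function name `check_comment`, the case-insensitive substring matching, and the wording of the rejection message are my assumptions for illustration, not the actual script running on the site.

```python
# Phrases copied from the list above. How they are matched is an assumption.
FORBIDDEN_PHRASES = [
    "high low",
    "blue deer",
    "purple bike",
    "slot machine",
    "black jack",
    "health insurance",
]


def check_comment(text):
    """Return (allowed, message). Comments without a link are never filtered."""
    lowered = text.lower()
    if "http://" not in lowered:
        return True, ""
    # Only comments containing a link get checked against the phrase list.
    if not any(phrase in lowered for phrase in FORBIDDEN_PHRASES):
        return True, ""
    # Hypothetical wording; the point is that the block repeats the list.
    message = "Comments with links may not contain: " + ", ".join(FORBIDDEN_PHRASES)
    return False, message
```

So "I love my purple bike" sails through because it has no link, while the same phrase next to an "http://" gets bounced along with the list of forbidden phrases.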

I have a (possibly) better idea for a block that I might install after a few days of seeing how this method works. It's tough thinking of good blocking methods that won't block real people as well. One assumption I was making before is that the spammer downloaded my comments form once, and then just had their slave botscripts run the "submit comments" feature. If that were true, then I could use a simple token method: loading the comments form generates a special one-time key, the key is stored on the server, and then it's passed back to the server on the submit. Two downsides: one is having a lot of unused keys on my server, for people who just view comments and/or the form without entering a comment. The other is that the assumption that the dirty rotten comments spammer is using a "stale" version of my form might be wrong...maybe they're downloading a fresh form each time, so they'd get a fresh token for each of their nasty, stinking, grasping, writhing cesspool chunks of spamlinks.
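The token method might look something like this in Python. This is a sketch under assumptions: the in-memory dictionary stands in for whatever storage the server really has, the names `issue_token`, `redeem_token`, and `purge_expired` are made up, and the one-hour lifetime is just one possible way of dealing with the pile of unused keys.

```python
import secrets
import time

TOKEN_LIFETIME_SECONDS = 3600  # assumed expiry; not something the post specifies
_issued_tokens = {}  # token -> time issued; stand-in for server-side storage


def issue_token(now=None):
    """Call when the comments form is rendered; embed the result in a hidden field."""
    now = time.time() if now is None else now
    token = secrets.token_hex(16)
    _issued_tokens[token] = now
    return token


def redeem_token(token, now=None):
    """Call on submit. Each token works once, and only while still fresh."""
    now = time.time() if now is None else now
    issued = _issued_tokens.pop(token, None)  # pop makes the key one-time
    if issued is None:
        return False
    return now - issued <= TOKEN_LIFETIME_SECONDS


def purge_expired(now=None):
    """Drop keys from people who loaded the form but never commented."""
    now = time.time() if now is None else now
    stale = [t for t, issued in _issued_tokens.items()
             if now - issued > TOKEN_LIFETIME_SECONDS]
    for t in stale:
        del _issued_tokens[t]
    return len(stale)
```

Note this does nothing about the second downside: a bot that fetches a fresh form before every submit gets a fresh, valid token every time.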

Other people try to use javascript cleverness, but there's nothing stopping a spammer from always running your filter through a javascript engine.

So one idea is this: there's a good chance the spammers look for a field labeled "comments" or "name", or just look for a big textarea followed by a smaller one-line text box, or something like that. But what if each time you generated a random name for your textarea, and then had some other hidden variable tell the script which name to find the comment in? It seems like a spammer might not bother to follow that kind of indirection, and it can be made a little stronger by increasing the levels of indirection (you always have one variable "foo" that tells it the first variable to look in, which carries the name of the second variable (randomly generated), which carries the name of the third, which then points to the actual name generated for the textarea). This method won't stand up if their scripts just grab all the variables and fill in whichever one is the textarea, though I'm not sure if they're that clever...maybe I could include 3 textareas, and hide the other 2 with CSS. That too could be scripted, in which case I'm back to simple content filtering.
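To make the indirection concrete, here's a minimal Python sketch of the chain described above. Only the fixed variable "foo" comes from the idea itself; the function names, the way random names are generated, and the default of 3 levels are assumptions for illustration.

```python
import secrets


def build_form_fields(levels=3):
    """Return (hidden_fields, textarea_name) for one rendering of the form.

    "foo" names the first random variable, each random variable names the
    next, and the last one names the textarea that holds the comment.
    """
    names = ["f" + secrets.token_hex(8) for _ in range(levels + 1)]
    hidden = {"foo": names[0]}
    for current, following in zip(names, names[1:]):
        hidden[current] = following
    return hidden, names[-1]


def extract_comment(submitted, levels=3):
    """Follow the chain from "foo" through the submitted fields to the comment."""
    name = submitted.get("foo")
    for _ in range(levels):
        if name is None:
            return None  # chain is broken, so treat the submission as suspect
        name = submitted.get(name)
    if name is None:
        return None
    return submitted.get(name)
```

A bot that posts to a field literally called "comments" never touches the chain, so `extract_comment` comes back with nothing. A bot that parses the fresh form and fills in every textarea it finds would still get through, which is the weakness already mentioned.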

Ugh, what a mess! Comment spammers are an incredibly low lifeform. I'm lucky, though: since I do all the coding myself and am just one site, if I come up with a clever idea, I'm too low-value a target for them to bother working around it, compared to an attack that works on bajillions of blogs using the same scripts.

UPDATE: in September I started getting odd, no-link one-liner comments. They might be test messages or something. So as a patch, I've added some phrases that get rejected if they're the sole content of the message. See this day's entry for a bit more info. Admittedly this is very much a "one off" kind of response, but we'll see how it goes, since so far the href+buzzword filter has done well.
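A "sole content" check could be sketched like this in Python. The phrases here are hypothetical placeholders, since the real ones aren't listed in this entry, and the normalization (lowercasing, collapsing whitespace, trimming end punctuation) is likewise an assumption.

```python
# Hypothetical placeholder phrases; the real list is not given in this entry.
REJECTED_WHEN_ALONE = {"nice site", "great work"}


def is_sole_content_spam(text):
    """True only when the whole comment is nothing but a rejected phrase."""
    normalized = " ".join(text.lower().split()).strip(" .!?")
    return normalized in REJECTED_WHEN_ALONE
```

The same words inside a longer, real comment don't trigger it, which keeps the false positives down.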

Link of the Moment
After all that crap, you deserve a nice link: la Pate a Son is a lovely applet for making musical Rube Goldberg-meets-that-old-"Pipe Dreams"-game setups. I think that it is tweaked to make it almost too hard to make ugly sounds, but it's still pretty cool. (Makes me think that I should try to do a Java or other technology port of that old SimTunes game.)

Oh, I blogged this a long time ago, check out their main site, especially experimental zoo, especially the giraffes. (A bit like that woman falling through bubbles, come to think of it, but years earlier.)