Deflecting Comment Spam
08/04/2006
It's hard to believe that anyone would bother putting comment spam on this website. It's not like I get a million hits a day. Hell, it's not like I get a hundred. Still, the boneheads do it, so I had to find a way to put a stop to it. The problem is that I don't really know anything about their operation. What follows is speculation, but it is bolstered by the fact that the method I came up with does seem to work.
My assumption is that there is precious little money in the comment spam business. I can't imagine that anyone pays more than a fraction of a cent per comment actually placed. So, these guys have to place a hell of a lot of comments and they can't spend any time or effort on any one comment or site, else it wouldn't pay. I've also noticed that, when they get hold of a spot that works, they seem to leave their spam in the same thread over and over. This is pretty stupid, but can be explained one simple way: they just capture a POST to a given URL that seems to work and then use it over and over again. It makes sense that they just have a database that describes the parameters to pass in each HTTP POST and they run through it, sending the same comment to thousands of sites at once. Given these thoughts, how might they be stopped?
Captcha
One favorite is the captcha, that little decipher-the-graphic puzzle we often see in comment forms. It's interesting to note that captchas were initially developed for a much more serious purpose: to prevent software bots from signing up for Hotmail and Gmail accounts. Those accounts have value, so the people designing those bots can afford to spend a lot more time making them smart than the people spamming comments, where the value is very low. While there's not much rocket surgery to a captcha, it's still code to write, graphics to develop and users to annoy, and I decided not to do it unless I had to.
There's a second form of captcha, where you give the user a simple little puzzle to solve that is easy for a human, but very hard for software. You might say, "enter the answer to this question: 4 + 2 = ?". As long as the question is different each time, the bots have to actually work to solve it and that's more than anyone is going to bother with. Still, it takes time to code and annoys the user.
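Here is a minimal sketch of that kind of question-and-answer puzzle, not something this site actually runs. The class name SumPuzzle and its methods are made up for illustration, and the sketch assumes the server has some way to remember the two numbers between sending the form and receiving the answer, which is left out here.

```java
import java.util.Random;

// Hypothetical helper: one randomly chosen addition puzzle.
final class SumPuzzle {
    final int left;
    final int right;

    // Picks two numbers between 1 and 9 so the question differs per form.
    SumPuzzle(Random random) {
        this.left = 1 + random.nextInt(9);
        this.right = 1 + random.nextInt(9);
    }

    // Text to show next to the answer box in the comment form.
    String question() {
        return "Enter the answer to this question: " + left + " + " + right + " = ?";
    }

    // True only if the submitted text parses to the correct sum.
    boolean isCorrect(String answer) {
        if (answer == null) {
            return false;
        }
        try {
            return Integer.parseInt(answer.trim()) == left + right;
        } catch (NumberFormatException e) {
            return false; // blank or non-numeric input is simply wrong
        }
    }
}
```

Even this small version shows the cost: there is state to carry from one request to the next, and one more box for the user to fill in.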
Session Tracking
Another way to stop the spammers would be to track the user's session (usually with a cookie) and when you send down the comment form, send down an authorization code that is tied to that session on the server. When the comment is submitted by the user, it must include the right authorization code or it won't be accepted. This isn't too hard to code, but it does require that the user have cookies enabled, which is a slight limitation. It also means saving authorization codes and bookkeeping them through the process, which is the kind of code that leads to bugs. Trust me, I've been doing this sort of thing for a long, long time. Any time you're saving things over multiple requests there is always a chance you'll lose track of them and thus have a memory leak that means your server will crash mysteriously every few days. Been there, done that. Those sorts of things have to be coded carefully and tested well.
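To make the bookkeeping concrete, here is a rough sketch of the session approach, assuming the session id comes from a cookie handled elsewhere. The class CommentTokens and its methods are hypothetical names, and the in-memory map is only a stand-in for whatever session storage a real server would use.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping for per-session authorization codes.
final class CommentTokens {
    private final SecureRandom random = new SecureRandom();
    private final Map<String, String> tokenBySession = new ConcurrentHashMap<>();

    // Called when the comment form is sent down: remember a fresh code
    // for this session and return it for embedding in the form.
    String issue(String sessionId) {
        String token = new BigInteger(128, random).toString(16);
        tokenBySession.put(sessionId, token);
        return token;
    }

    // Called when the comment is submitted. The stored code is removed
    // whether or not it matches, so each code is good for one attempt.
    boolean verify(String sessionId, String submittedToken) {
        if (sessionId == null || submittedToken == null) {
            return false;
        }
        String expected = tokenBySession.remove(sessionId);
        return submittedToken.equals(expected);
    }
}
```

Note that this sketch never cleans up codes for forms that are requested but never submitted. That is exactly the slow leak described above, and a real version would need some kind of expiry.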
My Cheesy, Easy Solution
When you get down to the bottom of it, we're just trying to ensure that the URL and parameters that spammers want to POST their comments to aren't always the same. If today's POST is different from yesterday's then the spammers would have to do work to figure out the scheme, and my assumption is that they can't make any money at it if they're always having to tinker with the system.
So, what's the difference between today and yesterday? It isn't a trick question: the date, of course. So, I now send down the current date, expressed the way Java likes to do it on the server: as the number of milliseconds since midnight UTC on Jan 1, 1970. I divide the number by 19, just to obscure what I'm doing a little, but that little touch probably isn't needed. I stick this value in a hidden field of the comments form (you can see it for yourself if you look at the HTML source of one of my comments pages; it's right by the Submit button in the HTML).
On the server side, I read that hidden field in, multiply it by 19 and compare it to the current time. If it's within four hours, I take the comment. If it's not, I don't save the comment. I don't, however, return an error. I don't want the spamming bastards to know that their spam isn't sticking; let them keep burning up their bandwidth sending me the spam.
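The whole scheme fits in a few lines. What follows is a sketch of the idea rather than the code actually running on this site: the class name TimestampField and its method names are invented for illustration, and only the divisor of 19 and the four-hour window come from the description above.

```java
// Hypothetical helper for the hidden timestamp field.
final class TimestampField {
    static final long OBSCURING_FACTOR = 19L;               // divisor from the post
    static final long MAX_AGE_MILLIS = 4L * 60 * 60 * 1000; // four hours

    // Value for the hidden field when the comment form is rendered.
    // Integer division drops at most 18 ms, which doesn't matter here.
    static String encode(long nowMillis) {
        return Long.toString(nowMillis / OBSCURING_FACTOR);
    }

    // True if the submitted field decodes to a time within four hours
    // of nowMillis. Missing, malformed or absurd values count as stale.
    static boolean isFresh(String fieldValue, long nowMillis) {
        if (fieldValue == null) {
            return false;
        }
        try {
            long stamp = Math.multiplyExact(
                    Long.parseLong(fieldValue.trim()), OBSCURING_FACTOR);
            long age = Math.subtractExact(nowMillis, stamp);
            return age >= -MAX_AGE_MILLIS && age <= MAX_AGE_MILLIS;
        } catch (NumberFormatException | ArithmeticException e) {
            return false;
        }
    }
}
```

The form would be rendered with TimestampField.encode(System.currentTimeMillis()) in the hidden field. On submission, the comment is saved only if isFresh returns true for the posted value, and the normal page comes back either way so the sender can't tell the difference.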
I've had this in place for two days now and it has deflected eight spam comments, so I suspect my assumptions were valid. Once more, being too lazy to bother with doing it "right" has led to a solution that is, for now at least, much simpler and completely invisible to my users. Note: something this simple might not work for long if you used it on a big, important site that was worth the spammer's time to figure out.
This reinforces a lesson I learned long ago: always hire lazy programmers. They'll try to think their way around problems rather than coding around them and that's nearly always a better way to go.
Update: This method did not work! However, a little tweak to this finally did.
Update: Actually, I found an even simpler and cleaner way to do this, here.
1 comment:
This is a test comment, to make sure I haven't screwed things up.