The Spam Battle Continues...
Tuesday, September 8th 2009 2:52pm
I took my first whack at spam. Blocking IPs was a good start, but I'm still spending too much time deleting these comments. Spammers seem to have no shortage of available IP addresses, nor do they run out of cute things to write:
"We are Dyslexia of Borg. Fusistance is retile. Your ass will be laminated."
Now I'm going to target the URLs
The whole point of comment spam is to get links planted all over the web. It's a vain attempt to increase the search rank of annoying and/or malicious websites. So let's kick 'em in the family jewels.
First, I renamed my `ip_blacklist` MySQL table to `blacklist_ip` and created a similar `blacklist_url` table.
CREATE TABLE `blacklist_url` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `url` VARCHAR(255) NOT NULL,
  `threshold` TINYINT NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE = MYISAM;
I'm still giving the URLs a chance with the `threshold` column, which counts the strikes against each URL instead of blocking it on first sight.
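Here's a rough sketch of how that strike counting could work. It's only an illustration: a plain PHP array stands in for the `blacklist_url` table, the function names `addStrike` and `isBlocked` are made up for this example, and the limit of 3 is an assumption borrowed from the threshold mentioned further down.

```php
<?php
// Hypothetical in-memory stand-in for `blacklist_url`: url => threshold.
function addStrike(array $blacklist, string $url): array {
    // A new URL starts at 1, matching the column's DEFAULT '1'.
    $blacklist[$url] = isset($blacklist[$url]) ? $blacklist[$url] + 1 : 1;
    return $blacklist;
}

// A URL is blocked once its strike count reaches the (assumed) limit.
function isBlocked(array $blacklist, string $url, int $limit = 3): bool {
    return isset($blacklist[$url]) && $blacklist[$url] >= $limit;
}

$blacklist = array();
$blacklist = addStrike($blacklist, 'http://spam.example/page');
$blacklist = addStrike($blacklist, 'http://spam.example/page');
var_dump(isBlocked($blacklist, 'http://spam.example/page')); // bool(false), two strikes
$blacklist = addStrike($blacklist, 'http://spam.example/page');
var_dump(isBlocked($blacklist, 'http://spam.example/page')); // bool(true), three strikes
```

In the real thing the array would be replaced by queries against the table, but the idea is the same: a URL gets a few chances before it's shut out.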
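I put together the following function to create an array of all URLs found in a block of text:
function ExtractURLarray($text) {
    $a = array();
    // Match anything starting with http://, https://, ftp:// or www.
    // The final character class keeps trailing punctuation out of the URL.
    preg_match_all('/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i', $text, $a);
    // $a[0] holds the full matches; drop duplicates and reindex from 0.
    return array_values(array_unique($a[0]));
}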
The comment's attached web address comes from an optional form field, so it won't necessarily appear in the comment's text. I add it to the list separately:
$urlList = ExtractURLarray($comment->text);
if ($comment->website != '')
    $urlList[] = $comment->website;
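To see what ends up in the list, here's a small self-contained sketch. The wrapper name `collectCommentUrls`, the sample text, and the `spam.example` addresses are all made up for illustration; the regex is the same one used in ExtractURLarray(). It also removes duplicates again after the website field is appended, since spammers often repeat the same address in both places.

```php
<?php
// Hypothetical wrapper: gathers URLs from the comment text plus the
// optional website field, then removes duplicates.
function collectCommentUrls(string $text, string $website): array {
    $m = array();
    preg_match_all('/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i', $text, $m);
    $urls = $m[0];
    if ($website != '') {
        $urls[] = $website;
    }
    // Reindex so the keys run 0, 1, 2, ... after duplicates are dropped.
    return array_values(array_unique($urls));
}

$text = 'Buy at http://spam.example/buy-now and www.spam.example/pills, or http://spam.example/buy-now again.';
print_r(collectCommentUrls($text, 'http://spam.example/buy-now'));
// Two entries: http://spam.example/buy-now and www.spam.example/pills
```

Notice that the trailing comma after `pills` is left out of the match, and that the `www.` address comes back without a protocol prefix.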
Get Them at the Domain Level
Absolute URLs from spam typically point to a forum post or user account on a victimized website, or to a single page on a malicious one. Some spam posts have many URLs, often pointing to multiple pages on the same domain. If the domain appears malicious, it's better to block it entirely than to let the threshold build up to 3 on each absolute URL.
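The following function works best with the protocol prefix intact (http/ftp):
function RipDomain($url) {
    // Start at offset 8 to skip the slashes in "https://", then find the first path slash.
    // The length check avoids an out-of-range offset on very short strings.
    $slash = (strlen($url) > 8) ? strpos($url, '/', 8) : false;
    // No path slash means the URL is already just the domain.
    return ($slash !== false) ? substr($url, 0, $slash) : $url;
}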
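For comparison, here's a sketch of a different approach using PHP's built-in parse_url(). This is an alternative, not what my code does: the helper name `hostOf` is hypothetical, it returns only the hostname (no protocol), and it assumes an `http://` prefix for bare `www.` addresses because parse_url() only finds a host when a scheme is present.

```php
<?php
// Hypothetical helper: returns the lowercase hostname, or null if none is found.
function hostOf(string $url): ?string {
    // Bare "www." matches have no scheme, so prepend an assumed "http://".
    if (!preg_match('/^[a-z][a-z0-9+.-]*:\/\//i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() gives null or false when it cannot find a host.
    return is_string($host) ? strtolower($host) : null;
}

var_dump(hostOf('http://Spam.Example/forum/post?id=3')); // string(12) "spam.example"
var_dump(hostOf('www.spam.example/pills'));              // string(16) "www.spam.example"
```

Lowercasing the host means `Spam.Example` and `spam.example` land on the same blacklist row, which matters if the domain is compared as a plain string.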
Make Yourself Some Options
In the previous spam battle post I added a SPAM button at the bottom of the comments. Now there's an extra step before the IP is added to the block list: I can decide how to handle each URL in the offending comment.

This new level of spam triangulation should continue to turn the tide.
posted by Langel

