
Thread: spam filtering/training methodology

  1. #1
    Join Date
    Jun 2008
    Location
    Berkeley, CA
    Posts
    1,474
    Rep Power
    9

    Default spam filtering/training methodology

    This arises out of some bugzilla comments; instead of cluttering up bugzilla, I thought it'd be better to turn it into a forum thread for further discussion, if any.

Basically, in Bug 9532 ("IMAP/Outlook move to junk doesn't train anti-spam") and Bug 37164 ("mail filed into Junk by Filters is not used to train anti-spam"), I raised the question of whether Zimbra ought to train SpamAssassin on messages that have been auto-filed into Junk by a user's filter.

    Currently Zimbra doesn't do this directly, but the capability exists in a sort of roundabout way. E.g., if you have Outlook or an IMAP mail client, you can have it move messages to Junk based on keywords, or on the client's own rules/heuristics, and Zimbra will end up training on that basis. On the face of it this seems to be a valid approach.

But consider this case: a user tells their client to filter based on X-Spam-Level. Dan Martin commented:
        it seems to me if a user is trying to do filter-based training on X-Spam scores specifically, something is funky with the overall setup. X-Spam is, after all, supposed to be the automated scoring system. If a user is getting a lot of false negatives (essentially what a "too low" X-Spam score is), then it seems to me you should analyze the overall scoring of those suspect emails and identify, and if necessary refine, the filters accordingly.
    There are a bunch of ways to go about this, sure. You could re-weight the scores. You could lower the threshold from the default 6.6 to 5.1 or whatever. You could add more detection methods such as DCC, Pyzor, Razor, or image spam detection.

But there may still be some information left in the unique combination of recipient + spam score. Put simply: for certain users, any email over a fairly low score is all but guaranteed to be spam, and it may be valid to feed that information back into the system. In theory this resembles spam training systems that self-train not only on user-sorted false positives/negatives (the way Zimbra does) but also on their existing corpus. "The spam gets spammier and the ham gets hammier," as it were. If you look at this apache.org page on SA, this is basically item 2 in the section on Effective Training. (ASSP also feeds automatically-identified spam back into its training system, see here.)
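For reference, stock SpamAssassin already exposes a conservative version of this idea through its Bayes auto-learn mechanism: messages scoring above/below configurable thresholds get fed back into the Bayes database automatically. A minimal local.cf sketch (the threshold values here are illustrative, not recommendations):
    Code:
    # enable Bayes auto-learning; only messages SpamAssassin is very
    # confident about get fed back into the Bayes database
    bayes_auto_learn 1
    bayes_auto_learn_threshold_spam 12.0
    bayes_auto_learn_threshold_nonspam 0.1
Setting the spam threshold well above the tag threshold is what keeps this loop from training on its own marginal decisions.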

The danger is "drift" due to unsupervised feedback: certain "tokens of legitimacy" could be "poisoned" by becoming statistically associated with spam, until they turn into spurious primary indicators of spam. I'm not aware of any real-world exploitation of this concept, but I did find an article discussing it: Does Bayesian Poisoning Exist? (PDF).

  2. #2
    Join Date
    Jul 2007
    Location
    San Jose, CA
    Posts
    1,027
    Rep Power
    10

    Default

    Hey Elliot,

    Berkeley, eh? I'm in San Jose. . .methinks this ought to be discussed over a beer. . .

    Anyhow, it's at least in part the Bayesian poisoning that worries me, so you're definitely onto me. But the bigger issue I was digging at is that, at least according to what I have observed in production, an awful lot of people have their point threshold set too high. Therefore, too much gets into the non-junk classification (by my standards, anyhow), and therefore requires Bayesian filtering to correct it.

I have addressed this two ways. One, I have lowered the overall threshold (in my system a positive score of only 3 gets you in the Junk folder), and two, I have given very high scores (±5) for the extreme high (BAYES_99) and extreme low (BAYES_00) ends of the Bayes score. The combination of the two gives the user's preference a stronger weight, but shifts the burden onto newly-arrived mail to not look like spam. . .or perhaps I should say "if it looks kinda like junk, my presumption is guilty."
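In local.cf terms, that setup works out to something like the following (the exact numbers are my choices for my site, not defaults):
    Code:
    # lower the overall spam threshold from the stock 5.0
    required_score 3.0
    # pin heavy weights on the extreme ends of the Bayes scale
    score BAYES_00 -5.0
    score BAYES_99 5.0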

This has not resulted in many false positives--extremely few, in fact--and those that do happen are easily remedied by marking them "not junk" in most cases. There is one source I had to manually whitelist because of my hostility to the commercial whitelists, but that's a complication of my own making. . .other than that, it's really quite smooth, for us at least.
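(The manual whitelist entry is just a one-liner in local.cf; the address below is a placeholder, not the actual source:)
    Code:
    # bypass scoring for one legitimate sender that kept tripping the filter
    whitelist_from newsletter@example.com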
    Cheers,

    Dan

  3. #3
    Join Date
    Jun 2008
    Location
    Berkeley, CA
    Posts
    1,474
    Rep Power
    9

    Default

    Sure, look me up when you're going to be in my neck of the woods, Dan.

    When/if we go to production with Zimbra, I definitely plan on tweaking the scores. Threshold, not so much, but I hope to add some more score inputs such as those I mentioned in my first post. (E.g. uceprotect level 2 generates some false positives when I use it to block at the MTA level, but I can use it to score.)
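For what it's worth, using a DNSBL like UCEPROTECT level 2 for scoring rather than outright MTA-level blocking looks roughly like this in SpamAssassin config (the rule name and score value are my own picks, not anything shipped):
    Code:
    # score, rather than block, on a UCEPROTECT level 2 listing
    header   RCVD_IN_UCEPROTECT2 eval:check_rbl('uceprotect2', 'dnsbl-2.uceprotect.net.')
    describe RCVD_IN_UCEPROTECT2 Relay listed in UCEPROTECT level 2
    tflags   RCVD_IN_UCEPROTECT2 net
    score    RCVD_IN_UCEPROTECT2 1.5
That way a listing contributes points toward the threshold instead of rejecting outright, so the occasional false listing doesn't bounce legitimate mail.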

One thing I wonder: according to HowScoresAreAssigned - Spamassassin Wiki, SA is supposed to adjust the Bayes scores on its own, but it looks to me like they're fixed (e.g. BAYES_99 is set at 3.5) and will have to be manually adjusted, as you mention.

  4. #4
    Join Date
    Jul 2007
    Location
    San Jose, CA
    Posts
    1,027
    Rep Power
    10

    Default

    I don't claim to be an authority on SpamAssassin--particularly in its native form which I have not used--but I can confirm without a doubt that the scores remain fixed in the Zimbra implementation. I agree that it appears dynamic in the wiki to which you linked. I don't know if that's a version difference or a question of implementation, however.
    Cheers,

    Dan

  5. #5
    Join Date
    Nov 2006
    Location
    UK
    Posts
    8,017
    Rep Power
    25

    Default

    Have a look at
    Code:
    /opt/zimbra/conf/spamassassin/50_scores.cf

  6. #6
    Join Date
    Jun 2008
    Location
    Berkeley, CA
    Posts
    1,474
    Rep Power
    9

    Default

    What I see starting at line 841 is
    Code:
    # make the Bayes scores unmutable (as discussed in bug 4505)
    ifplugin Mail::SpamAssassin::Plugin::Bayes
    score BAYES_00 0 0 -2.312 -2.599
    score BAYES_05 0 0 -1.110 -1.110
    score BAYES_20 0 0 -0.740 -0.740
    score BAYES_40 0 0 -0.185 -0.185
    score BAYES_50 0 0 0.001 0.001
    score BAYES_60 0 0 1.0 1.0
    score BAYES_80 0 0 2.0 2.0
    score BAYES_95 0 0 3.0 3.0
    score BAYES_99 0 0 3.5 3.5
    endif
    So, that refers to https://issues.apache.org/SpamAssass...ug.cgi?id=4505. (Discussion starts about comment #34.) I guess they need to update their wiki. Thanks!

  7. #7
    Join Date
    Jul 2007
    Location
    San Jose, CA
    Posts
    1,027
    Rep Power
    10

    Default

And besides their grammatical error (the word is "immutable," not "unmutable"), it is these scores that I have chosen to override with my own immutable scoring, in local.cf.
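(Concretely, an override in local.cf uses the same four-score syntax as 50_scores.cf: the four columns cover the score sets without/with network tests and without/with Bayes, so the first two columns stay 0 for Bayes rules. The values below are mine, not recommendations:)
    Code:
    # local.cf -- override the shipped Bayes scores
    score BAYES_00 0 0 -5.0 -5.0
    score BAYES_99 0 0 5.0 5.0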
    Cheers,

    Dan

