Training spam and ham
I would like to improve Zimbra's ability to accurately distinguish spam from ham. My understanding is that there are two recommended ways to accomplish this:
1. Use the Zimbra webmail client to mark messages via the Junk and Not Junk buttons, respectively.
2. From any mail client, use the "Forward as Attachment" function to send single or multiple messages to the special spam/ham training user accounts.
The problem with option #1 is that it does not provide enough "ham" to the training accounts. Why? Because (thankfully) there are not enough false positives to mark as Not Junk. But the ability of the spam detection heuristics to distinguish spam from ham depends on analyzing roughly as many ham messages as spam messages. As far as I can tell, the Zimbra webmail client does not allow the user to select a legitimate message in the inbox and click a Not Junk button to send it along to the ham training account -- only the Junk button is available.
Once we figured out that messages had to be forwarded as attachments, option #2 above appeared to be a viable method.
It occurred to me that there might be a third option, which would be to use a desktop IMAP client to copy messages from a real mail account into the mounted IMAP inbox of the respective spam/ham training user accounts. But I'm not sure how the mounting would be accomplished; while I seem to recall earlier Zimbra installs asked for passwords for these special accounts, I don't seem to remember setting passwords for these two special accounts in the most recent go-round (version 4.0.2 of open source edition, Mac OS X 10.3.8). Is this potential third option even possible?
We have been using both of the above two recommended techniques for several weeks, but unfortunately the accuracy level does not appear to be improving. The most significant reason for this, I believe, is that DSpam does not appear to be learning. In fact, DSpam's penalty is being added to every single message. I've been examining the X-Spam-Status header of all incoming messages, and they all have DSPAM_SPAM=2.5 in the "tests" array. Shortly after installing Zimbra, I modified the salocal.cf.in file in order to increase DSpam's weighting thusly:
header DSPAM_SPAM X-DSPAM-Result =~ /^Spam$/
describe DSPAM_SPAM DSPAM claims it is spam
score DSPAM_SPAM 2.5
header DSPAM_HAM X-DSPAM-Result =~ /^Innocent$/
describe DSPAM_HAM DSPAM claims it is ham
score DSPAM_HAM -2.0
So, from the above information, it would appear that "X-DSPAM-Result: Spam" must be appearing in the headers of every single message. Which is almost true, but not quite: that line appears in every message that DSpam has processed (i.e., DSpam thinks everything is spam -- not good), but there are some incoming messages in which no DSpam headers appear whatsoever. Why some messages would be coming through without any DSpam headers is a worrisome conundrum in and of itself, but even more perplexing is that even then, the "DSPAM_SPAM=2.5" score is being added in the "tests" array -- in the complete absence of any DSpam headers whatsoever.
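To see how often each of these cases occurs, the headers of saved messages can be compared mechanically. The following is a minimal Python sketch, not a Zimbra tool. It assumes the X-Spam-Status header uses a bracketed `tests=[NAME=score, ...]` list (as the "tests array" described above suggests), and the function names and category labels are made up for illustration.

```python
import re
from email import message_from_string

# Assumed layout: X-Spam-Status: ... tests=[AWL=0.1, DSPAM_SPAM=2.5]
TESTS_RE = re.compile(r"tests=\[([^\]]*)\]")


def fired_tests(msg):
    """Return {rule_name: score_text} parsed from the X-Spam-Status header."""
    # Collapse folded header lines into a single line before matching.
    status = " ".join(msg.get("X-Spam-Status", "").split())
    match = TESTS_RE.search(status)
    if not match:
        return {}
    tests = {}
    for item in match.group(1).split(","):
        name, _, score = item.strip().partition("=")
        if name:
            tests[name] = score
    return tests


def classify(raw):
    """Label one raw message by comparing X-DSPAM-Result with the fired rules."""
    msg = message_from_string(raw)
    header = msg.get("X-DSPAM-Result")
    verdict = header.strip() if header is not None else None
    tests = fired_tests(msg)
    spam_rule = "DSPAM_SPAM" in tests
    ham_rule = "DSPAM_HAM" in tests
    if verdict is None:
        # The puzzling case: a DSPAM rule scored, yet no DSPAM header exists.
        return "rule-without-header" if (spam_rule or ham_rule) else "no-dspam"
    if verdict == "Spam" and spam_rule and not ham_rule:
        return "consistent-spam"
    if verdict == "Innocent" and ham_rule and not spam_rule:
        return "consistent-ham"
    return "mismatch"
```

Running `classify` over a folder of exported messages and counting the labels would show whether "rule-without-header" is rare or systematic.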
The spamtrain.log is unfortunately of little assistance in diagnosing the above problems. While it appears that some messages are being used for learning...
command: '/opt/zimbra/dspam-3.6.2/bin/dspam' --class=innocent --source=corpus --user 'zimbra' --mode=teft --feature=chained,noise
/opt/zimbra/dspam/bin/dspam_corpus: 1 messages, 00:00:01 elapsed, 1.00 msgs./sec.
...the absence of a datestamp on each line (or at least at the beginning of the daily batch output) means that it's very difficult to grok when the logged output occurred. Also, there are other errors in the log, and it's not clear how serious these are and how they should be fixed:
config: could not find site rules directory
bayes: cannot open bayes databases /opt/zimbra/amavisd/.spamassassin/bayes_* R/O: tie failed: Inappropriate file type or format
bayes: cannot open bayes databases /opt/zimbra/amavisd/.spamassassin/bayes_* R/W: tie failed: Inappropriate file type or format
ERROR: the Bayes learn function returned an error, please re-run with -D for more information
My experience setting up DSpam manually in previous versions of Zimbra showed that DSpam is extraordinarily accurate when it is being fed a large enough corpus of spam and ham. Now that we've replaced that previous set-up with the integrated DSpam configuration included in Zimbra 4.0.x, it's not entirely clear why DSpam isn't marking any messages (as in zero) as Innocent.
Thank you for taking the time to read this very long-winded message. Any and all suggestions -- and especially point-by-point responses -- would be most sincerely appreciated!
Have you checked this thread (broken training)?
Yes, checked that thread
Hi Klug. Thanks for responding. I looked at that thread, and there was some good information in it. The bug that was filed as a result of that thread appears to be resolved -- or at least that's the status of the bug in Zimbra's bug tracker.
I thought perhaps updating to 4.0.3 would fix the problem, since the bug was filed against 4.0.2, but no such luck. The problem persists after upgrading to 4.0.3, so I assume the bug was fixed in a yet-to-be-released version.
So, here is how I fixed these errors, based on the information in the aforementioned thread. To fix the "cannot open bayes databases" errors, the following command did the trick:
$ sudo rm /opt/zimbra/amavisd/.spamassassin/bayes_*
To fix the "config: could not find site rules directory" problem:
$ sudo mkdir /etc/mail
$ sudo ln -s /opt/zimbra/conf/spamassassin /etc/mail/spamassassin
And lo and behold, these two changes appear to have addressed all of the problems I reported in my original message, at least via manual zmtrainsa commands. Both SpamAssassin and DSpam appear to be -- for the first time -- actually learning from the messages in the spam/ham training accounts.
Tip for initial ham training:
As I mentioned in my original message, I don't recall setting passwords for the training accounts. From the Zimbra Admin interface, however, you can click on each of them and then click the "Change Password" button to give them passwords. This allowed me to mount the ham training account via IMAP, drag a ton of messages into its inbox, and run the zmtrainsa command:
zmtrainsa mail.mydomain.com email@example.com myacctpw ham
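For anyone who would rather script the drag-and-drop step than do it in a desktop client, here is a rough Python sketch using the standard imaplib module. It is only an illustration under assumptions: both connections are already logged in, the folder names default to INBOX, and the host and account names in the usage comment are placeholders, not values Zimbra prescribes.

```python
import imaplib  # used by the caller to open the two connections


def copy_messages(src, dst, src_folder="INBOX", dst_folder="INBOX"):
    """Copy every message in src_folder on `src` into dst_folder on `dst`.

    `src` and `dst` are logged-in imaplib.IMAP4 or IMAP4_SSL objects.
    Returns the number of messages appended successfully.
    """
    # Open the source read-only so flags on the real mailbox are untouched.
    status, _ = src.select(src_folder, readonly=True)
    if status != "OK":
        raise RuntimeError("could not select source folder %r" % src_folder)
    status, data = src.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed on source folder")
    copied = 0
    for num in data[0].split():
        status, parts = src.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        raw = parts[0][1]  # full RFC 822 text of the message, as bytes
        status, _ = dst.append(dst_folder, None, None, raw)
        if status == "OK":
            copied += 1
    return copied


# Hypothetical usage (placeholder host, accounts, and passwords):
#   src = imaplib.IMAP4_SSL("mail.mydomain.com"); src.login("me", "mypw")
#   dst = imaplib.IMAP4_SSL("mail.mydomain.com"); dst.login("hamacct", "hampw")
#   print(copy_messages(src, dst))
```

The messages still only sit in the training account's inbox afterwards; the zmtrainsa run shown above is what actually feeds them to the filters.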
Question #1: Will my "re-setting" (assuming they were ever set in the first place) of the training account passwords cause a problem for the nightly training cron jobs?
Question #2: Some messages arrive with no DSpam headers whatsoever. What circumstances would cause this?
Question #3: Could we possibly get a timestamp at the beginning of a given night's batch in spamtrain.log? :)
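In the meantime, a stopgap for Question #3 might be to pipe the training output through a small filter that stamps each line before it reaches the log. This is just a sketch in Python; whether the nightly job's output can be redirected through such a filter is an assumption on my part, and the function names are invented for the example.

```python
import sys
from datetime import datetime


def stamp_lines(lines, now=datetime.now):
    """Yield each input line prefixed with the current date and time."""
    for line in lines:
        stamp = now().strftime("%Y-%m-%d %H:%M:%S")
        yield "%s %s" % (stamp, line.rstrip("\n"))


def main():
    """Read lines from standard input and print them with timestamps."""
    for stamped in stamp_lines(sys.stdin):
        print(stamped)
```

A script that calls `main()` could then sit between the training command and the log file in a shell pipeline.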
Looking forward to your thoughts. Thanks for listening!