Creating a spam-stat dictionary (Gnus Manual)

10.18.8.1 Creating a spam-stat dictionary

Before you can begin to filter spam based on statistics, you must create these statistics from two mail collections, one with spam and one with non-spam. These statistics are then stored in a dictionary for later use. For the statistics to be meaningful, you need several hundred emails in each collection.

Gnus currently supports only the nnml back end for automated dictionary creation. The nnml back end stores all mails in a directory, one file per mail. Use the following:

Function: spam-stat-process-spam-directory: Create spam statistics for every file in this directory. Every file is treated as one spam mail.

Function: spam-stat-process-non-spam-directory: Create non-spam statistics for every file in this directory. Every file is treated as one non-spam mail.

Variable: spam-stat-process-directory-age: Maximum age of files to be processed, in days. Without this filter, re-training spam-stat with several thousand messages could take a long time. The default is 90, but you might want to set this to a larger value during the initial training.

Usually you would call spam-stat-process-spam-directory on a directory such as ~/Mail/mail/spam (this usually corresponds to the group ‘nnml:mail.spam’), and you would call spam-stat-process-non-spam-directory on a directory such as ~/Mail/mail/misc (this usually corresponds to the group ‘nnml:mail.misc’).
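As a minimal sketch of the initial training run described above, the following could be evaluated once in a running Emacs. The directory names are the examples from the text, and the age of 365 days is an arbitrary assumption chosen only to illustrate raising the limit for a first pass.

```elisp
;; Sketch only: the directories are the examples used in the text and
;; the value 365 is an arbitrary choice for an initial training run.
(require 'spam-stat)

;; Look further back than the default of 90 days.
(setq spam-stat-process-directory-age 365)

;; Every file in the first directory counts as one spam mail,
;; every file in the second as one non-spam mail.
(spam-stat-process-spam-directory
 (expand-file-name "~/Mail/mail/spam"))
(spam-stat-process-non-spam-directory
 (expand-file-name "~/Mail/mail/misc"))
```

After the initial training you may want to set spam-stat-process-directory-age back to a smaller value so that later runs stay fast.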

When you are using IMAP, the mails are not stored locally, so the functions above have no directory to read. One solution is to use the Gnus Agent to cache the articles. Then you can use directories such as "~/News/agent/nnimap/mail.yourisp.com/personal_spam" for spam-stat-process-spam-directory. See Agent as Cache.
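A hedged sketch of the IMAP case, assuming the Gnus Agent has already cached the group in question. The path is the placeholder from the text, so the server and group components must be replaced with your own; the existence check is an illustrative precaution, not something spam-stat requires.

```elisp
(require 'spam-stat)

;; Placeholder path copied from the text; substitute your own
;; server name and group name.
(let ((dir (expand-file-name
            "~/News/agent/nnimap/mail.yourisp.com/personal_spam")))
  (if (file-directory-p dir)
      ;; Treat every cached article in DIR as one spam mail.
      (spam-stat-process-spam-directory dir)
    (message "Agent cache directory %s not found" dir)))
```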

Variable: spam-stat: This variable holds the hash-table with all the statistics, that is, the dictionary we have been talking about. For every word in either collection, this hash-table stores a vector describing how often the word appeared in spam and how often it appeared in non-spam mails.
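Because the dictionary is an ordinary hash-table, it can be inspected with the standard Emacs Lisp hash-table functions. The sketch below assumes the dictionary has already been built or loaded in the current session, and the word "money" is an arbitrary example key. The entry is printed as-is, since the layout of the vector is not described here.

```elisp
(require 'spam-stat)

;; Number of distinct words currently in the dictionary.
(message "spam-stat holds %d words" (hash-table-count spam-stat))

;; Look up a single word; "money" is only an example key.
(let ((entry (gethash "money" spam-stat)))
  (if entry
      (message "Entry for \"money\": %S" entry)
    (message "\"money\" is not in the dictionary")))
```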

If you want to regenerate the statistics from scratch, you need to reset the dictionary.

Function: spam-stat-reset: Reset the spam-stat hash-table, deleting all the statistics.

When you are done, you must save the dictionary. The dictionary may be rather large. If you will not update the dictionary incrementally (instead, you will recreate it once a month, for example), then you can reduce the size of the dictionary by deleting all words that did not appear often enough or that do not clearly belong to only spam or only non-spam mails.

Function: spam-stat-reduce-size: Reduce the size of the dictionary. Use this only if you do not want to update the dictionary incrementally.

Function: spam-stat-save: Save the dictionary.

Variable: spam-stat-file: The filename used to store the dictionary. This defaults to ~/.spam-stat.el.
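Putting the pieces together, a complete rebuild from scratch might look like the following sketch. It assumes the example directories used earlier in this section and a dictionary that is recreated periodically rather than updated incrementally, which is why it calls spam-stat-reduce-size. The alternative file name is a hypothetical example; omit that line to keep the default.

```elisp
(require 'spam-stat)

;; Optional: store the dictionary somewhere other than the default.
;; This path is a hypothetical example.
(setq spam-stat-file (expand-file-name "~/Mail/spam-stat.el"))

;; Start from an empty dictionary.
(spam-stat-reset)

;; Train on the two collections (example directories from the text).
(spam-stat-process-spam-directory
 (expand-file-name "~/Mail/mail/spam"))
(spam-stat-process-non-spam-directory
 (expand-file-name "~/Mail/mail/misc"))

;; Only appropriate when the dictionary is not updated incrementally.
(spam-stat-reduce-size)

;; Write the dictionary to `spam-stat-file'.
(spam-stat-save)
```

If you do intend to update the dictionary incrementally, leave out the call to spam-stat-reduce-size.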