# Examining word scores in Thunderbird spam filter

Discussion in 'Firefox' started by Jeff Evans, Mar 31, 2005.

1. ### Jeff Evans (Guest)

I know that Thunderbird uses Bayesian filtering for its spam filter. Is
there any way to examine what probabilities TB has assigned to each word
in a form that is understandable by humans? Or is the model described at:

http://en.wikipedia.org/wiki/Bayesian_filtering

oversimplified? Also is it possible to view/modify the threshold?

Jeff Evans, Mar 31, 2005

2. ### Moz Champion (Guest)

Jeff Evans wrote:
> I know that Thunderbird uses Bayesian filtering for its spam filter. Is
> there any way to examine what probabilities TB has assigned to each word
> in a form that is understandable by humans? Or is the model described at:
>
> http://en.wikipedia.org/wiki/Bayesian_filtering
>
> oversimplified? Also is it possible to view/modify the threshold?

You teach TB's JMC (Junk Mail Controls) by telling it which messages are
spam, and which messages it marked as spam are not.
That adjusts the threshold, modifying it on an ongoing basis.

Note: if you unmark a message JMC thought was spam, ALL the properties
used to determine it was spam are depreciated.

The 'threshold' is developed by each individual user, according to the
type and contents of the actual spam they receive versus the non-spam
they receive, so it varies accordingly. The actual threshold is
established when first activating JMC, and modified constantly every time
you mark another message as spam (or unmark one).

Moz Champion, Apr 1, 2005

3. ### Jeff Evans (Guest)

Moz Champion wrote:
> Jeff Evans wrote:
>
>> I know that Thunderbird uses Bayesian filtering for its spam filter.
>> Is there any way to examine what probabilities TB has assigned to each
>> word in a form that is understandable by humans? Or is the model
>> described at:
>>
>> http://en.wikipedia.org/wiki/Bayesian_filtering
>>
>> oversimplified? Also is it possible to view/modify the threshold?

>
> You teach TB's JMC (Junk Mail Controls) by telling it which messages are
> spam, and which messages it marked as spam are not.
> That adjusts the threshold, modifying it on an ongoing basis.
>
> Note: if you unmark a message JMC thought was spam, ALL the properties
> used to determine it was spam are depreciated.
>
> The 'threshold' is developed by each individual user, according to the
> type and contents of the actual spam they receive versus the non-spam
> they receive, so it varies accordingly. The actual threshold is
> established when first activating JMC, and modified constantly every time
> you mark another message as spam (or unmark one).

Thanks for your response, but let me clarify a bit more. I understand
that as e-mail arrives, the filter calculates the Prob{Spam given Words}
according to the Words in the message, then based on that probability,
may mark it. What I'm interested in seeing is the Prob{Words given
Spam}, which I'm presuming is part of the "training" and is updated with
each successful or unsuccessful attempt. Basically, for curiosity's
sake, I'd just like to see how I have trained my filter by looking at
these values, if possible.
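
To make the distinction concrete: Prob{Spam given Words} can be obtained from Prob{Words given Spam} (and the matching non-spam values) through Bayes' rule. The sketch below is a generic naive Bayes combination in Python, not Thunderbird's actual code. The function name `spam_posterior`, the 0.5 prior, and the example probabilities are all made up for illustration.

```python
import math


def spam_posterior(words, p_word_given_spam, p_word_given_ham, prior_spam=0.5):
    """Combine per-word likelihoods into P(spam | words) with naive Bayes.

    Assumes words are independent given the class; real filters use
    refinements of this idea.
    """
    # Work in log space so long messages do not underflow to zero.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in words:
        # Words never seen in training carry no evidence in this sketch.
        if word in p_word_given_spam and word in p_word_given_ham:
            log_spam += math.log(p_word_given_spam[word])
            log_ham += math.log(p_word_given_ham[word])
    # P(spam | words) = e^log_spam / (e^log_spam + e^log_ham)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Hypothetical trained values: Prob{word given Spam} and Prob{word given Ham}.
p_spam = {"viagra": 0.20, "meeting": 0.01}
p_ham = {"viagra": 0.001, "meeting": 0.10}

print(spam_posterior(["viagra"], p_spam, p_ham))
print(spam_posterior(["meeting"], p_spam, p_ham))
```

The per-word tables are the part that training would update; the posterior is recomputed for each incoming message and compared against whatever cutoff the filter uses.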

Jeff Evans, Apr 1, 2005

4. ### Moz Champion (Guest)

Jeff Evans wrote:
> Moz Champion wrote:
>
>> Jeff Evans wrote:
>>
>>> I know that Thunderbird uses Bayesian filtering for its spam filter.
>>> Is there any way to examine what probabilities TB has assigned to
>>> each word in a form that is understandable by humans? Or is the
>>> model described at:
>>>
>>> http://en.wikipedia.org/wiki/Bayesian_filtering
>>>
>>> oversimplified? Also is it possible to view/modify the threshold?

>>
>> You teach TB's JMC (Junk Mail Controls) by telling it which messages
>> are spam, and which messages it marked as spam are not.
>> That adjusts the threshold, modifying it on an ongoing basis.
>>
>> Note: if you unmark a message JMC thought was spam, ALL the properties
>> used to determine it was spam are depreciated.
>>
>> The 'threshold' is developed by each individual user, according to the
>> type and contents of the actual spam they receive versus the non-spam
>> they receive, so it varies accordingly. The actual threshold is
>> established when first activating JMC, and modified constantly
>> every time you mark another message as spam (or unmark one).

>
> Thanks for your response, but let me clarify a bit more. I understand
> that as e-mail arrives, the filter calculates the Prob{Spam given Words}
> according to the Words in the message, then based on that probability,
> may mark it. What I'm interested in seeing is the Prob{Words given
> Spam}, which I'm presuming is part of the "training" and is updated with
> each successful or unsuccessful attempt. Basically, for curiosity's
> sake, I'd just like to see how I have trained my filter by looking at
> these values, if possible.

You can see the various attributes added by looking at
training.dat in a text editor (it's located in your profile folder).
Each 'update' or modification may include several words or word groups
that increment or decrement the ratio, according to your usage.

Whether or not that will aid you in any meaningful manner, though, I
can't say. It's only when the full training.dat file is taken into
consideration that JMC determines whether or not a message is spam.

For example, here's a copy of the first few entries in my training.dat file:

pokc510016
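
The increment/decrement bookkeeping described above can be sketched roughly as follows. This is only an illustration of the general idea in Python and makes no claim about the real layout of training.dat or about how JMC stores its data. The `TokenCounts` class, its method names, and the add-one smoothing are assumptions introduced for the example.

```python
from collections import Counter


class TokenCounts:
    """Hypothetical per-word tallies kept by a trainable spam filter."""

    def __init__(self):
        self.spam = Counter()  # number of spam messages containing each word
        self.ham = Counter()   # number of non-spam messages containing each word
        self.n_spam = 0        # total messages marked as spam
        self.n_ham = 0         # total messages marked as not spam

    def mark(self, words, is_spam):
        """Increment counts when the user marks a message."""
        (self.spam if is_spam else self.ham).update(set(words))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def unmark(self, words, was_spam):
        """Decrement counts when the user reverses an earlier marking."""
        (self.spam if was_spam else self.ham).subtract(set(words))
        if was_spam:
            self.n_spam -= 1
        else:
            self.n_ham -= 1

    def p_word_given_spam(self, word):
        """Estimate Prob{word given Spam} with add-one smoothing (an assumption)."""
        return (self.spam[word] + 1) / (self.n_spam + 2)


counts = TokenCounts()
counts.mark(["cheap", "pills"], is_spam=True)
counts.mark(["cheap", "meeting"], is_spam=True)
print(counts.p_word_given_spam("cheap"))

# Reversing the second marking lowers the estimate for every word in it.
counts.unmark(["cheap", "meeting"], was_spam=True)
print(counts.p_word_given_spam("cheap"))
```

Reading the values back out in this form is what would let you see how the filter has been trained, word by word.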

Moz Champion, Apr 1, 2005