"Arthur" <(E-Mail Removed)> wrote in message news:(E-Mail Removed)...

> >Maybe there was some notice about using Python in geophysics and the
> >symposium book in one journal, so there was a sudden spate of, say,
> >three people who bought both.

>

> You would think the parameter for a statistically significant sample size
> would be a fundamental concept in this kind of thing. And no action taken
> before one was determined to exist.

>
Statistical tests take sample size into account (so, e.g., a larger effect
will tend to be statistically significant at a smaller sample size). Sample
size calculations are more useful when you're in a position to determine
how large the sample will be.
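To illustrate the point about sample size: with the same observed proportion, the p-value shrinks as the sample grows. A rough sketch using an exact two-sided binomial test (the counts and the 60% proportion are invented for illustration):

```python
import math

def binom_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial test: sum the probabilities of
    every outcome no more likely than the one observed."""
    def pmf(i):
        return math.comb(n, i) * p0**i * (1 - p0)**(n - i)
    observed = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed + 1e-12)

# Same observed proportion (60%), increasing sample sizes:
for k, n in [(6, 10), (60, 100), (120, 200)]:
    print(n, binom_two_sided_p(k, n))
# The p-value shrinks as n grows: about 0.75 at n=10 (nothing unusual),
# borderline at n=100, and clearly below 0.05 at n=200.
```

The same 60/40 split that is entirely unremarkable in ten observations becomes strong evidence in two hundred, which is why a fixed "significant sample size" isn't really a meaningful parameter on its own.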

> OTOH, the concept of "coincidence" must necessarily be ruled out in AI, I
> would think.

>
Coincidence can't generally be ruled out, but you can look for relationships
in the (sample) data that would be unlikely to be present if the same
relationships weren't also present in the population.

> *Our* intelligence seems to give us a read as to where on the bell curve a
> particular event may lie, or at least some sense of when we are at an
> extreme on the curve. Which we call coincidence. AI would probably have a
> particularly difficult time with this concept - it seems to me.

>
Some people have a difficult time with (or are unaware of) "statistical
thinking". Maybe some of them are involved in AI? (Well, of course some of
them are.)

> Spam filtering software must need to tackle these kinds of issues.

>
It can do, and I've no doubt some of it does. Spam filtering is a
classification problem and can be handled in a variety of ways. It's
generally easy to come up with an overly complex set of rules / model that
will correctly classify sample data. But (as you know) the idea is to come
up with a set of rules / model that will correctly (as far as possible)
classify future data. As many spam filters use Bayesian methods, I would
guess that they might be fitted using Bayesian methods; in which case overly
complex models can be (at least partially) avoided through the choice of
prior, rather than significance testing.
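For instance, the core of a naive Bayes classifier of that flavour fits in a few lines of Python. This is only a toy sketch: the training data is invented, and add-one (Laplace) smoothing stands in for the kind of prior that keeps rare words from dominating:

```python
import math
from collections import Counter

# Invented toy training data: tokenised messages by class.
spam = [["cheap", "pills", "offer"], ["offer", "win", "cash"], ["cheap", "cash"]]
ham = [["meeting", "tomorrow"], ["python", "meeting", "notes"], ["lunch", "tomorrow"]]

spam_counts = Counter(w for doc in spam for w in doc)
ham_counts = Counter(w for doc in ham for w in doc)
vocab = len(set(spam_counts) | set(ham_counts))

def score(words, counts):
    # Add-one smoothing acts like a weak prior: a word never seen in
    # one class doesn't force that class's probability to zero.
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)

def classify(words):
    # Equal class priors here (three examples of each), so they cancel.
    return "spam" if score(words, spam_counts) > score(words, ham_counts) else "ham"

print(classify(["cheap", "offer"]))    # spam
print(classify(["meeting", "notes"]))  # ham
```

With so little data this model would memorise anything; the smoothing constant is doing the regularising work that a fuller Bayesian treatment would get from the prior.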

What do Amazon use? My guess (unless it's something really naive) would be
association rules.
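If it is association rules, the core computation is just co-occurrence counting over purchase baskets, along these lines (the baskets and item names are made up, and real systems add support/lift thresholds on top):

```python
from itertools import combinations
from collections import Counter

# Invented purchase baskets.
baskets = [
    {"python_book", "geophysics_book"},
    {"python_book", "geophysics_book", "novel"},
    {"python_book", "novel"},
    {"novel"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

# Confidence of the rule "python_book -> geophysics_book":
pair = frozenset({"python_book", "geophysics_book"})
confidence = pair_counts[pair] / item_counts["python_book"]
print(confidence)  # 2 of the 3 python_book baskets also contain geophysics_book
```

Which brings it back to the original question: with only a handful of baskets, a confidence like that is exactly the sort of "relationship" that could easily be coincidence.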

Duncan

> Art
