karthik raghunathan

Greenhorn

Posts: 13

posted 5 years ago

Edit: deleted the original post, removed the links to GitHub and copied the code here instead. Corrected spelling.

My naive bayes classifier started out as a spam filter and now has been recruited to classify whether a text is by Dickens or Twain.

First of all, would this be the right forum to ask this question?

Second, it doesn't work very well. Can anyone help me correct the algorithm? I sorta copied some of it from the shiffman.net tutorial, which more or less follows the approach in Paul Graham's 'A Plan for Spam'.

PS: the code is not in OOP style; it's more or less procedural. Is this a problem?

Dave Trower

Ranch Hand

Posts: 87

posted 5 years ago

I read an article on how spam filters work, but my experience is limited to that one article. You need to create a dictionary that, for each word used in any of the books, returns the probability that a book containing that word is by Twain rather than Dickens. I think you do this by counting the frequency of each word.

Then when you are given a sample book, you look up the probabilities for each word from the dictionary and then apply the Bayesian algorithm.

Let me know if this helps.
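To make the dictionary idea concrete, here is a minimal sketch in Python. It is not the code from the original post (that code is not shown in the thread); the function names `word_counts` and `build_dictionary` are made up for illustration, and it assumes both training texts are non-empty.

```python
from collections import Counter
import re


def word_counts(text):
    """Lowercase the text and count each word (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_dictionary(twain_text, dickens_text):
    """Map each word to a rough estimate of P(Twain | word).

    Assumption: both texts contain at least one word, otherwise the
    totals below would be zero.
    """
    twain = word_counts(twain_text)
    dickens = word_counts(dickens_text)
    twain_total = sum(twain.values())
    dickens_total = sum(dickens.values())

    table = {}
    for word in set(twain) | set(dickens):
        # Relative frequency of the word in each author's training text.
        r_twain = twain[word] / twain_total
        r_dickens = dickens[word] / dickens_total
        table[word] = r_twain / (r_twain + r_dickens)
    return table


table = build_dictionary("the river the raft", "the fog the court")
print(table["the"], table["river"], table["fog"])
```

With these toy inputs, a word used equally often by both authors lands at 0.5, and a word seen in only one author's text lands at exactly 1.0 or 0.0, which is why some kind of default or smoothing is usually needed.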

karthik raghunathan

Greenhorn

Posts: 13

Dave Trower

Ranch Hand

Posts: 87

posted 5 years ago

I would suggest you google the words "bayesian for spam". I do not have the original article but this is a good one:

webpage

Here is a quote from the article:

This word probability is calculated as follows: If the word “mortgage” occurs in 400 of 3,000 spam mails and in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000] divided by [5/300 + 400/3000]).

So now in the dictionary, the probability for the word mortgage is 0.8889.
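The arithmetic in the quote can be checked with a few lines of Python. This is only a sketch of that one formula; the function name `word_spam_probability` and the neutral 0.5 returned for a word seen in neither set are assumptions, not something the article specifies.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of one word: spam rate / (ham rate + spam rate)."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        # Assumption: a word never seen anywhere is treated as neutral.
        return 0.5
    return spam_rate / (spam_rate + ham_rate)


# "mortgage": 400 of 3,000 spam mails, 5 of 300 legitimate mails.
p = word_spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```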

So if the word mortgage is used in an e-mail, then going by that word alone there is an 88.89% chance the e-mail is spam. However, the Bayesian filter looks at all the words in an e-mail, so the total probability of an e-mail being spam changes based on the other words.
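One common way to combine the per-word values, and the one used in Paul Graham's 'A Plan for Spam', is prod(p) / (prod(p) + prod(1 - p)). The sketch below computes the same quantity with logarithms so that long texts do not underflow. The function name `combine` and the overflow guard are assumptions for illustration; the clamping of each value to the range 0.01 to 0.99 follows Graham's article and keeps a single 0 or 1 from deciding the whole result.

```python
import math


def combine(probs, low=0.01, high=0.99):
    """Combine per-word probabilities: prod(p) / (prod(p) + prod(1 - p))."""
    # Clamp so that log(0) never occurs and no single word is decisive.
    clamped = [min(high, max(low, p)) for p in probs]
    log_yes = sum(math.log(p) for p in clamped)
    log_no = sum(math.log(1.0 - p) for p in clamped)
    # prod(p) / (prod(p) + prod(1-p)) == 1 / (1 + exp(log_no - log_yes))
    diff = log_no - log_yes
    if diff > 700:
        # exp() would overflow; the result is effectively zero.
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


print(combine([0.8889, 0.5]))  # a neutral word leaves the estimate near 0.8889
print(combine([0.9, 0.9]))     # two agreeing words push it higher
```

A word with probability 0.5 does not move the result, while several words that agree reinforce each other, which is why the combined figure can end up much closer to 0 or 1 than any single word.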

In your case, you build a dictionary based on how often each word appears in the works of each of the two authors.

I think the output of your program should be something like:

There is a 99.3% chance the book I just looked at is Twain.

karthik raghunathan

Greenhorn

Posts: 13

posted 5 years ago

Okay, that is pretty much what I am doing, except for the final part, where I combine the probabilities for each of the words.

I am making two dictionaries, one for spam and one for ham, then I calculate the spam probability of a word as rSpam / (rSpam + rHam).

So I _am_ on the right track. Let me see if I am doing something wrong in the combining of probabilities...

I also return a default of 0.1 if the probability is 0. That might be a downer.

I'll update in a day
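On the default of 0.1: one alternative that is often used with naive Bayes, though nothing in this thread confirms it would fix the problem here, is add-one (Laplace) smoothing of the counts before taking the rSpam / (rSpam + rHam) ratio. A minimal sketch, where the function name, the parameters and `vocab_size` (the number of distinct words across both dictionaries) are all hypothetical:

```python
def smoothed_probability(count_a, total_a, count_b, total_b, vocab_size, alpha=1.0):
    """P(class A | word) from smoothed relative frequencies.

    Adding alpha to every count means no word gets a rate of exactly 0,
    so no hard-coded default is needed.
    """
    r_a = (count_a + alpha) / (total_a + alpha * vocab_size)
    r_b = (count_b + alpha) / (total_b + alpha * vocab_size)
    return r_a / (r_a + r_b)


# Word never seen in class A, seen 5 times in class B (100 words each, 50 distinct).
print(smoothed_probability(0, 100, 5, 100, 50))
# Word seen in neither class, with equal totals: stays neutral.
print(smoothed_probability(0, 100, 0, 100, 50))
```

Compared with a fixed 0.1, the smoothed value depends on how much evidence there is: a word missing from one dictionary but frequent in the other gets pushed further from 0.5 than a word that is rare in both.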
