
# Doubt in automatic email classification using the naive Bayes algorithm

gayathri murugesan
Ranch Hand
Posts: 32
I am planning to automatically classify emails as personal, business, etc., and store each one in the appropriate folder using the naive Bayes algorithm. Here the features are the keywords in the document and the classes are the folders. But I am stuck after that step. Please help me with how to apply the naive Bayes algorithm to my automatic mail classification application.

Oleg Tikhonov
Ranch Hand
Posts: 55
Hi,
"Here the features are the keywords in the document and the classes are the folders. But I am stuck after that step."

Oleg.

gayathri murugesan
Ranch Hand
Posts: 32
I am confused about how to apply this algorithm to our application of automatic mail classification. Can you please tell me how to calculate the probability of a message belonging to a folder?

Oleg Tikhonov
Ranch Hand
Posts: 55
| Attribute | Description |
|-----------|-------------|
| A | the mail belongs to folder F_1 |
| B | the mail belongs to folder F_2 |
| C | the mail has been classified before |
| P | the mail will be classified to F_2 (output) |

Let's assume that a mail belongs to folder F_1, also belongs to folder F_2, and has been classified before. We want to predict the probability that the mail will be classified to F_2:
Pr{P=T | A=T,B=T,C=T} = Pr{A=T,B=T,C=T | P=T} Pr{P=T} / Pr{A=T,B=T,C=T}
Pr{P=F | A=T,B=T,C=T} = Pr{A=T,B=T,C=T | P=F} Pr{P=F} / Pr{A=T,B=T,C=T}

One of the easiest ways to estimate an event's probability is to take its frequency count.
In our table, for example, suppose events A, B and C were observed 20 times in total: event A happened 5 times, event B 12 times, and event C 3 times.
Pr{A} = 5/20; Pr{B} = 12/20; Pr{C} = 3/20.

Pr{A or B} = Pr{A} + Pr{B} - Pr{A and B}
Pr{A and B} = Pr{A} Pr{B|A} = Pr{B} Pr{A|B} (the product rule, from which Bayes' rule follows)
The output attribute P can be either T (true) or F (false).
Something like that.
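
The frequency-count estimate above can be sketched in Java roughly as follows. This is only an illustration: the class and method names (`FrequencyDemo`, `probability`, `probabilityOfEither`) are made up for this sketch, and the value used for Pr{A and B} is an assumption, since the post does not give it.

```java
// Sketch: estimate probabilities from frequency counts, using the
// illustrative numbers from the post (20 observations; A: 5, B: 12, C: 3).
class FrequencyDemo {
    // Relative frequency: occurrences of the event divided by all observations.
    static double probability(int eventCount, int totalCount) {
        return (double) eventCount / totalCount;
    }

    // Pr{A or B} = Pr{A} + Pr{B} - Pr{A and B}
    static double probabilityOfEither(double prA, double prB, double prAandB) {
        return prA + prB - prAandB;
    }

    public static void main(String[] args) {
        int total = 20;
        double prA = probability(5, total);
        double prB = probability(12, total);
        double prC = probability(3, total);
        System.out.println("Pr{A} = " + prA + ", Pr{B} = " + prB + ", Pr{C} = " + prC);

        // Assumption for illustration only: A and B never occur together,
        // so Pr{A and B} = 0. The post does not state this value.
        double prAandB = 0.0;
        System.out.println("Pr{A or B} = " + probabilityOfEither(prA, prB, prAandB));
    }
}
```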

David Newton
Author
Rancher
Posts: 12617
It's still not clear to me where you're stuck: implementing the algorithm itself? Determining how to use the results?

gayathri murugesan
Ranch Hand
Posts: 32
I must find the keywords in the incoming mail and determine which folder is suitable for the mail. I am stuck at applying the naive Bayes algorithm to this problem.

For example, if I find the keywords in the mail to be

microsoft offers windows

then

suppose there are two folders: personal and technology.

How could I then apply the naive Bayes algorithm to classify the mail with the keywords "microsoft offers windows" into the appropriate folder?

Sorry for not explaining my problem in detail before.

David Newton
Author
Rancher
Posts: 12617
Have you already generated your score(s)? If you have, shouldn't it just be a matter of picking your cut-off point?

gayathri murugesan
Ranch Hand
Posts: 32
According to my example:

p(personal) * p(microsoft | personal) * p(offers | personal) * p(windows | personal)

p(technology) * p(microsoft | technology) * p(offers | technology) * p(windows | technology)

Have I arrived at the correct step?

If so, what would be the probability of the folders and the probability of a word given a folder?
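
One common way to fill in those two quantities is by counting over mails that have already been filed: p(folder) is the fraction of stored mails in that folder, and p(word | folder) is the number of times the word appears among that folder's keywords divided by the total number of keywords in the folder. The Java sketch below assumes exactly that. The class `NaiveBayesSketch`, its methods, and the six training mails in `demoModel()` are all made up for illustration and are not from the thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a plain (unsmoothed) naive Bayes folder classifier.
class NaiveBayesSketch {
    private final Map<String, Integer> mailsPerFolder = new HashMap<>();
    private final Map<String, Integer> keywordsPerFolder = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private int totalMails = 0;

    // Record one already-filed mail together with its keywords.
    void train(String folder, List<String> keywords) {
        totalMails++;
        mailsPerFolder.merge(folder, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(folder, f -> new HashMap<>());
        for (String word : keywords) {
            counts.merge(word, 1, Integer::sum);
            keywordsPerFolder.merge(folder, 1, Integer::sum);
        }
    }

    // p(folder): fraction of stored mails that sit in this folder.
    double prior(String folder) {
        if (totalMails == 0) return 0.0;
        return (double) mailsPerFolder.getOrDefault(folder, 0) / totalMails;
    }

    // p(word | folder): share of this folder's keywords that are this word.
    double likelihood(String word, String folder) {
        int total = keywordsPerFolder.getOrDefault(folder, 0);
        if (total == 0) return 0.0;
        return (double) wordCounts.get(folder).getOrDefault(word, 0) / total;
    }

    // p(folder) * p(w1 | folder) * p(w2 | folder) * ...
    double score(String folder, List<String> mail) {
        double score = prior(folder);
        for (String word : mail) score *= likelihood(word, folder);
        return score;
    }

    // Pick the folder with the highest score.
    String classify(List<String> mail) {
        String best = null;
        double bestScore = -1.0;
        for (String folder : mailsPerFolder.keySet()) {
            double s = score(folder, mail);
            if (s > bestScore) { bestScore = s; best = folder; }
        }
        return best;
    }

    // Hypothetical training data, invented for this sketch.
    static NaiveBayesSketch demoModel() {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("technology", Arrays.asList("microsoft", "windows", "offers"));
        nb.train("technology", Arrays.asList("windows", "iphone"));
        nb.train("technology", Arrays.asList("microsoft", "itunes"));
        nb.train("personal", Arrays.asList("home", "offers", "school"));
        nb.train("personal", Arrays.asList("windows", "home"));
        nb.train("personal", Arrays.asList("microsoft", "college"));
        return nb;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = demoModel();
        List<String> mail = Arrays.asList("microsoft", "offers", "windows");
        System.out.println("technology score: " + nb.score("technology", mail));
        System.out.println("personal score:   " + nb.score("personal", mail));
        System.out.println("chosen folder:    " + nb.classify(mail));
    }
}
```

With this made-up data both folders contain 7 keywords, so the technology score works out to 0.5 * 2/7 * 1/7 * 2/7 and the personal score to 0.5 * 1/7 * 1/7 * 1/7, and the mail goes to technology. Note that any keyword never seen in a folder makes that folder's whole product zero.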

David Newton
Author
Rancher
Posts: 12617
So you've arrived at a spam probability, right? What's left to do?

gayathri murugesan
Ranch Hand
Posts: 32
I have a doubt here.

For example,

suppose I have already stored the mails with the keywords microsoft, windows, iphone, itunes, ipod in the technology folder and the mails with the keywords market, home, school, college in the personal folder.

Then

p(personal) = 0.5

p(technology) = 0.5

p(technology) * p(microsoft | technology) * p(offers | technology) * p(windows | technology)

0.5 * 1/5 * 0 = 0

p(personal) * p(microsoft | personal) * p(offers | personal) * p(windows | personal)

0.5 * 0 = 0

The mail with the keywords "microsoft offers windows" should actually be classified into the technology folder, but the probability turns out to be zero for both folders.
So I don't know whether I am on the right path.

David Newton
Author
Rancher
Posts: 12617
I'd say no, if it's not giving you the result you want--you might need to tweak your algorithm a bit.
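
One widely used tweak for the zero product described above is add-one (Laplace) smoothing: estimate p(word | folder) as (count + 1) / (keywords in folder + vocabulary size), so a word that was never seen in a folder gets a small probability instead of zero. The Java sketch below applies it to the keywords from the example. It is an assumption about what "tweak" could mean here, not something stated in the thread; the class `SmoothedNaiveBayes` and its methods are hypothetical, equal priors are assumed, and the vocabulary is taken to be the 9 stored keywords. Scores are summed as logarithms, which gives the same ranking as multiplying the probabilities.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: naive Bayes with add-one (Laplace) smoothing.
class SmoothedNaiveBayes {
    // folder -> keywords of the mails already stored in that folder
    private final Map<String, List<String>> keywordsByFolder;
    private final Set<String> vocabulary = new HashSet<>();

    SmoothedNaiveBayes(Map<String, List<String>> keywordsByFolder) {
        this.keywordsByFolder = keywordsByFolder;
        for (List<String> words : keywordsByFolder.values()) vocabulary.addAll(words);
    }

    // (count + 1) / (keywords in folder + vocabulary size); never zero.
    double likelihood(String word, String folder) {
        List<String> words = keywordsByFolder.get(folder);
        int count = Collections.frequency(words, word);
        return (count + 1.0) / (words.size() + vocabulary.size());
    }

    // log p(folder) + sum of log p(word | folder), with equal priors assumed.
    double logScore(String folder, List<String> mail) {
        double score = Math.log(1.0 / keywordsByFolder.size());
        for (String word : mail) score += Math.log(likelihood(word, folder));
        return score;
    }

    // Pick the folder with the highest log score.
    String classify(List<String> mail) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String folder : keywordsByFolder.keySet()) {
            double s = logScore(folder, mail);
            if (s > bestScore) { bestScore = s; best = folder; }
        }
        return best;
    }

    // The stored keywords from the example in this thread.
    static SmoothedNaiveBayes threadExample() {
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("technology", Arrays.asList("microsoft", "windows", "iphone", "itunes", "ipod"));
        data.put("personal", Arrays.asList("market", "home", "school", "college"));
        return new SmoothedNaiveBayes(data);
    }

    public static void main(String[] args) {
        SmoothedNaiveBayes nb = threadExample();
        List<String> mail = Arrays.asList("microsoft", "offers", "windows");
        System.out.println("technology: " + Math.exp(nb.logScore("technology", mail)));
        System.out.println("personal:   " + Math.exp(nb.logScore("personal", mail)));
        System.out.println("chosen:     " + nb.classify(mail));
    }
}
```

Under these assumptions the technology score becomes 0.5 * 2/14 * 1/14 * 2/14 and the personal score 0.5 * 1/13 * 1/13 * 1/13, so neither is zero and the mail is filed under technology.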

gayathri murugesan
Ranch Hand
Posts: 32
Thanks a lot for clearing my doubts.
