# Doubt in automatic email classification using the naive Bayes algorithm

gayathri murugesan

Ranch Hand

Posts: 32

posted 6 years ago

I am planning to automatically classify emails as personal, business, etc. and store each one in the appropriate folder using the naive Bayes algorithm. Here the features are the keywords in the document and the classes are the folders. But I am stuck after that step. Please help me with how to apply the naive Bayes algorithm to my automatic mail classification application.

Oleg Tikhonov

Ranch Hand

Posts: 55

posted 6 years ago

| Attribute | Description |
|---|---|
| A | is a mail belonging to folder F_1 |
| B | is a mail belonging to folder F_2 |
| C | has a mail been classified before |
| P | will a mail be classified to F_2 |

Let’s assume that a mail belongs to folder F_1, also belongs to folder F_2, and has been classified before. We want to predict the probability that the mail will be classified to F_2:

Pr{P=T|A=T,B=T,C=T} = Pr{A=T,B=T,C=T|P=T} Pr{P=T} / Pr{A=T,B=T,C=T}

Pr{P=F|A=T,B=T,C=T} = Pr{A=T,B=T,C=T|P=F} Pr{P=F} / Pr{A=T,B=T,C=T}
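The step that makes this "naive" is not spelled out above, so here is a hedged sketch of it. Assuming the attributes A, B and C are conditionally independent given the output P (the standard naive Bayes assumption, not something stated in this post), the joint likelihood factorizes, and because the denominator is the same for both outcomes, only the numerators need to be compared:

$$
\Pr\{A{=}T, B{=}T, C{=}T \mid P{=}T\} = \Pr\{A{=}T \mid P{=}T\}\,\Pr\{B{=}T \mid P{=}T\}\,\Pr\{C{=}T \mid P{=}T\}
$$

$$
\Pr\{P{=}T \mid A{=}T, B{=}T, C{=}T\} \;\propto\; \Pr\{P{=}T\}\,\Pr\{A{=}T \mid P{=}T\}\,\Pr\{B{=}T \mid P{=}T\}\,\Pr\{C{=}T \mid P{=}T\}
$$

The same factorization applies with P=F, and the outcome with the larger product is the predicted one.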

One of the easiest ways to estimate an event’s probability is to take its relative frequency.

In our table, for example, suppose the events A, B and C happened 20 times in total: event A happened 5 times, event B 12 times, and event C 3 times.

Pr{A}=5/20; Pr{B}=12/20; Pr{C}=3/20.

Pr{A or B} = Pr{A} + Pr{B} - Pr{A and B}

Pr{A and B} = Pr{A} Pr{B|A} = Pr{B} Pr{A|B} (the product rule, from which Bayes' rule follows)

The output attribute P can be either T (true) or F (false).

Something like that.
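As a rough illustration of the two ideas above (frequency counts as probability estimates, and Bayes' rule for a true/false output), here is a minimal Python sketch. The counts 5, 12, 3 and 20 come from the post; the function names and the likelihood and prior values passed to `bayes_posterior` are hypothetical, chosen only to show the arithmetic.

```python
from fractions import Fraction

# Counts taken from the post: 20 observations in total.
total = 20
counts = {"A": 5, "B": 12, "C": 3}


def relative_frequency(count, total):
    """Estimate a probability as count / total."""
    return Fraction(count, total)


probs = {event: relative_frequency(c, total) for event, c in counts.items()}


def bayes_posterior(likelihood_true, prior_true, likelihood_false, prior_false):
    """Pr{P=T | evidence} for a two-valued output P.

    The denominator Pr{evidence} is obtained by summing the two
    numerators (law of total probability).
    """
    numerator_true = likelihood_true * prior_true
    numerator_false = likelihood_false * prior_false
    return numerator_true / (numerator_true + numerator_false)


# Hypothetical inputs, for illustration only.
posterior = bayes_posterior(
    likelihood_true=Fraction(1, 2), prior_true=Fraction(1, 2),
    likelihood_false=Fraction(1, 4), prior_false=Fraction(1, 2),
)
print(probs["A"], probs["B"], probs["C"])
print(posterior)
```

Using `Fraction` keeps the results exact, which makes it easier to check them by hand against the formulas.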

gayathri murugesan

Ranch Hand

Posts: 32

posted 6 years ago

I need to find the keywords in an incoming mail and determine which folder is suitable for it. I am stuck on applying the naive Bayes algorithm to this problem.

For example, suppose the keywords found in the mail are

microsoft offers windows

and there are two folders, personal and technology.

How could I apply the naive Bayes algorithm to classify the mail with the keywords "microsoft offers windows" into the appropriate folder?

Sorry for not explaining my problem in detail before.

gayathri murugesan

Ranch Hand

Posts: 32

posted 6 years ago

According to my example:

p(personal) * p(microsoft | personal) * p(offers | personal) * p(windows | personal)

p(technology) * p(microsoft | technology) * p(offers | technology) * p(windows | technology)

Have I arrived at the correct step?

If so, what would be the probability of each folder, and the probability of each word given the folder?
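One common way to get these numbers is to estimate them from mails that have already been sorted: p(folder) as the fraction of mails in that folder, and p(word | folder) as the fraction of keyword occurrences in that folder that are that word. The sketch below assumes this counting approach; the training mails and the helper names `train`, `prior`, `likelihood` and `score` are hypothetical and not taken from this thread.

```python
from collections import Counter

# Hypothetical training data: (folder, keywords found in the mail).
training_mails = [
    ("technology", ["microsoft", "windows", "offers"]),
    ("technology", ["iphone", "itunes", "microsoft"]),
    ("personal", ["home", "school", "offers"]),
    ("personal", ["market", "college"]),
]


def train(mails):
    """Count mails per folder and keyword occurrences per folder."""
    folder_counts = Counter()
    word_counts = {}
    for folder, words in mails:
        folder_counts[folder] += 1
        word_counts.setdefault(folder, Counter()).update(words)
    return folder_counts, word_counts


def prior(folder, folder_counts):
    """p(folder): fraction of training mails stored in this folder."""
    return folder_counts[folder] / sum(folder_counts.values())


def likelihood(word, folder, word_counts):
    """p(word | folder): share of the folder's keyword occurrences."""
    counts = word_counts[folder]
    return counts[word] / sum(counts.values())


def score(words, folder, folder_counts, word_counts):
    """p(folder) * product of p(word | folder) over the mail's keywords."""
    result = prior(folder, folder_counts)
    for word in words:
        result *= likelihood(word, folder, word_counts)
    return result


folder_counts, word_counts = train(training_mails)
mail = ["microsoft", "offers", "windows"]
scores = {f: score(mail, f, folder_counts, word_counts) for f in folder_counts}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

With this made-up data the mail goes to the technology folder. Note that the personal score is exactly zero, because "microsoft" never occurs in a personal mail; unsmoothed counts behave this way for any unseen word.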

gayathri murugesan

Ranch Hand

Posts: 32

posted 6 years ago

I have a doubt here.

For example, suppose I have already stored mails with the keywords microsoft, windows, iphone, itunes and ipod in the technology folder, and mails with the keywords market, home, school and college in the personal folder.

Then

p(personal) = 0.5

p(technology) = 0.5

p(technology) * p(microsoft | technology) * p(offers | technology) * p(windows | technology)

0.5 * 1/5 * 0 * 1/5 = 0

p(personal) * p(microsoft | personal) * p(offers | personal) * p(windows | personal)

0.5 * 0 * 0 * 0 = 0

A mail with the keywords "microsoft offers windows" should be classified into the technology folder, but both probabilities turn out to be zero.

So I don't know whether I am on the right path.
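The zero comes from a keyword ("offers") that was never seen in a folder, which wipes out the whole product. A standard remedy, not mentioned in this thread, is Laplace (add-one) smoothing, usually combined with summing logarithms instead of multiplying small numbers. The sketch below applies it to the folders and priors from the post above; the function names, the `alpha` parameter and the choice to reserve one extra vocabulary slot for unseen words are assumptions made for illustration.

```python
import math
from collections import Counter

# Keywords already stored per folder, as described in the post.
folder_keywords = {
    "technology": ["microsoft", "windows", "iphone", "itunes", "ipod"],
    "personal": ["market", "home", "school", "college"],
}
priors = {"technology": 0.5, "personal": 0.5}

# All distinct keywords seen in any folder (9 words here).
vocabulary = {w for words in folder_keywords.values() for w in words}


def smoothed_likelihood(word, folder, alpha=1.0):
    """p(word | folder) with add-alpha smoothing, so it is never zero."""
    counts = Counter(folder_keywords[folder])
    total = sum(counts.values())
    # Assumption: one extra vocabulary slot stands in for unseen words.
    return (counts[word] + alpha) / (total + alpha * (len(vocabulary) + 1))


def log_score(words, folder):
    """log p(folder) + sum of log p(word | folder)."""
    return math.log(priors[folder]) + sum(
        math.log(smoothed_likelihood(w, folder)) for w in words
    )


def classify(words):
    """Return the folder with the highest log score."""
    return max(priors, key=lambda folder: log_score(words, folder))


mail = ["microsoft", "offers", "windows"]
predicted = classify(mail)
print(predicted)  # technology
```

Here p(microsoft | technology) becomes 2/15 and p(offers | technology) becomes 1/15 instead of 0, while every word in the mail gets 1/14 under personal, so technology has the larger score.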