Posing the Problem [1]
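Suppose we want to classify an unknown sample: given its three attributes X1, X2 and X3, decide whether it belongs to class 1 (C=1) or class 2 (C=2).
- $P(C=1|X1,X2,X3)>P(C=2|X1,X2,X3)$: given the sample's X1, X2 and X3, the probability that it belongs to class 1 is greater than the probability that it belongs to class 2. The existing samples therefore support class 1, and the sample is assigned to class 1.
- $P(C=1|X1,X2,X3)<P(C=2|X1,X2,X3)$: the existing samples support class 2, and the sample is assigned to class 2.
How do we obtain the two probabilities $P(C=1|X1,X2,X3)$ and $P(C=2|X1,X2,X3)$? The short answer is that we cannot obtain them directly. That does not matter, though, because knowing which of the two is larger is enough to make the decision:
- $P(C=1|X1,X2,X3)>P(C=2|X1,X2,X3)$: assign class 1;
- $P(C=1|X1,X2,X3)<P(C=2|X1,X2,X3)$: assign class 2.
Bayes' theorem provides a way to make this comparison.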
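Bayes' Theorem
\[P(C|X) = \frac{ P(X|C)P(C)}{ P(X) }\]
- P(C|X) is the posterior probability of C given the attributes X
- P(C) is the prior probability of C
This formula is known as "Bayes' theorem".
According to Bayes' theorem, we want the class with the largest P(C|X). Since P(X) is the same for every class, it is enough to find the largest P(X|C)P(C); this is the basis of naive Bayes classification.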
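To spell out why the denominator can be dropped, the two-class comparison from the first section can be rewritten with Bayes' theorem. This is only a sketch of the reasoning step, and it assumes nothing beyond P(X) > 0:

```latex
\[
\begin{aligned}
P(C=1|X) > P(C=2|X)
&\iff \frac{P(X|C=1)P(C=1)}{P(X)} > \frac{P(X|C=2)P(C=2)}{P(X)} \\
&\iff P(X|C=1)P(C=1) > P(X|C=2)P(C=2)
\end{aligned}
\]
```

Both sides are divided by the same positive number P(X), so the ordering of the two classes does not change.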
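Naive Bayes Classification
The naive Bayes classifier adopts the attribute conditional independence assumption: given the class, all attributes are assumed to be independent of one another. [2]
\[P(C|X) = \frac{ P(X|C)P(C)}{ P(X) } = \frac{P(C)}{P(X)} \prod_{i = 1}^{d}P(X_i|C)\]
Here d is the number of attributes. Minimizing the classification error rate means choosing the class with the largest posterior: $h^{*}(X) = \arg\max_{C} P(C|X)$
P(X) is the same for all classes, so:
\[h_{naivebayes}^{*}(X) = \arg\max_{C} P(C) \prod_{i=1}^{d} P(X_i|C)\]
Using Bayes' theorem, an unknown sample can be classified by finding the largest P(X|C)P(C): if max{P(X|C)P(C)} = P(X|C=n)P(C=n), the unknown sample belongs to class n, where
(1) P(C=i) = Si/S, where Si is the number of training samples in class Ci and S is the total number of training samples;
(2) computing P(X|C=i) directly can be very expensive because it involves many attribute variables, so we assume that the attribute values are conditionally independent given the class, i.e. that there are no dependencies among the attributes:
\[P(X|C=i) = \prod_{k=1}^{d} P(X_k|C=i)\]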
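The two estimates above, the prior Si/S and the product of per-attribute conditionals, can be sketched in a few lines of Python 3. The function name `naive_bayes_scores` and the toy rows are hypothetical and made up purely for illustration; they are not part of the original post.

```python
from collections import Counter


def naive_bayes_scores(rows, sample, label_key="class"):
    """Return {class: P(C) * prod_k P(X_k | C)}, estimated by simple counting."""
    class_counts = Counter(row[label_key] for row in rows)
    total = len(rows)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(C=i) = S_i / S
        for attr, value in sample.items():
            # rows of this class whose attribute matches the sample
            matches = sum(
                1 for row in rows
                if row[label_key] == cls and row[attr] == value
            )
            score *= matches / n_cls  # P(X_k | C=i)
        scores[cls] = score
    return scores


# Hypothetical toy data, invented only to exercise the function
toy_rows = [
    {"color": "red", "size": "big", "class": "a"},
    {"color": "red", "size": "small", "class": "a"},
    {"color": "red", "size": "big", "class": "a"},
    {"color": "blue", "size": "big", "class": "b"},
    {"color": "red", "size": "small", "class": "b"},
]
scores = naive_bayes_scores(toy_rows, {"color": "red", "size": "big"})
predicted = max(scores, key=scores.get)  # class with the largest score
print(scores, predicted)
```

On this toy data the score for class "a" is about 0.4 and the score for class "b" is about 0.1, so the sketch picks "a".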
Naive Bayes
PlayTennis (i.e., decide whether our friend will play tennis or not on a given day) [3]
# Training data: 14 days, four attributes each, plus the class label (play or not)
data = [
    {"outlook":"sunny", "temp":"hot", "humidity":"high", "wind":"weak", "class":"no" },
    {"outlook":"sunny", "temp":"hot", "humidity":"high", "wind":"strong", "class":"no" },
    {"outlook":"overcast", "temp":"hot", "humidity":"high", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"mild", "humidity":"high", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"cool", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"cool", "humidity":"normal", "wind":"strong", "class":"no" },
    {"outlook":"overcast", "temp":"cool", "humidity":"normal", "wind":"strong", "class":"yes" },
    {"outlook":"sunny", "temp":"mild", "humidity":"high", "wind":"weak", "class":"no" },
    {"outlook":"sunny", "temp":"cool", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"mild", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"sunny", "temp":"mild", "humidity":"normal", "wind":"strong", "class":"yes" },
    {"outlook":"overcast", "temp":"mild", "humidity":"high", "wind":"strong", "class":"yes" },
    {"outlook":"overcast", "temp":"hot", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"mild", "humidity":"high", "wind":"strong", "class":"no" }]
import pandas as pd
pd.DataFrame(data)
index | class | humidity | outlook | temp | wind |
---|---|---|---|---|---|
0 | no | high | sunny | hot | weak |
1 | no | high | sunny | hot | strong |
2 | yes | high | overcast | hot | weak |
3 | yes | high | rain | mild | weak |
4 | yes | normal | rain | cool | weak |
5 | no | normal | rain | cool | strong |
6 | yes | normal | overcast | cool | strong |
7 | no | high | sunny | mild | weak |
8 | yes | normal | sunny | cool | weak |
9 | yes | normal | rain | mild | weak |
10 | yes | normal | sunny | mild | strong |
11 | yes | high | overcast | mild | strong |
12 | yes | normal | overcast | hot | weak |
13 | no | high | rain | mild | strong |
test={"outlook":"sunny","temp":"cool","humidity":"high","wind":"strong"}
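# Estimate the prior P(C = cls_val) as the fraction of rows in that class
def P(data, cls_val, cls_name="class"):
    count = 0.0  # float, so the division below is not truncated in Python 2
    for e in data:
        if e[cls_name] == cls_val:
            count += 1
    return count / len(data)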
# The probability of play or not
PY, PN = P(data,"yes"), P(data, "no")
PY, PN
(0.6428571428571429, 0.35714285714285715)
# Estimate P(attr_name = attr_val | C = cls_val) by counting
def PT(data, cls_val, attr_name, attr_val, cls_name="class"):
    count1 = 0.0  # rows in class cls_val
    count2 = 0.0  # rows in class cls_val that also have attr_val
    for e in data:
        if e[cls_name] == cls_val:
            count1 += 1
            if e[attr_name] == attr_val:
                count2 += 1
    return count2 / count1
# P(outlook=sunny | yes) and P(outlook=sunny | no)
PT(data,"yes", "outlook", "sunny"), PT(data,"no", "outlook", "sunny")
(0.2222222222222222, 0.6)
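As a cross-check on these two numbers, the same conditional probabilities can be computed as exact fractions. The sketch below is Python 3 and self-contained: the (outlook, class) pairs are transcribed from the 14 rows of `data`, while the names `pairs` and `cond_prob` are assumptions introduced only for this illustration.

```python
from collections import Counter
from fractions import Fraction

# (outlook, class) pairs transcribed from the 14 training rows, in order
pairs = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rain", "yes"),
    ("rain", "yes"), ("rain", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rain", "no"),
]

class_counts = Counter(cls for _, cls in pairs)  # rows per class
joint_counts = Counter(pairs)                    # rows per (outlook, class)


def cond_prob(outlook, cls):
    """Estimate P(outlook | cls) as an exact fraction of counts."""
    return Fraction(joint_counts[(outlook, cls)], class_counts[cls])


for outlook in ("sunny", "overcast", "rain"):
    print(outlook, cond_prob(outlook, "yes"), cond_prob(outlook, "no"))
# sunny 2/9 3/5
# overcast 4/9 0
# rain 1/3 2/5
```

The fractions 2/9 and 3/5 agree with the floating-point values printed above. Note that "overcast" never occurs together with "no" in the training data, so that estimate is exactly 0 and would zero out the whole product for the "no" class.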
# Naive Bayes score P(C) * prod P(attr | C) for both classes
# (written with Python 2 print statements; the output below comes from Python 2)
def NB(data, test, cls_y, cls_n):
    PY = P(data, cls_y)
    PN = P(data, cls_n)
    print 'The probability of play or not:', PY, 'vs.', PN
    for key, val in test.items():
        PY *= PT(data, cls_y, key, val)
        PN *= PT(data, cls_n, key, val)
        print key, val, '-->play or not:-->', PY, PN
    return {cls_y: PY, cls_n: PN}
#calculate
NB(data,test,"yes","no")
The probability of play or not: 0.642857142857 vs. 0.357142857143
outlook sunny -->play or not:--> 0.142857142857 0.214285714286
wind strong -->play or not:--> 0.047619047619 0.128571428571
temp cool -->play or not:--> 0.015873015873 0.0257142857143
humidity high -->play or not:--> 0.00529100529101 0.0205714285714
{'no': 0.020571428571428574, 'yes': 0.005291005291005291}
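The printed trace can be reproduced by hand from the counts in the training table. The worked arithmetic below multiplies the factors in the same order as the output (prior, outlook=sunny, wind=strong, temp=cool, humidity=high):

```latex
\[
\begin{aligned}
P(\text{yes}) \prod_{i} P(X_i|\text{yes})
  &= \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9}
   = \frac{1}{189} \approx 0.00529 \\
P(\text{no}) \prod_{i} P(X_i|\text{no})
  &= \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5}
   = \frac{18}{875} \approx 0.0206
\end{aligned}
\]
```

The score for "no" is the larger of the two, so this sample is classified as "no".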
#calculate
NB(data,{"outlook":"sunny","temp":"hot","humidity":"normal","wind":"weak"},"yes","no")
The probability of play or not: 0.642857142857 vs. 0.357142857143
outlook sunny -->play or not:--> 0.142857142857 0.214285714286
wind weak -->play or not:--> 0.0952380952381 0.0857142857143
temp hot -->play or not:--> 0.021164021164 0.0342857142857
humidity normal -->play or not:--> 0.0141093474427 0.00685714285714
{'no': 0.006857142857142858, 'yes': 0.014109347442680775}
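The values returned by `NB` are the unnormalized products P(X|C)P(C), not posterior probabilities. Under the naive Bayes model, P(X) is the sum of these products over the classes, so dividing by that sum gives an estimate of P(C|X). A minimal Python 3 sketch, using the two dictionaries printed above as input; the helper name `normalize` is an assumption introduced here:

```python
def normalize(scores):
    """Divide each score by the sum of all scores so the values add up to 1."""
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# The two results printed by NB(...) above
first = normalize({"no": 0.020571428571428574, "yes": 0.005291005291005291})
second = normalize({"no": 0.006857142857142858, "yes": 0.014109347442680775})
print(first)
print(second)
```

For the first sample this gives roughly 0.80 for "no" against 0.20 for "yes"; for the second, roughly 0.67 for "yes" against 0.33 for "no". The predicted class is the same as before, since normalizing does not change which score is largest.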