
Problem statement ¹


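Suppose we want to determine the class of an unknown sample: given its three attributes X1, X2, and X3, decide whether it belongs to class 1 (C=1) or class 2 (C=2).

If $P(C=1|X1,X2,X3)>P(C=2|X1,X2,X3)$, then given X1, X2, and X3 the sample is more likely to belong to class 1 than to class 2. The existing samples support class 1, so we assign the unknown sample to class 1.
If $P(C=1|X1,X2,X3)<P(C=2|X1,X2,X3)$, the existing samples support class 2, so we assign the unknown sample to class 2.

How do we obtain the two probabilities $P(C=1|X1,X2,X3)$ and $P(C=2|X1,X2,X3)$? In general, we cannot obtain them directly. That is fine, because we only need to know which of the two is larger in order to decide:

$P(C=1|X1,X2,X3)>P(C=2|X1,X2,X3)$: assign class 1;
$P(C=1|X1,X2,X3)<P(C=2|X1,X2,X3)$: assign class 2.

Bayes' theorem provides a way to make this comparison.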

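Bayes' theorem

\[P(C|X) = \frac{ P(X|C)P(C)}{ P(X) }\]

P(C|X) is the posterior probability of C given the attributes X
P(C) is the prior probability of C
P(X|C) is the likelihood of the attributes X given the class C
P(X) is the probability of observing the attributes X

This formula is known as Bayes' theorem.

By Bayes' theorem, we want the class with the largest P(C|X). Because P(X) is the same constant for every class, it suffices to find the class with the largest P(X|C)P(C). This is the basis of naive Bayes classification.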
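As a minimal sketch of why P(X) can be dropped, the snippet below ranks two classes first by P(X|C)P(C) and then by the full posterior. The priors and likelihoods are made-up numbers chosen only for illustration, not values from the text.

```python
# Hypothetical two-class example; all numbers are illustrative assumptions.
prior = {1: 0.7, 2: 0.3}        # P(C)
likelihood = {1: 0.2, 2: 0.5}   # P(X|C) for one fixed observation X

# Unnormalized scores P(X|C) * P(C)
score = {c: likelihood[c] * prior[c] for c in prior}

# P(X) is the sum of the scores over all classes (law of total probability)
p_x = sum(score.values())
posterior = {c: score[c] / p_x for c in score}

# Dividing every score by the same positive constant cannot change the winner
best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_score, best_by_posterior)
```

Both rankings pick the same class; the division by P(X) only rescales the scores so that they sum to 1.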

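Naive Bayes classification

The naive Bayes classifier adopts the attribute conditional independence assumption: given the class, all attributes are assumed to be independent of one another.²

\[P(C|X) = \frac{ P(X|C)P(C)}{ P(X) } = \frac{P(C)}{P(X)} \prod_{i = 1}^{d}P(X_i|C)\]

Minimizing the classification error rate means choosing the class with the largest posterior: $h^{*}(x) = \arg\max_{c} P(c|x)$

P(X) is the same for all classes, so:

\[h_{naivebayes}^{*}(X) = \arg\max_{C} P(C) \prod_{i=1}^{d} P(X_i|C)\]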
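The decision rule above can be sketched in a few lines. The function name `naive_bayes_predict`, the class labels, the attribute names, and every probability in the tables are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical probability tables; every number here is an illustrative assumption.
prior = {"a": 0.5, "b": 0.5}   # P(C)
cond = {                        # P(X_i = value | C)
    "a": {"x1": {"low": 0.8, "high": 0.2}, "x2": {"red": 0.4, "blue": 0.6}},
    "b": {"x1": {"low": 0.3, "high": 0.7}, "x2": {"red": 0.5, "blue": 0.5}},
}

def naive_bayes_predict(sample, prior, cond):
    """Return (best class, scores), where score(C) = P(C) * prod_i P(X_i|C)."""
    scores = {}
    for c in prior:
        s = prior[c]
        for attr, value in sample.items():
            s *= cond[c][attr][value]
        scores[c] = s
    return max(scores, key=scores.get), scores

label, scores = naive_bayes_predict({"x1": "high", "x2": "red"}, prior, cond)
print(label, scores)
```

With these tables the score for class "a" is 0.5 × 0.2 × 0.4 = 0.04 and for class "b" is 0.5 × 0.7 × 0.5 = 0.175, so the sample is assigned to "b".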

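Using Bayes' theorem, an unknown sample can be classified by finding the largest P(X|C)P(C): if max{P(X|C)P(C)}=P(X|C=n)P(C=n), the unknown sample belongs to class n, where

(1) P(C=i)=Si/S, where Si is the number of training samples in class Ci and S is the total number of training samples;

(2) computing P(X|C=i) directly can be very expensive, because it involves many attribute variables. Here we can assume that the attribute values are conditionally independent of one another, i.e., that there are no dependencies among the attributes once the class is known:

\[P(X|C=i) = \prod_{k=1}^{d} P(X_k|C=i)\]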
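To give a rough sense of the cost that the independence assumption avoids, the sketch below counts how many free probabilities must be estimated per class, with and without the assumption. The helper names and the attribute sizes (3, 3, 2, and 2 possible values) are assumptions made for illustration.

```python
def joint_params_per_class(n_values):
    """Free parameters for the full joint P(X1, ..., Xd | C) of one class."""
    combos = 1
    for n in n_values:
        combos *= n
    return combos - 1   # the probabilities sum to 1, so the last one is determined

def naive_params_per_class(n_values):
    """Free parameters when the attributes are independent given the class."""
    return sum(n - 1 for n in n_values)

# Assumed example: four attributes with 3, 3, 2, and 2 possible values.
n_values = [3, 3, 2, 2]
print(joint_params_per_class(n_values), naive_params_per_class(n_values))
# prints: 35 6
```

The joint count grows multiplicatively with each added attribute, while the naive count grows only additively, which is why the assumption makes estimation from a small training set feasible.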

Naive Bayes

PlayTennis (i.e., decide whether our friend will play tennis on a given day) ³

#data
data = [
    {"outlook":"sunny", "temp":"hot", "humidity":"high", "wind":"weak", "class":"no" },
    {"outlook":"sunny", "temp":"hot", "humidity":"high", "wind":"strong", "class":"no" },
    {"outlook":"overcast", "temp":"hot", "humidity":"high", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"mild", "humidity":"high", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"cool", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"cool", "humidity":"normal", "wind":"strong", "class":"no" },
    {"outlook":"overcast", "temp":"cool", "humidity":"normal", "wind":"strong", "class":"yes" },
    {"outlook":"sunny", "temp":"mild", "humidity":"high", "wind":"weak", "class":"no" },
    {"outlook":"sunny", "temp":"cool", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"mild", "humidity":"normal", "wind":"weak", "class":"yes" },  
    {"outlook":"sunny", "temp":"mild", "humidity":"normal", "wind":"strong", "class":"yes" },
    {"outlook":"overcast", "temp":"mild", "humidity":"high", "wind":"strong", "class":"yes" },
    {"outlook":"overcast", "temp":"hot", "humidity":"normal", "wind":"weak", "class":"yes" },
    {"outlook":"rain", "temp":"mild", "humidity":"high", "wind":"strong", "class":"no" }]

import pandas as pd
pd.DataFrame(data)

	class	humidity	outlook	temp	wind
0	no	high	sunny	hot	weak
1	no	high	sunny	hot	strong
2	yes	high	overcast	hot	weak
3	yes	high	rain	mild	weak
4	yes	normal	rain	cool	weak
5	no	normal	rain	cool	strong
6	yes	normal	overcast	cool	strong
7	no	high	sunny	mild	weak
8	yes	normal	sunny	cool	weak
9	yes	normal	rain	mild	weak
10	yes	normal	sunny	mild	strong
11	yes	high	overcast	mild	strong
12	yes	normal	overcast	hot	weak
13	no	high	rain	mild	strong

test={"outlook":"sunny","temp":"cool","humidity":"high","wind":"strong"}

#Calculate the Prob. of class:cls

def P(data,cls_val,cls_name="class"):
    count = 0.0     
    for e in data:
        if e[cls_name] == cls_val:
            count += 1
    return count/len(data)

# The probability of play or not
PY, PN = P(data,"yes"), P(data, "no")
PY, PN

(0.6428571428571429, 0.35714285714285715)

#Calculate the Prob(attr|cls)
def PT(data,cls_val,attr_name,attr_val,cls_name="class"):
    count1 = 0.0
    count2 = 0.0
    for e in data:
        if e[cls_name] == cls_val:
            count1 += 1
            if e[attr_name] == attr_val:
                count2 += 1
    return count2/count1

# The conditional probability of play or not
PT(data,"yes", "outlook", "sunny"), PT(data,"no", "outlook", "sunny")

(0.2222222222222222, 0.6)
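As a cross-check on the two values above, here is an alternative sketch that tallies all counts in a single pass with `collections.Counter` and reports exact fractions. The names `rows`, `attrs`, `class_count`, `pair_count`, and `cond_prob` are introduced here for illustration and are not part of the original code; the rows repeat the PlayTennis table in compact form.

```python
from collections import Counter
from fractions import Fraction

# The 14 PlayTennis rows, written compactly as
# (outlook, temp, humidity, wind, class).
rows = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
attrs = ["outlook", "temp", "humidity", "wind"]

# Count samples per class and per (attribute, value, class) triple.
class_count = Counter(r[-1] for r in rows)
pair_count = Counter()
for r in rows:
    for name, value in zip(attrs, r):   # zip stops before the class column
        pair_count[(name, value, r[-1])] += 1

def cond_prob(name, value, cls):
    """Exact P(attribute = value | class = cls) as a Fraction."""
    return Fraction(pair_count[(name, value, cls)], class_count[cls])

print(cond_prob("outlook", "sunny", "yes"), cond_prob("outlook", "sunny", "no"))
# prints: 2/9 3/5
```

The fractions 2/9 and 3/5 agree with the floating-point results 0.2222... and 0.6 returned by PT.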

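#Calculate the NB
def NB(data,test,cls_y,cls_n):
    # Start from the class priors, then multiply in P(attr|cls) for each test attribute.
    # Single-argument print() with % formatting runs on both Python 2 and Python 3.
    # On Python 3 the attributes are visited in insertion order and floats print
    # with more digits, so the intermediate log lines differ from a Python 2 run.
    PY = P(data,cls_y)
    PN = P(data,cls_n)
    print('The probability of play or not: %s vs. %s' % (PY, PN))
    for key,val in test.items():
        PY *= PT(data,cls_y,key,val)
        PN *= PT(data,cls_n,key,val)
        print('%s %s -->play or not:--> %s %s' % (key, val, PY, PN))
    return {cls_y:PY,cls_n:PN}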

#calculate     
NB(data,test,"yes","no")

The probability of play or not: 0.642857142857 vs. 0.357142857143
outlook sunny -->play or not:--> 0.142857142857 0.214285714286
wind strong -->play or not:--> 0.047619047619 0.128571428571
temp cool -->play or not:--> 0.015873015873 0.0257142857143
humidity high -->play or not:--> 0.00529100529101 0.0205714285714

{'no': 0.020571428571428574, 'yes': 0.005291005291005291}
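The two returned values are unnormalized scores P(C)∏P(X_i|C), not probabilities. As a small sketch, they can be turned into posterior probabilities by dividing by their sum, and the predicted class is the key with the larger score. The variable names below are illustrative; the two numbers are copied from the result above.

```python
# Scores copied from the NB(...) result above.
scores = {"no": 0.020571428571428574, "yes": 0.005291005291005291}

# Dividing by the sum of the scores recovers P(C|X) for each class.
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}

# The predicted class is the one with the largest score.
prediction = max(scores, key=scores.get)

print(prediction, {c: round(p, 3) for c, p in posterior.items()})
```

For this test sample the prediction is "no", with a posterior of roughly 0.795 for not playing.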

#calculate  
NB(data,{"outlook":"sunny","temp":"hot","humidity":"normal","wind":"weak"},"yes","no")

The probability of play or not: 0.642857142857 vs. 0.357142857143
outlook sunny -->play or not:--> 0.142857142857 0.214285714286
wind weak -->play or not:--> 0.0952380952381 0.0857142857143
temp hot -->play or not:--> 0.021164021164 0.0342857142857
humidity normal -->play or not:--> 0.0141093474427 0.00685714285714

{'no': 0.006857142857142858, 'yes': 0.014109347442680775}

Note

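¹ The following content is from 【数说工作室】金融数据挖掘之朴素贝叶斯 ("Naive Bayes for financial data mining"), http://www.ppvke.com/Blog/archives/6431
² 周志华 (Zhou Zhihua), 机器学习 (Machine Learning), 2016, pp. 150-151
³ Mitchell, Machine Learning, http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml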

Implementing the naive Bayes algorithm in Python

王成军
