Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/33468976/

Time: 2020-09-14 00:08:30  Source: igfitidea

Pandas - Conditional Probability of a given specific b

python pandas

Asked by Hamid K

I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?

df.groupby('a').groupby('b')

does not work. Let's assume column a has 3 categories, and for each of them there are 5 categories of b. What I need is the total count of each class of b for each class of a. I tried the apply command, but I don't think I know how to use it properly:

df.groupby('a').apply(lambda x: x[x['b']] == '...').count()

Answered by maxymoo

To find the total count of each class of b for each class of a, you would do:

df.groupby('a').b.value_counts()

For example, create a DataFrame as below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})

     A      B         C         D
0  foo    one -1.565185 -0.465763
1  bar    one  2.499516 -0.941229
2  foo    two -0.091160  0.689009
3  bar  three  1.358780 -0.062026
4  foo    two -0.800881 -0.341930
5  bar    two -0.236498  0.198686
6  foo    one -0.590498  0.281307
7  foo  three -1.423079  0.424715

Then:

df.groupby('A')['B'].value_counts()

A
bar  one      1
     two      1
     three    1
foo  one      2
     two      2
     three    1

To convert this to a conditional probability, you need to divide by the total size of each group.

You can either do it with another groupby:

df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()

A
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64

Or you can apply a lambda function to the groups:

df.groupby('a').b.apply(lambda g: g.value_counts()/len(g))
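
As a hedged alternative to dividing two groupby results, value_counts also accepts normalize=True, which returns relative frequencies within each group directly. A minimal sketch, assuming the same A and B columns as the example above (the variable name p_b_given_a is only illustrative):

```python
import pandas as pd

# Same A/B columns as the example above (C and D are not needed here).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# P(B | A): within each group of A, the relative frequency of each B value.
p_b_given_a = df.groupby('A')['B'].value_counts(normalize=True)
print(p_b_given_a)

# Look up a single conditional probability, here P(B='one' | A='foo').
print(p_b_given_a.loc[('foo', 'one')])
```

The result is indexed by (A, B) pairs, so one specific probability can be read off with a tuple lookup.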

Answered by cggarvey

You can pass in a list to groupby:

df.groupby(['a','b']).count()
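
One caveat worth sketching: count() tallies non-null values in the remaining columns, so on a frame that has only a and b there is nothing left to count. The sketch below uses size() instead and then turns the joint counts into P(a | b). The toy data and the names counts and p_a_given_b are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with only the two columns from the question.
df = pd.DataFrame({'a': ['x', 'x', 'y', 'y', 'y'],
                   'b': ['1', '2', '1', '1', '2']})

# Number of rows for each (a, b) pair.
counts = df.groupby(['a', 'b']).size()

# Divide each (a, b) count by the total count of its b value to get P(a | b).
p_a_given_b = counts / counts.groupby(level='b').transform('sum')
print(p_a_given_b)
```

For each value of b, the resulting probabilities over a add up to 1.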

Answered by Okry Dokry

Answer:

This can be done with the pandas crosstab function. Given the DataFrame described in the question, called 'df', with columns 'a' and 'b':

pd.crosstab(df.a, df.b, normalize='columns')

This returns a DataFrame representing P(a | b).

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

Explanation:

Consider the DataFrame:

df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b':['1', '2', '3', '4','5', '1', '2', '3']})

Looking at columns a and b

df[["a", "b"]]

We have

    a   b
0   x   1
1   x   2
2   x   3
3   y   4
4   y   5
5   y   1
6   y   2
7   z   3

Then

pd.crosstab(df.a, df.b)

returns the frequency table of df.a and df.b, with the values of df.a as rows and the values of df.b as columns:

b   1   2   3   4   5
a                   
x   1   1   1   0   0
y   1   1   0   1   1
z   0   0   1   0   0

We can instead use the normalize keyword to get the table of conditional probabilities P(a | b):

pd.crosstab(df.a, df.b, normalize='columns')

This normalizes each column by its total; in our case it returns a DataFrame whose columns hold the conditional probabilities P(a | b=B) for each specific value B of b:

b    1   2   3   4   5
a
x   0.5 0.5 0.5 0.0 0.0
y   0.5 0.5 0.0 1.0 1.0
z   0.0 0.0 0.5 0.0 0.0

Notice that each column sums to 1.

If we would instead prefer P(b | a), we can normalize over the rows:

pd.crosstab(df.a, df.b, normalize='index')

To get

b      1           2           3         4       5
a                   
x   0.333333    0.333333    0.333333    0.00    0.00
y   0.250000    0.250000    0.000000    0.25    0.25
z   0.000000    0.000000    1.000000    0.00    0.00

Here the rows represent the conditional probabilities P(b | a=A) for specific values of A. Notice that each row sums to 1.
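
To read one specific probability out of these tables, index by row and column label. The sketch below rebuilds both tables from the example data and checks that they are properly normalized. It passes normalize='index' for row normalization, which is the spelling listed in the crosstab documentation; the names p_a_given_b and p_b_given_a are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Column-normalized table: each column is the distribution P(a | b=B).
p_a_given_b = pd.crosstab(df['a'], df['b'], normalize='columns')
# Row-normalized table: each row is the distribution P(b | a=A).
p_b_given_a = pd.crosstab(df['a'], df['b'], normalize='index')

# P(a='x' | b='3'): row 'x', column '3' of the column-normalized table.
print(p_a_given_b.loc['x', '3'])
# P(b='3' | a='z'): row 'z', column '3' of the row-normalized table.
print(p_b_given_a.loc['z', '3'])

# Sanity checks: every column (or row) should sum to 1.
print(p_a_given_b.sum(axis=0))
print(p_b_given_a.sum(axis=1))
```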

Answered by Carlos H Zelada

You could try this function:

def conprob(pd1, pd2, transpose=1):
    # Cross-tabulate the two Series; by default pd2 gives the rows and pd1 the columns.
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    # Divide each column by its total to get P(row | column).
    out = table / table.sum()
    # Append the marginal probability of each row value as column 'p'.
    out['p'] = table.sum(axis=1) / table.sum().sum()
    return out

This returns the conditional probability P(row | column), plus a column 'p' holding the marginal probability of each row value.
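
A possible way to call it, sketched under assumptions: the block restates the function compactly so that it runs on its own, and applies it to the A and B columns from maxymoo's example. The name result is illustrative.

```python
import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Compact restatement of the function above, with the same default orientation.
    table = pd.crosstab(pd1, pd2) if transpose == 0 else pd.crosstab(pd2, pd1)
    out = table / table.sum()                         # P(row | column)
    out['p'] = table.sum(axis=1) / table.sum().sum()  # marginal P(row)
    return out

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# With the default transpose=1, rows come from the second argument (A)
# and columns from the first (B), so each cell is P(A=row | B=column).
result = conprob(df['B'], df['A'])
print(result)

# P(A='foo' | B='one') and the marginal P(A='foo').
print(result.loc['foo', 'one'], result.loc['foo', 'p'])
```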

Answered by Hamid K

Consider the DataFrame that maxymoo suggested:

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})

df
     A      B         C         D
0  foo    one  0.229206 -1.899999
1  bar    one  0.174972  0.328746
2  foo    two -1.384699 -1.691151
3  bar  three -1.008328 -0.915467
4  foo    two -0.065298 -0.107240
5  bar    two  1.871916  0.798135
6  foo    one  1.589609 -1.682237
7  foo  three  2.292783  0.639595

Let's assume we want to calculate the probability of y = foo given x = one, where y is column A and x is column B: P(y=foo|x=one) = ?

Approach 1:

df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B         
one    foo    0.666667
       bar    0.333333
three  foo    0.500000
       bar    0.500000
two    foo    0.666667
       bar    0.333333
dtype: float64

So the answer is 0.6667.

Approach 2:

Probability of x = one: 0.375

df['B'].value_counts()/df['B'].count()
one      0.375
two      0.375
three    0.250
dtype: float64

Probability of y = foo: 0.625

df['A'].value_counts()/df['A'].count()
foo    0.625
bar    0.375
dtype: float64

Probability of (x=one|y=foo): 0.4

df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A         
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64

Therefore, by Bayes' theorem: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
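
The agreement between the two approaches can be checked numerically. This is a small sketch using boolean masks rather than groupby, on the same A and B columns; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# Marginals: the mean of a boolean mask is the fraction of matching rows.
p_one = (df['B'] == 'one').mean()   # P(x=one)
p_foo = (df['A'] == 'foo').mean()   # P(y=foo)

# P(x=one | y=foo): restrict to rows with A == 'foo', then take the share with B == 'one'.
p_one_given_foo = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()

# Approach 2: Bayes' theorem.
bayes = p_one_given_foo * p_foo / p_one

# Approach 1: direct conditional frequency P(y=foo | x=one).
direct = (df.loc[df['B'] == 'one', 'A'] == 'foo').mean()

print(bayes, direct)
```

Both values come out at about 0.6667, up to floating-point rounding.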