Pandas - Conditional probability of a given specific b

Question

Asked by Hamid K

I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?

df.groupby('a').groupby('b')

does not work. Let's assume I have 3 categories in column a, and for each specific one I have 5 categories of b. What I need to do is find the total count of each class of b for each class of a. I tried the apply command, but I don't think I know how to use it properly.

df.groupby('a').apply(lambda x: x[x['b']] == '...').count()

Answer 1

Answered by maxymoo

To find the total count of each class of b for each class of a, you would do:

df.groupby('a').b.value_counts()

For example, create a DataFrame as below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

     A      B         C         D
0  foo    one -1.565185 -0.465763
1  bar    one  2.499516 -0.941229
2  foo    two -0.091160  0.689009
3  bar  three  1.358780 -0.062026
4  foo    two -0.800881 -0.341930
5  bar    two -0.236498  0.198686
6  foo    one -0.590498  0.281307
7  foo  three -1.423079  0.424715

Then:

df.groupby('A')['B'].value_counts()

A
bar  one      1
     two      1
     three    1
foo  one      2
     two      2
     three    1

To convert this to a conditional probability, you need to divide by the total size of each group.

You can either do it with another groupby:

df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()

A
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64
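As a hedged alternative, `value_counts` on a grouped Series accepts `normalize=True`, which should give the same per-group proportions without the explicit division. A minimal sketch on the same A and B columns (the random C and D columns are left out because they are not needed here; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# P(B | A): proportion of each B value within each group of A
cond = df.groupby('A')['B'].value_counts(normalize=True)

# Look up a single conditional probability, here P(B = 'one' | A = 'foo')
p_one_given_foo = cond.loc[('foo', 'one')]
print(p_one_given_foo)
```

The result is indexed by (A, B) pairs, so any single conditional probability can be read off with a tuple key.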

Or you can apply a lambda function to the groups:

df.groupby('a').b.apply(lambda g: g.value_counts()/len(g))

Answer 2

Answered by cggarvey

You can pass in a list to groupby:

df.groupby(['a','b']).count()
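One caveat worth checking: `count()` counts non-null values in the remaining columns, so on a frame that has only columns a and b it may return a result with no columns at all. A sketch that uses `size()` instead and then divides by the total for each b to get P(a | b); the sample data here is assumed for illustration:

```python
import pandas as pd

# Hypothetical sample data with only the two columns from the question
df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
    'b': ['1', '2', '3', '4', '5', '1', '2', '3'],
})

# Joint counts of every (a, b) pair that occurs
joint = df.groupby(['a', 'b']).size()

# Total count for each value of b, aligned back onto the (a, b) index
b_totals = joint.groupby(level='b').transform('sum')

# P(a | b) = count(a, b) / count(b)
p_a_given_b = joint / b_totals
print(p_a_given_b)
```

Pairs that never occur are simply absent from the result rather than listed with probability 0.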

Answer 3

Answered by Okry Dokry

Answer:

This can be done with the pandas crosstab function. Given the DataFrame from the problem description, called 'df', with columns 'a' and 'b':

pd.crosstab(df.a, df.b, normalize='columns')

returns a DataFrame representing P(a | b).

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

Explanation:

Consider the DataFrame:

df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b':['1', '2', '3', '4','5', '1', '2', '3']})

Looking at columns a and b

df[["a", "b"]]

We have

   a  b
0  x  1
1  x  2
2  x  3
3  y  4
4  y  5
5  y  1
6  y  2
7  z  3

Then

pd.crosstab(df.a, df.b)

returns the frequency table of df.a and df.b, with rows for the values of df.a and columns for the values of df.b:

b   1   2   3   4   5
a                   
x   1   1   1   0   0
y   1   1   0   1   1
z   0   0   1   0   0

We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)

pd.crosstab(df.a, df.b, normalize='columns')

This normalizes within each column, so in our case it returns a DataFrame whose columns hold the conditional probabilities P(a | b=B) for each specific value B:

b    1   2   3   4   5
a
x   0.5 0.5 0.5 0.0 0.0
y   0.5 0.5 0.0 1.0 1.0
z   0.0 0.0 0.5 0.0 0.0

Notice that the columns sum to 1.

If we would instead prefer P(b | a), we could normalize over the rows:

pd.crosstab(df.a, df.b, normalize='index')

To get

b         1         2         3     4     5
a
x  0.333333  0.333333  0.333333  0.00  0.00
y  0.250000  0.250000  0.000000  0.25  0.25
z  0.000000  0.000000  1.000000  0.00  0.00

Here the rows represent the conditional probabilities P(b | a=A) for specific values of A. Notice that the rows sum to 1.
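The two normalizations can be checked quickly. This sketch rebuilds the example frame, asserts that the columns or rows sum to 1, and reads one probability from each table. The variable names are illustrative only, and it uses 'index', the value pandas documents for row-wise normalization:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# P(a | b): each column is a distribution over a
p_a_given_b = pd.crosstab(df.a, df.b, normalize='columns')

# P(b | a): each row is a distribution over b
p_b_given_a = pd.crosstab(df.a, df.b, normalize='index')

# Sanity checks: the conditioning axis sums to 1
assert np.allclose(p_a_given_b.sum(axis=0), 1.0)
assert np.allclose(p_b_given_a.sum(axis=1), 1.0)

print(p_a_given_b.loc['x', '1'])  # P(a = x | b = 1)
print(p_b_given_a.loc['x', '1'])  # P(b = 1 | a = x)
```

Indexing with `.loc[row_label, column_label]` picks out a single conditional probability from either table.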

Answer 4

Answered by Carlos H Zelada

You could try this function:

import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Rows of the crosstab are the event, columns are the condition.
    # With the default transpose=1, pd2 supplies the rows and pd1 the columns.
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    # Divide each column by its total so that every column sums to 1.
    out = table / table.sum()
    # Marginal probability of each row value, stored in an extra column 'p'.
    out['p'] = table.sum(axis=1) / table.values.sum()
    return out

This returns the conditional probability P(row | column), plus the marginal probability of each row value in an extra column p.

Answer 5

Answered by Hamid K

Consider the DataFrame that Maxymoo suggested:

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})

df
     A      B         C         D
0  foo    one  0.229206 -1.899999
1  bar    one  0.174972  0.328746
2  foo    two -1.384699 -1.691151
3  bar  three -1.008328 -0.915467
4  foo    two -0.065298 -0.107240
5  bar    two  1.871916  0.798135
6  foo    one  1.589609 -1.682237
7  foo  three  2.292783  0.639595

Let's assume we want to calculate the probability of y = foo given x = one, where y is column A and x is column B: P(y=foo|x=one) = ?

Approach 1:

df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B         
one    foo    0.666667
       bar    0.333333
three  foo    0.500000
       bar    0.500000
two    foo    0.666667
       bar    0.333333
dtype: float64

So the answer is: 0.6667
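When only a single value such as P(A = foo | B = one) is needed, a boolean mask is a possible shortcut: keep the rows where B is one and take the mean of the indicator A == 'foo'. A sketch using only the A and B columns (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# Rows that satisfy the condition B == 'one'
given = df['B'] == 'one'

# Fraction of those rows where A == 'foo', i.e. P(A = foo | B = one)
p_foo_given_one = (df.loc[given, 'A'] == 'foo').mean()
print(round(p_foo_given_one, 4))
```

The mean of a boolean Series is the fraction of True values, which is why it works as a probability estimate here.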

Approach 2:

Probability of x = one: 0.375

df['B'].value_counts()/df['B'].count()
one      0.375
two      0.375
three    0.250
dtype: float64

Probability of y = foo: 0.625

df['A'].value_counts()/df['A'].count()
foo    0.625
bar    0.375
dtype: float64

Probability of (x=one|y=foo): 0.4

df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A         
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64

Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
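The Bayes' rule arithmetic can be reproduced in code and compared against the direct estimate. This is a sketch under the same naming as above (x is column B, y is column A); the variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

p_x = (df['B'] == 'one').mean()                                 # P(x = one) = 0.375
p_y = (df['A'] == 'foo').mean()                                 # P(y = foo) = 0.625
p_x_given_y = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()   # P(x = one | y = foo) = 0.4

# Bayes' rule: P(y | x) = P(x | y) * P(y) / P(x)
bayes = p_x_given_y * p_y / p_x

# Direct estimate from the rows where B == 'one'
direct = (df.loc[df['B'] == 'one', 'A'] == 'foo').mean()

print(round(bayes, 4), round(direct, 4))
```

Both routes give the same number because the marginals and conditionals are all estimated from the same counts.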
