Pandas - Conditional Probability of a given specific b
Source: http://stackoverflow.com/questions/33468976/
License notice: this page reproduces a Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Asked by Hamid K
I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?
df.groupby('a').groupby('b')
does not work. Let's assume I have 3 categories in column a, and for each of them I have 5 categories of b. What I need is the total number of rows in each class of b for each class of a. I tried the apply command, but I don't think I know how to use it properly:
df.groupby('a').apply(lambda x: x[x['b']] == '...').count()
Answered by maxymoo
To find the total count of each class of b for each class of a, you would do:
df.groupby('a').b.value_counts()
For example, create a DataFrame as below:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),  # random filler, not used in the probabilities
    'D': np.random.randn(8),
})
     A      B         C         D
0  foo    one -1.565185 -0.465763
1  bar    one  2.499516 -0.941229
2  foo    two -0.091160  0.689009
3  bar  three  1.358780 -0.062026
4  foo    two -0.800881 -0.341930
5  bar    two -0.236498  0.198686
6  foo    one -0.590498  0.281307
7  foo  three -1.423079  0.424715
Then:
df.groupby('A')['B'].value_counts()
A
bar  one      1
     two      1
     three    1
foo  one      2
     two      2
     three    1
To convert this to a conditional probability, you need to divide by the total size of each group.
You can either do it with another groupby:
df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()
A
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64
Or you can apply a lambda function to the groups:
df.groupby('a').b.apply(lambda g: g.value_counts()/len(g))
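As a quick check of the two approaches above, here is a minimal sketch. It assumes a deterministic copy of the example frame with only columns A and B (the random columns are dropped for reproducibility). It also uses value_counts(normalize=True), which the answer does not mention but which should give the same proportions in one call.

```python
import pandas as pd

# Deterministic stand-in for the example frame (assumption: C and D are not needed).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# Approach from the answer: per-group counts divided by per-group sizes.
by_division = df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()

# Shortcut: let value_counts normalise within each group.
by_normalize = df.groupby('A')['B'].value_counts(normalize=True)

# Both hold P(B | A); single entries are looked up by an (A, B) label pair.
p_one_given_foo = by_normalize.loc[('foo', 'one')]
p_two_given_bar = by_division.loc[('bar', 'two')]
print(p_one_given_foo, p_two_given_bar)
```

Looking up a single (A, B) pair this way avoids scanning the printed table by eye.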
Answered by cggarvey
You can pass in a list to groupby:
df.groupby(['a','b']).count()
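One caveat worth checking: on a frame that has only the columns a and b, count() has no remaining column to count, so size() may be the more convenient call. The sketch below assumes a small illustrative frame (not the asker's data) and shows one way to turn the joint counts into P(a | b).

```python
import pandas as pd

# Illustrative data only; the asker's frame is not shown in the question.
df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
    'b': ['1', '2', '3', '4', '5', '1', '2', '3'],
})

# size() counts the rows of each (a, b) pair and needs no extra column.
joint_counts = df.groupby(['a', 'b']).size()

# Total number of rows for each value of b, repeated for every (a, b) pair.
b_totals = joint_counts.groupby(level='b').transform('sum')

# P(a | b) = count(a, b) / count(b)
p_a_given_b = joint_counts / b_totals
print(p_a_given_b)
```

Within each value of b the resulting probabilities should sum to 1, which is an easy sanity check.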
Answered by Okry Dokry
Answer:
This can be done with the pandas crosstab function. Given the problem description, where the DataFrame is called 'df' and has columns 'a' and 'b':
pd.crosstab(df.a, df.b, normalize='columns')
This returns a DataFrame representing P(a | b).
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Explanation:
Consider the DataFrame:
df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})
Looking at columns a and b
df[["a", "b"]]
We have
a b
0 x 1
1 x 2
2 x 3
3 y 4
4 y 5
5 y 1
6 y 2
7 z 3
Then
pd.crosstab(df.a, df.b)
returns the frequency table of df.a and df.b with the rows being values of df.a and the columns being values of df.b
b 1 2 3 4 5
a
x 1 1 1 0 0
y 1 1 0 1 1
z 0 0 1 0 0
We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)
pd.crosstab(df.a, df.b, normalize='columns')
This normalizes over each column. In our case it returns a DataFrame whose columns hold the conditional probabilities P(a | b=B) for specific values of B:
b 1 2 3 4 5
a
x 0.5 0.5 0.5 0.0 0.0
y 0.5 0.5 0.0 1.0 1.0
z 0.0 0.0 0.5 0.0 0.0
Notice that each column sums to 1.
If we would instead prefer P(b | a), we could normalize over the rows:
pd.crosstab(df.a, df.b, normalize='index')
To get
b 1 2 3 4 5
a
x 0.333333 0.333333 0.333333 0.00 0.00
y 0.250000 0.250000 0.000000 0.25 0.25
z 0.000000 0.000000 1.000000 0.00 0.00
Here the rows hold the conditional probabilities P(b | a=A) for specific values of A. Notice that each row sums to 1.
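To pull out the probability for one specific value, the normalized tables can be indexed by label. Here is a small sketch using the frame from this answer. Note that it spells row-wise normalization as normalize='index', the value listed in the crosstab documentation, and that the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
    'b': ['1', '2', '3', '4', '5', '1', '2', '3'],
})

# Columns hold P(a | b = B): each column sums to 1.
p_a_given_b = pd.crosstab(df['a'], df['b'], normalize='columns')

# Rows hold P(b | a = A): each row sums to 1.
p_b_given_a = pd.crosstab(df['a'], df['b'], normalize='index')

# Full distribution of a for one specific b, then two single entries.
dist_a_given_b3 = p_a_given_b['3']
p_x_given_b1 = p_a_given_b.loc['x', '1']
p_b1_given_y = p_b_given_a.loc['y', '1']

print(dist_a_given_b3)
print(p_x_given_b1, p_b1_given_y)
```

Selecting a column of the column-normalized table gives the whole distribution of a for that b, which is what the question asks for.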
Answered by Carlos H Zelada
You could try this function:
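import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Cross-tabulate the two Series; by default pd2 gives the rows and pd1 the columns.
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    # Divide each column by its total to get P(row | column).
    cnames = table.columns.values
    weights = 1 / table[cnames].sum()
    out = table * weights
    # Marginal probability of each column value (computed but not returned).
    pc = table[cnames].sum() / table[cnames].sum().sum()
    # Marginal probability of each row value, appended as column 'p'.
    table = table.transpose()
    cnames = table.columns.values
    p = table[cnames].sum() / table[cnames].sum().sum()
    out['p'] = p
    return out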
This returns the conditional probability P(row | column).
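The argument order is easy to mix up, so here is a rough usage sketch. It uses a condensed copy of the function (the unused intermediate is dropped, a simplification made here) and the small a/b frame from the previous answer as assumed input.

```python
import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Condensed copy of the function above: rows come from pd2 by default.
    table = pd.crosstab(pd1, pd2) if transpose == 0 else pd.crosstab(pd2, pd1)
    out = table * (1 / table.sum())                       # each column divided by its total
    out['p'] = table.sum(axis=1) / table.to_numpy().sum()  # marginal of each row label
    return out

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
    'b': ['1', '2', '3', '4', '5', '1', '2', '3'],
})

# With the default transpose=1, the first argument is the conditioning variable:
# rows are values of a, columns are values of b, so entries are P(a | b).
result = conprob(df['b'], df['a'])
print(result)
```

With this call order, result.loc['x', '1'] reads as P(a=x | b=1), and the extra column p holds the unconditional P(a).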
Answered by Hamid K
Consider the DataFrame that Maxymoo suggested:
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})
df
     A      B         C         D
0  foo    one  0.229206 -1.899999
1  bar    one  0.174972  0.328746
2  foo    two -1.384699 -1.691151
3  bar  three -1.008328 -0.915467
4  foo    two -0.065298 -0.107240
5  bar    two  1.871916  0.798135
6  foo    one  1.589609 -1.682237
7  foo  three  2.292783  0.639595
Let's assume we want to calculate the probability of y = foo given x = one, where y is column A and x is column B: P(y=foo|x=one) = ?
Approach 1:
df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B
one    foo    0.666667
       bar    0.333333
three  foo    0.500000
       bar    0.500000
two    foo    0.666667
       bar    0.333333
dtype: float64
So the answer is: 0.6667
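If only the single number is needed, a boolean mask gets there without building the whole table. This is a minimal sketch, assuming a deterministic copy of the frame with only columns A and B.

```python
import pandas as pd

# Assumption: columns C and D are irrelevant here and are left out.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# Keep only the rows where x (column B) is 'one' ...
a_given_one = df.loc[df['B'] == 'one', 'A']

# ... and take the share of those rows where y (column A) is 'foo'.
p_foo_given_one = (a_given_one == 'foo').mean()
print(round(p_foo_given_one, 4))  # 0.6667
```

The mean of a boolean Series is the fraction of True values, so this is count(A=foo and B=one) / count(B=one).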
Approach 2:
Probability of x = one: 0.375
df['B'].value_counts()/df['B'].count()
one 0.375
two 0.375
three 0.250
dtype: float64
Probability of y = foo: 0.625
df['A'].value_counts()/df['A'].count()
foo 0.625
bar 0.375
dtype: float64
Probability of (x=one|y=foo): 0.4
df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64
Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
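The same Bayes computation can be checked in code. This is a minimal sketch under the same assumption (a deterministic frame with only columns A and B), and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

p_x_one = (df['B'] == 'one').mean()   # P(x=one) = 3/8
p_y_foo = (df['A'] == 'foo').mean()   # P(y=foo) = 5/8

# P(x=one | y=foo): share of the foo rows whose B is 'one' (2 of 5).
p_one_given_foo = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()

# Bayes' rule: P(y=foo | x=one) = P(x=one | y=foo) * P(y=foo) / P(x=one)
via_bayes = p_one_given_foo * p_y_foo / p_x_one

# Direct computation for comparison: share of the 'one' rows whose A is 'foo'.
direct = (df.loc[df['B'] == 'one', 'A'] == 'foo').mean()
print(via_bayes, direct)
```

Both routes should agree up to floating-point rounding, which matches the 0.6667 from Approach 1.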