Python Pandas Conditional Sum with Groupby
Note: this page is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/17266129/
Asked by AllenQ
Using sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','b','b','a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
1  0.069889  0.809772    a  two
2  1.468194  0.272929    b  one
3 -1.138458  0.865060    b  two
4 -0.268210  1.250340    a  one
I'm trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals 'one'.
Here's what I've tried
def f(d,a,b):
    d.ix[d[a] == b, 'data1'].sum()

df.groupby(['key1']).apply(f, a = 'key2', b = 'one').reset_index()
But this gives me a dataframe with 'None' values
index key1 0
0 a None
1 b None
Any ideas here? I'm looking for the Pandas equivalent of the following SQL:
SELECT Key1, SUM(CASE WHEN Key2 = 'one' then data1 else 0 end)
FROM df
GROUP BY key1
FYI - I've seen "conditional sums for pandas aggregate" but couldn't transform the answer provided there to work with sums rather than counts.
Thanks in advance
Accepted answer by Andy Hayden
First group by the key1 column:
In [11]: g = df.groupby('key1')
and then for each group take the subDataFrame where key2 equals 'one' and sum the data1 column:
In [12]: g.apply(lambda x: x[x['key2'] == 'one']['data1'].sum())
Out[12]:
key1
a 0.093391
b 1.468194
dtype: float64
To explain what's going on let's look at the 'a' group:
In [21]: a = g.get_group('a')
In [22]: a
Out[22]:
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
1  0.069889  0.809772    a  two
4 -0.268210  1.250340    a  one
In [23]: a[a['key2'] == 'one']
Out[23]:
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
4 -0.268210  1.250340    a  one
In [24]: a[a['key2'] == 'one']['data1']
Out[24]:
0 0.361601
4 -0.268210
Name: data1, dtype: float64
In [25]: a[a['key2'] == 'one']['data1'].sum()
Out[25]: 0.093391000000000002
It may be slightly easier/clearer to do this by first restricting the DataFrame to just the rows where key2 equals 'one':
In [31]: df1 = df[df['key2'] == 'one']
In [32]: df1
Out[32]:
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
2  1.468194  0.272929    b  one
4 -0.268210  1.250340    a  one
In [33]: df1.groupby('key1')['data1'].sum()
Out[33]:
key1
a 0.093391
b 1.468194
Name: data1, dtype: float64
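A side note on the attempt in the question: f computes the sum but never returns it, so apply collects None for every group. In addition, .ix was removed in pandas 1.0, and .loc is the usual replacement. Below is a minimal sketch of a corrected f. The fixed data1 values are made up for illustration (they are not the question's random numbers), and grouping by the key1 Series rather than the column name is just one way to keep the grouping column out of what f receives.

```python
import pandas as pd

# Illustrative data with fixed values (an assumption, not the question's random data).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

def f(d, a, b):
    # .loc replaces the removed .ix; the explicit return is the actual fix.
    return d.loc[d[a] == b, 'data1'].sum()

# Group the non-key columns by the key1 Series and apply f to each group.
result = df[['key2', 'data1']].groupby(df['key1']).apply(f, a='key2', b='one')
print(result)
```

With this data, group a sums rows 0 and 4, and group b sums row 2 only.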
Answer by Diego
I think that today with pandas 0.23 you can do this:
import numpy as np
df.assign(result = np.where(df['key2']=='one',df.data1,0))\
  .groupby('key1').agg({'result':sum})
The advantage of this is that you can apply it to more than one column of the same dataframe:
df.assign(
    result1 = np.where(df['key2']=='one',df.data1,0),
    result2 = np.where(df['key2']=='two',df.data1,0)
).groupby('key1').agg({'result1':sum, 'result2':sum})
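The same idea can probably be written with named aggregation (available since pandas 0.25) and the string 'sum' instead of the builtin sum, which recent pandas versions may warn about when it is passed to agg. This is a sketch on made-up fixed values rather than the question's random data; the variable name out is only for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values (an assumption, not the question's random data).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

out = (
    df.assign(
        # Keep data1 where the condition holds, otherwise contribute 0 to the sum.
        result1=np.where(df['key2'] == 'one', df['data1'], 0),
        result2=np.where(df['key2'] == 'two', df['data1'], 0),
    )
    .groupby('key1')
    # Named aggregation: output column = (input column, aggregation name).
    .agg(result1=('result1', 'sum'), result2=('result2', 'sum'))
)
print(out)
```

Because rows that fail the condition contribute 0 instead of being dropped, every key1 group appears in the output.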
Answer by jpp
You can filter your dataframe before you perform your groupby operation. If this removes entries from the resulting series index because all of a group's values were filtered out, you can use reindex with fillna:
res = df.loc[df['key2'].eq('one')]\
        .groupby('key1')['data1'].sum()\
        .reindex(df['key1'].unique()).fillna(0)
print(res)
key1
a 3.631610
b 0.978738
c 0.000000
Name: data1, dtype: float64
Setup
I have added an additional row for demonstration purposes.
np.random.seed(0)
df = pd.DataFrame({'key1': ['a','a','b','b','a','c'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two'],
                   'data1': np.random.randn(6),
                   'data2': np.random.randn(6)})
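
As a possible alternative to reindex with fillna, the non-matching rows can be zeroed out before grouping instead of being filtered away, so no group disappears in the first place. The sketch below reuses the setup above. The names masked and res are only for illustration, and Series.where is swapped in here in place of the filter used in the answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'c'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two'],
                   'data1': np.random.randn(6),
                   'data2': np.random.randn(6)})

# Keep data1 where key2 == 'one' and replace every other value with 0.
masked = df['data1'].where(df['key2'].eq('one'), 0)

# Group the masked values by key1; group 'c' survives with a sum of 0.
res = masked.groupby(df['key1']).sum()
print(res)
```

This mirrors the SQL CASE WHEN ... ELSE 0 END form from the question more directly than filtering first.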

