Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/17266129/

Python Pandas Conditional Sum with Groupby

Tags: python, pandas, pandas-groupby

Asked by AllenQ

Using sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df

    data1        data2     key1  key2
0    0.361601    0.375297    a   one
1    0.069889    0.809772    a   two
2    1.468194    0.272929    b   one
3   -1.138458    0.865060    b   two
4   -0.268210    1.250340    a   one

I'm trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals 'one'.

Here's what I've tried:

def f(d,a,b):
    d.ix[d[a] == b, 'data1'].sum()

df.groupby(['key1']).apply(f, a = 'key2', b = 'one').reset_index()

But this gives me a DataFrame with 'None' values:

index   key1    0
0       a       None
1       b       None

Any ideas here? I'm looking for the Pandas equivalent of the following SQL:

SELECT Key1, SUM(CASE WHEN Key2 = 'one' then data1 else 0 end)
FROM df
GROUP BY key1

FYI - I've seen conditional sums for pandas aggregate but couldn't transform the answer provided there to work with sums rather than counts.

Thanks in advance

Accepted answer by Andy Hayden

First, group by the key1 column:

In [11]: g = df.groupby('key1')

and then, for each group, take the sub-DataFrame where key2 equals 'one' and sum its data1 column:

In [12]: g.apply(lambda x: x[x['key2'] == 'one']['data1'].sum())
Out[12]:
key1
a       0.093391
b       1.468194
dtype: float64

To explain what's going on, let's look at the 'a' group:

In [21]: a = g.get_group('a')

In [22]: a
Out[22]:
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
1  0.069889  0.809772    a  two
4 -0.268210  1.250340    a  one

In [23]: a[a['key2'] == 'one']
Out[23]:
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
4 -0.268210  1.250340    a  one

In [24]: a[a['key2'] == 'one']['data1']
Out[24]:
0    0.361601
4   -0.268210
Name: data1, dtype: float64

In [25]: a[a['key2'] == 'one']['data1'].sum()
Out[25]: 0.093391000000000002

It may be slightly easier/clearer to first restrict the DataFrame to just the rows where key2 equals 'one':

In [31]: df1 = df[df['key2'] == 'one']

In [32]: df1
Out[32]:
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
2  1.468194  0.272929    b  one
4 -0.268210  1.250340    a  one

In [33]: df1.groupby('key1')['data1'].sum()
Out[33]:
key1
a       0.093391
b       1.468194
Name: data1, dtype: float64
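
For reference, the attempt in the question returns None because f computes the sum but never returns it; `.ix` has also been removed from current pandas releases. A minimal sketch of a corrected version follows, assuming a recent pandas. The numeric values are made up (fixed instead of random) so the result is reproducible, and the names via_apply and via_filter are only illustrative:

```python
import pandas as pd

# Fixed illustrative values (not the random data from the question).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

def f(d, a, b):
    # Return the sum explicitly; without `return` the function yields None.
    return d.loc[d[a] == b, 'data1'].sum()

# Selecting the columns explicitly avoids relying on whether apply()
# passes the grouping column to f, which has varied across pandas versions.
via_apply = df.groupby('key1')[['key2', 'data1']].apply(f, a='key2', b='one')

# Filter-first version from the answer above, for comparison.
via_filter = df[df['key2'] == 'one'].groupby('key1')['data1'].sum()

print(via_apply)
print(via_filter)
```

Both series should hold the same per-group sums; the filter-first form is usually the simpler one to read.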

Answer by Diego

I think that today with pandas 0.23 you can do this:

import numpy as np

# Keep data1 where key2 == 'one', otherwise 0, then sum per key1 group.
df.assign(result=np.where(df['key2'] == 'one', df.data1, 0))\
  .groupby('key1').agg({'result': 'sum'})

The advantage of this is that you can apply it to more than one column of the same DataFrame:

df.assign(
    result1=np.where(df['key2'] == 'one', df.data1, 0),
    result2=np.where(df['key2'] == 'two', df.data1, 0)
).groupby('key1').agg({'result1': 'sum', 'result2': 'sum'})
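
One practical difference from filtering first: because non-matching rows are set to 0 rather than dropped, every group stays in the result, which mirrors the ELSE 0 branch of the SQL in the question. A small sketch of that behaviour, assuming made-up fixed values and an extra group 'c' that has no 'one' rows (masked and filtered are illustrative names):

```python
import numpy as np
import pandas as pd

# Fixed illustrative values; group 'c' has no rows with key2 == 'one'.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'c'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Zero out non-matching rows, then sum per group (like CASE WHEN ... ELSE 0).
masked = (df.assign(result=np.where(df['key2'] == 'one', df['data1'], 0))
            .groupby('key1')['result'].sum())

# Filtering first drops group 'c' entirely because none of its rows survive.
filtered = df[df['key2'] == 'one'].groupby('key1')['data1'].sum()

print(masked)    # index contains a, b and c
print(filtered)  # index contains only a and b
```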

Answer by jpp

You can filter your DataFrame before you perform your groupby operation. If filtering drops a group from the resulting series index because none of its rows match, you can use reindex with fillna:

res = df.loc[df['key2'].eq('one')]\
        .groupby('key1')['data1'].sum()\
        .reindex(df['key1'].unique()).fillna(0)

print(res)

key1
a    3.631610
b    0.978738
c    0.000000
Name: data1, dtype: float64

Setup

I have added an additional row for demonstration purposes.

np.random.seed(0)

df = pd.DataFrame({'key1': ['a','a','b','b','a','c'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two'],
                   'data1': np.random.randn(6),
                   'data2': np.random.randn(6)})
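
As a possible variant of the same idea, reindex also accepts a fill_value argument, so the missing groups can be filled in one step instead of chaining fillna. A minimal self-contained sketch, reusing the setup above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'c'],
                   'key2': ['one', 'two', 'one', 'two', 'one', 'two'],
                   'data1': np.random.randn(6),
                   'data2': np.random.randn(6)})

# Filter, aggregate, then restore every key1 value; groups with no
# matching rows (here 'c') get 0 instead of NaN.
res = (df.loc[df['key2'].eq('one')]
         .groupby('key1')['data1'].sum()
         .reindex(df['key1'].unique(), fill_value=0))

print(res)
```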