Pandas:条件组特定计算

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/24250832/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:10:10  来源:igfitidea点击:

Pandas: conditional group-specific computations

pythonpandas

提问by kpax

Let's say I have a table with a key (e.g. customer ID) and two numeric columns C1 and C2. I would like to group rows by the key (customer) and run some aggregators like sum and mean on its columns. After computing group aggregators I would like to assign the results back to each customer row in a DataFrame (as some customer-wide features added to each row).

假设我有一个带有键(例如客户 ID)和两个数字列 C1 和 C2 的表。我想按键(客户)对行进行分组,并在其列上运行一些聚合器,如 sum 和 mean。在计算组聚合器之后,我想将结果分配回 DataFrame 中的每个客户行(因为一些客户范围的功能添加到每一行)。

I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.

我可以看到我可以做一些事情,
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
如果我只想聚合一列并且能够将结果添加回 DataFrame。

Can I make it conditional - can I add up C1 column in a group only for rows whose C2 column is equal to some number X and still be able to add results back to the DataFrame?

我可以将其设置为有条件的 - 我可以只为 C2 列等于某个数字 X 的行在组中添加 C1 列,并且仍然能够将结果添加回 DataFrame 吗?

How can I run aggregator on a combination of rows like:
np.sum(C1 + C2)?

如何在行组合上运行聚合器,例如:
np.sum(C1 + C2)?

What would be the simplest and most elegant way to implement it? What is the most efficient way to do it? Can those aggregations be done in a one path?

实现它的最简单和最优雅的方法是什么?最有效的方法是什么?这些聚合可以在一条路径上完成吗?

Thank you in advance.

先感谢您。

采纳答案by chrisb

Here's some setup of some dummy data.

这是一些虚拟数据的一些设置。

In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'], 
                            'C1': [1,2,3,4,5,6],  
                            'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)

In [83]: df
Out[83]: 
   C1  C2 Key  F1
0   1   7   a   3
1   2   8   a   3
2   3   9   b   7
3   4  10   b   7
4   5  11   c  11
5   6  12   c  11

If you want to do a conditional GroupBy, you can just filter the dataframe as it's passed to .groubpy. For example, if you wanted the group sum of 'C1' if C2 is less than 8 or greater than 9.

如果你想做一个有条件的 GroupBy,你可以在数据帧传递给.groubpy. 例如,如果您想要 C2 小于 8 或大于 9 的 'C1' 的组总和。

In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)

In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)

In [89]: df
Out[89]: 
   C1  C2 Key  F1  F2
0   1   7   a   3   1
1   2   8   a   3 NaN
2   3   9   b   7 NaN
3   4  10   b   7   4
4   5  11   c  11  11
5   6  12   c  11  11

This works because the transformoperation preserves the index, so it will still align with the original dataframe correctly.

这是有效的,因为该transform操作保留了索引,因此它仍然会与原始数据帧正确对齐。

If you want to sum the group totals for two columns, probably easiest to do something like this? Someone may have something more clever.

如果你想对两列的组总数求和,可能最容易做这样的事情?有人可能有更聪明的东西。

In [93]: gb = df.groupby('Key')

In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)

Edit: Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but may be more convenient for a large number of a columns.

编辑:这是获取多列的组总数的另一种方法。语法并不是更简洁,但对于大量列可能更方便。

df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))

回答by kpax

I found another approach that uses apply() instead of transform(), but you need to join the result table with the input DataFrame and I just haven't figured out yet how to do it. Would appreciate help to finish the table joining part or any better alternatives.

我发现了另一种使用 apply() 而不是 transform() 的方法,但是您需要将结果表与输入 DataFrame 连接起来,我只是还没有想出如何去做。将不胜感激帮助完成表连接部分或任何更好的替代品。

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                            'C1': [1,2,3,4,5,6],
                            'C2': [7,8,9,10,11,12]})

# Group g will be given as a DataFrame
def group_feature_extractor(g):
    feature_1 = (g['C1'] + g['C2']).sum()

    even_C1_filter = g['C1'] % 2 == 0
    feature_2 = g[even_C1_filter]['C2'].sum()

    return pd.Series([feature_1, feature_2], index = ['F1', 'F2'])

# Group once
group = df.groupby(['Key'])

# Extract features from each group
group_features = group.apply(group_feature_extractor)

#
# Join with the input data frame ...
#
df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                            'C1': [1,2,3,4,5,6],
                            'C2': [7,8,9,10,11,12]})

# Group g will be given as a DataFrame
def group_feature_extractor(g):
    feature_1 = (g['C1'] + g['C2']).sum()

    even_C1_filter = g['C1'] % 2 == 0
    feature_2 = g[even_C1_filter]['C2'].sum()

    return pd.Series([feature_1, feature_2], index = ['F1', 'F2'])

# Group once
group = df.groupby(['Key'])

# Extract features from each group
group_features = group.apply(group_feature_extractor)

#
# Join with the input data frame ...
#