Pandas groupby apply vs transform with specific functions

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/51079543/


Pandas groupby apply vs transform with specific functions

python, pandas, dataframe, pandas-groupby

Asked by jpp

I don't understand which functions are acceptable for groupby + transform operations. Often, I end up just guessing, testing, reverting until something works, but I feel there should be a systematic way of determining whether a solution will work.


Here's a minimal example. First, let's use groupby + apply with set:


import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b':[1,2,3,1,2,3,3], 'type':[1,0,1,0,1,0,1]})

g = df.groupby(['a', 'b'])['type'].apply(set)

print(g)

a  b
1  1    {0, 1}
2  2    {0, 1}
3  3    {0, 1}

This works fine, but I want the resulting set, calculated groupwise, in a new column of the original dataframe. So I try and use transform:


df['g'] = df.groupby(['a', 'b'])['type'].transform(set)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
---> 23 df['g'] = df.groupby(['a', 'b'])['type'].transform(set)

TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'

This is the error I see in Pandas v0.19.0. In v0.23.0, I see TypeError: 'set' type is unordered. Of course, I can map a specifically defined index to achieve my result:


g = df.groupby(['a', 'b'])['type'].apply(set)
df['g'] = df.set_index(['a', 'b']).index.map(g.get)

print(df)

   a  b  type       g
0  1  1     1  {0, 1}
1  2  2     0  {0, 1}
2  3  3     1  {0, 1}
3  1  1     0  {0, 1}
4  2  2     1  {0, 1}
5  3  3     0  {0, 1}
6  3  3     1  {0, 1}

But I thought the benefit of transform was to avoid such an explicit mapping. Where did I go wrong?


Answered by rafaelc

I believe, in the first place, that there is some room for intuition in using these functions as they can be very meaningful.


In your first result, you are not actually trying to transform your values, but rather to aggregate them (which would work in the way you intended).

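For instance, a quick sketch (not part of the original answer, and assuming a pandas version whose agg accepts arbitrary callables such as set) of aggregating instead of transforming:

agg_sets = df.groupby(['a', 'b'])['type'].agg(set)
print(agg_sets)

a  b
1  1    {0, 1}
2  2    {0, 1}
3  3    {0, 1}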

But getting into code, the transform docs are quite suggestive in saying that


Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.


When you do


df.groupby(['a', 'b'])['type'].transform(some_func)

You are actually transforming each pd.Series object from each group into a new object using your some_func function. But the thing is, this new object should have the same size as the group OR be broadcastable to the size of the chunk.

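For instance, here is a minimal sketch (the lambda is only an illustrative stand-in for some_func) of a function whose result has exactly the same size as each group, so the values can be assigned straight back:

same_size = df.groupby(['a', 'b'])['type'].transform(lambda s: s * 10)
print(same_size.tolist())  # expected: [10, 0, 10, 0, 10, 0, 10]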

Therefore, if you transform your series using tuple or list, you will basically be transforming the object


0    1
1    2
2    3
dtype: int64

into


[1,2,3]

But notice that these values are now assigned back to their respective indexes, and that is why you see no difference in the transform operation. The row that had the .iloc[0] value from the pd.Series will now have the [1,2,3][0] value from the transformed list (the same applies to tuple), and so on. Notice that ordering and size matter here, because otherwise you could mess up your groups and the transform wouldn't work (and this is exactly why set is not a proper function to use in this case).

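A small sketch of this round trip (not from the original answer): transforming with list hands each group's values back in their original order, so the column is effectively unchanged:

round_trip = df.groupby(['a', 'b'])['type'].transform(list)
print((round_trip == df['type']).all())  # expected: True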



The second part of the quoted text says "broadcastable to the size of the group chunk".


This means that you can also transform your pd.Series to an object that can be used in all rows. For example


df.groupby(['a', 'b'])['type'].transform(lambda k: 50)

would work. Why? Even though 50 is not iterable, it is broadcastable by using this value repeatedly in all positions of your initial pd.Series.

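A slightly more useful broadcast than the constant 50 (again, only a sketch): returning the group sum, a scalar, repeats that sum on every row of its group:

group_sum = df.groupby(['a', 'b'])['type'].transform(lambda s: s.sum())
print(group_sum.tolist())  # expected: [1, 1, 2, 1, 1, 2, 2]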



Why can you apply using set?


Because the apply method doesn't have this constraint of size in the result. It actually has three different result types, and it infers whether you want to expand, reduce or broadcast your results. Notice that you can't reduce when transforming (see the sketch below, after the quoted documentation).


By default (result_type=None), the final return type is inferred from the return type of the applied function. result_type : {'expand', 'reduce', 'broadcast', None}, default None. These only act when axis=1 (columns):

  1. 'expand' : list-like results will be turned into columns.

  2. 'reduce' : returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.

  3. 'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

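As a sketch of the "can't reduce when transforming" point (not from the original answer): the same group sum that broadcasts under transform collapses to one row per group under apply:

print(df.groupby(['a', 'b'])['type'].apply(lambda s: s.sum()))

a  b
1  1    1
2  2    1
3  3    2
Name: type, dtype: int64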

Answered by igrinis

The result of the transformation is restricted to certain types. [For example, it can't be list, set, Series, etc. -- This is incorrect, thank you @RafaelC for the comment.] I don't think this is documented, but when examining the source code of groupby.py and series.py you can find those type restrictions.


From the groupby documentation


The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:

  • Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
  • Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.

  • Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).

  • (Optionally) operates on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.

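A minimal sketch of a transform that follows these rules, adapted from the lambda x: x.iloc[-1] example in the quoted documentation: the last value of each group is a scalar, so it is broadcast to the size of the group chunk:

last_in_group = df.groupby(['a', 'b'])['type'].transform(lambda x: x.iloc[-1])
print(last_in_group.tolist())  # expected: [0, 1, 1, 0, 1, 1, 1]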

Disclaimer: I got a different error (pandas version 0.23.1):


df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
File "***/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3661, in transform
s = klass(res, indexer)
File "***/lib/python3.6/site-packages/pandas/core/series.py", line 242, in __init__
"".format(data.__class__.__name__))
TypeError: 'set' type is unordered


Update


After transforming the group into a set, pandas can't broadcast it to the Series, because it is unordered (and has different dimensions than the group chunk). If we force it into a list, it will become the same size as the group chunk, and we would get only a single value per row. The answer is to wrap it in some container, so that the resulting size of the object becomes 1, and then pandas will be able to broadcast it:


import numpy as np

df['g'] = df.groupby(['a', 'b'])['type'].transform(lambda x: np.array(set(x)))
print(df)

   a  b  type       g
0  1  1     1  {0, 1}
1  2  2     0  {0, 1}
2  3  3     1  {0, 1}
3  1  1     0  {0, 1}
4  2  2     1  {0, 1}
5  3  3     0  {0, 1}
6  3  3     1  {0, 1}

Why did I choose np.array as a container? Because series.py (lines 205-206) passes this type without further checks. So I believe this behavior will be preserved in future versions.
