Pandas groupby apply vs transform with specific functions
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, note the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/51079543/
Asked by jpp
I don't understand which functions are acceptable for groupby + transform operations. Often, I end up just guessing, testing, reverting until something works, but I feel there should be a systematic way of determining whether a solution will work.
Here's a minimal example. First let's use groupby + apply with set:
df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b':[1,2,3,1,2,3,3], 'type':[1,0,1,0,1,0,1]})
g = df.groupby(['a', 'b'])['type'].apply(set)
print(g)
a b
1 1 {0, 1}
2 2 {0, 1}
3 3 {0, 1}
This works fine, but I want the resulting set calculated groupwise in a new column of the original dataframe. So I try and use transform:
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
---> 23 df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'
This is the error I see in Pandas v0.19.0. In v0.23.0, I see TypeError: 'set' type is unordered. Of course, I can map a specifically defined index to achieve my result:
g = df.groupby(['a', 'b'])['type'].apply(set)
df['g'] = df.set_index(['a', 'b']).index.map(g.get)
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
But I thought the benefit of transform was to avoid such an explicit mapping. Where did I go wrong?
Answered by rafaelc
I believe, in the first place, that there is some room for intuition in using these functions as they can be very meaningful.
In your first result, you are not actually trying to transform your values, but rather to aggregate them (which would work in the way you intended).
But getting into code, the transform docs are quite suggestive in saying that
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.
When you do
df.groupby(['a', 'b'])['type'].transform(some_func)
You are actually transforming each pd.Series object from each group into a new object using your some_func function. But the thing is, this new object should have the same size as the group OR be broadcastable to the size of the chunk.
Therefore, if you transform your series using tuple or list, you will basically be transforming the object
0 1
1 2
2 3
dtype: int64
into
[1,2,3]
But notice that these values are now assigned back to their respective indexes and that is why you see no difference in the transform operation. The row that had the .iloc[0] value from the pd.Series will now have the [1,2,3][0] value from the transformed list (the same would apply to a tuple), etc. Notice that ordering and size matter here, because otherwise you could mess up your groups and the transform wouldn't work (and this is exactly why set is not a proper function to use in this case).
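A minimal sketch of that point (my illustration, relying on the element-wise write-back described above; not from the original answer): a returned list of the same length is written back position by position, so transform(list) reproduces the original column, while reordering the values inside each group moves them onto different rows.
same = df.groupby(['a', 'b'])['type'].transform(lambda s: list(s))
reordered = df.groupby(['a', 'b'])['type'].transform(lambda s: sorted(s, reverse=True))
print((same == df['type']).all())  # True: each value lands back on its original row
print(reordered.tolist())          # values have been rearranged within each group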
The second part of the quoted text says "broadcastable to the size of the group chunk".
This means that you can also transform your pd.Series to an object that can be used in all rows. For example
df.groupby(['a', 'b'])['type'].transform(lambda k: 50)
would work. Why? Even though 50 is not iterable, it is broadcastable by using this value repeatedly in all positions of your initial pd.Series.
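The constant 50 is only a toy; the same broadcasting is what makes scalar aggregations per group useful. A short sketch (my illustration, not from the original answer), reusing the question's df:
group_sum = df.groupby(['a', 'b'])['type'].transform('sum')
print(group_sum.tolist())  # [1, 1, 2, 1, 1, 2, 2]: each group's sum repeated on that group's rows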
Why can you apply using set?
Because the apply method doesn't have this constraint of size on the result. It actually has three different result types, and it infers whether you want to expand, reduce or broadcast your results. Notice that you can't reduce in transforming*
By default (result_type=None), the final return type is inferred from the return type of the applied function. result_type : {'expand', 'reduce', 'broadcast', None}, default None. These only act when axis=1 (columns):
'expand' : list-like results will be turned into columns.
'reduce' : returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.
'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
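In the groupby setting the contrast is easy to see. A quick sketch (my addition, using the question's df): apply is free to reduce to one entry per group, while transform always hands back one entry per original row.
reduced = df.groupby(['a', 'b'])['type'].apply(set)        # one entry per group
per_row = df.groupby(['a', 'b'])['type'].transform('max')  # one entry per original row
print(reduced.shape, per_row.shape)  # (3,) (7,)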
Answered by igrinis
The result of the transformation is restricted to certain types. [For example it can't be list, set, Series etc. -- This is incorrect, thank you @RafaelC for the comment] I don't think this is documented, but when examining the source code of groupby.py and series.py you can find those type restrictions.
From the groupby documentation:
The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:
- Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
- Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.
- Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).
- (Optionally) operate on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.
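As a short sketch of functions that satisfy those rules (my illustration; the iloc[-1] lambda comes from the quoted documentation):
last_in_group = df.groupby(['a', 'b'])['type'].transform(lambda x: x.iloc[-1])  # scalar result, broadcast over the group chunk
demeaned = df.groupby(['a', 'b'])['type'].transform(lambda x: x - x.mean())     # result the same size as the group chunk
print(last_in_group.tolist())
print(demeaned.tolist())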
Disclaimer: I got a different error (pandas version 0.23.1):
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
File "***/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3661, in transform
s = klass(res, indexer)
File "***/lib/python3.6/site-packages/pandas/core/series.py", line 242, in __init__
"".format(data.__class__.__name__))
TypeError: 'set' type is unordered
Update
After transforming the group into a set, pandas can't broadcast it to the Series, because it is unordered (and has different dimensions than the group chunk). If we force it into a list, it will become the same size as the group chunk, and we get only a single value per row. The answer is to wrap it in some container, so the resulting size of the object becomes 1, and then pandas will be able to broadcast it:
df['g'] = df.groupby(['a', 'b'])['type'].transform(lambda x: np.array(set(x)))
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
Why did I choose np.array as a container? Because series.py (lines 205-206) passes this type through without further checks. So I believe this behavior will be preserved in future versions.
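As a quick check of why the wrapper has size 1 (my addition, assuming numpy is imported as np): numpy does not treat a set as a sequence, so wrapping it produces a 0-dimensional object array, which pandas can then broadcast over the group chunk.
wrapped = np.array({0, 1})
print(wrapped.ndim, wrapped.size, wrapped.dtype)  # 0 1 object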