Original question: http://stackoverflow.com/questions/37715246/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
pandas groupby-apply behavior, returning a Series (inconsistent output type)
Asked by Victor Chubukov
I'm curious about the behavior of pandas groupby-apply when the apply function returns a series.
When the series are of different lengths, it returns a multi-indexed series.
In [1]: import pandas as pd
In [2]: df1=pd.DataFrame({'state':list("AABBB"),
...: 'city':list("vwxyz")})
In [3]: df1
Out[3]:
city state
0 v A
1 w A
2 x B
3 y B
4 z B
In [4]: def f(x):
...: return pd.Series(x['city'].values,index=range(len(x)))
...:
In [5]: df1.groupby('state').apply(f)
Out[5]:
state
A      0    v
       1    w
B      0    x
       1    y
       2    z
dtype: object
This returns a Series object.
However, if every series has the same length, then it pivots this into a DataFrame.
In [6]: df2=pd.DataFrame({'state':list("AAABBB"),
...: 'city':list("uvwxyz")})
In [7]: df2
Out[7]:
city state
0 u A
1 v A
2 w A
3 x B
4 y B
5 z B
In [8]: df2.groupby('state').apply(f)
Out[8]:
0 1 2
state
A u v w
B x y z
Is this really the intended behavior? Are we meant to check the return type if we use apply this way? Or is there an option in apply that I'm not appreciating?
In case you're curious, in my actual use case, the returned Series will be the same length as the group. It seems like an ideal case for transform, except that I've found that apply returning a Series is actually an order of magnitude faster on a large dataset. That can be another topic.
Edit: Loosely based on Parfait's answer, we can certainly do this:
X = df.groupby('state').apply(f)
if not isinstance(X, pd.Series):
    X = X.stack()
X
That will give the same output type for either df=df1 or df=df2. I guess I'm just asking if this is really the normal or preferred way to handle this.
Answered by Parfait
In essence, a dataframe consists of equal-length series (technically a dictionary container of Series objects). As stated in the pandas split-apply-combine docs, running a groupby() refers to one or more of the following:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Notice this does not state that a data frame is always produced, only a generalized data structure. So a groupby() operation can downcast to a Series or, if given a Series as input, can upcast to a dataframe.
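A small sketch of both directions, using built-in aggregations rather than apply. The sample frame and the `population` column are made up for illustration and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({'state': list("AABBB"),
                   'population': [1, 2, 3, 4, 5]})

# DataFrame in, Series out: one value per group.
sizes = df.groupby('state').size()

# Series in, DataFrame out: each aggregation becomes a column.
stats = df['population'].groupby(df['state']).agg(['min', 'max'])

print(type(sizes).__name__)
print(type(stats).__name__)
```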
For your first dataframe, the groups are of unequal size, so the series returned for each group have unequal index lengths and the "combine" step cannot assemble them into a data frame. Since a data frame cannot combine series of different lengths, it yields a multi-index series instead. You can see this by adding a print statement to the function: the state==A group has length 2 and the B group has length 3.
def f(x):
    print(x)
    return pd.Series(x['city'].values, index=range(len(x)))

s1 = df1.groupby('state').apply(f)
print(s1)
# city state
# 0 v A
# 1 w A
# city state
# 0 v A
# 1 w A
# city state
# 2 x B
# 3 y B
# 4 z B
# state
# A      0    v
#        1    w
# B      0    x
#        1    y
#        2    z
# dtype: object
However, you can manipulate the multi-index series outcome by resetting the index and thereby adjusting its hierarchical levels:
df = df1.groupby('state').apply(f).reset_index()
print(df)
# state level_1 0
# 0 A 0 v
# 1 A 1 w
# 2 B 0 x
# 3 B 1 y
# 4 B 2 z
But more relevant to your needs is unstack(), which pivots a level of the index labels, yielding a data frame. Consider fillna() to fill the None outcome.
df = df1.groupby('state').apply(f).unstack()
print(df)
# 0 1 2
# state
# A v w None
# B x y z
Answered by user3582076
Instead of using index=range(len(x)) in your function f, you can use index=x.index to prevent this undesired behavior.