pandas groupby-apply behavior, returning a Series (inconsistent output type)

Question

Asked by Victor Chubukov

I'm curious about the behavior of pandas groupby-apply when the apply function returns a series.

When the series are of different lengths, it returns a multi-indexed series.

In [1]: import pandas as pd

In [2]: df1=pd.DataFrame({'state':list("AABBB"),
   ...:                 'city':list("vwxyz")})

In [3]: df1
Out[3]:
  city state
0    v     A
1    w     A
2    x     B
3    y     B
4    z     B

In [4]: def f(x):
   ...:         return pd.Series(x['city'].values,index=range(len(x)))
   ...:

In [5]: df1.groupby('state').apply(f)
Out[5]:
state
A      0    v
       1    w
B      0    x
       1    y
       2    z
dtype: object

This returns a Series object.

However, if every series has the same length, then it pivots this into a DataFrame.

In [6]: df2=pd.DataFrame({'state':list("AAABBB"),
   ...:                 'city':list("uvwxyz")})

In [7]: df2
Out[7]:
  city state
0    u     A
1    v     A
2    w     A
3    x     B
4    y     B
5    z     B

In [8]: df2.groupby('state').apply(f)
Out[8]:
       0  1  2
state
A      u  v  w
B      x  y  z

Is this really the intended behavior? Are we meant to check the return type if we use apply this way? Or is there an option in apply that I'm not appreciating?

In case you're curious, in my actual use case, the returned Series will be the same length as the group. It seems like an ideal case for transform, except that I've found that apply returning a Series is actually an order of magnitude faster on a large dataset. That can be another topic.

Edit: Loosely based on Parfait's answer, we can certainly do this:

X=df.groupby('state').apply(f)
if not isinstance(X,pd.Series):
    X=X.stack()
X

That will give the same output type for either df=df1 or df=df2. I guess I'm just asking if this is really the normal or preferred way to handle this.
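
For readers who want to try the check above end to end, here is a minimal self-contained sketch. The helper name apply_as_series is hypothetical (it is not part of pandas); it only wraps the isinstance test from the snippet above and assumes the same f, df1 and df2 as in the question. Newer pandas versions may emit deprecation warnings for this pattern.

```python
import pandas as pd

def f(x):
    # Same function as in the question: positional index within each group
    return pd.Series(x['city'].values, index=range(len(x)))

def apply_as_series(df):
    # Hypothetical helper: always hand back a Series, whatever apply() produced
    out = df.groupby('state').apply(f)
    if isinstance(out, pd.DataFrame):
        # Equal-length groups were pivoted into columns; stack them back
        out = out.stack()
    return out

df1 = pd.DataFrame({'state': list("AABBB"), 'city': list("vwxyz")})
df2 = pd.DataFrame({'state': list("AAABBB"), 'city': list("uvwxyz")})

s1 = apply_as_series(df1)  # unequal group sizes
s2 = apply_as_series(df2)  # equal group sizes
print(type(s1).__name__, type(s2).__name__)
```

Both results are indexed by state plus the position within the group, so downstream code does not need to branch on the type.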

Answer 1

Answered by Parfait

In essence, a dataframe consists of equal-length series (technically a dictionary container of Series objects). As stated in the pandas split-apply-combine docs, running a groupby() refers to one or more of the following:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

Notice this does not state that a data frame is always produced, only a generalized data structure. So a groupby() operation can downcast to a Series, or if given a Series as input, can upcast to a dataframe.

For your first dataframe, the groups have unequal sizes (and hence unequal index lengths), so the function returns series of different lengths, which the "combine" step cannot assemble into a data frame. Since a data frame cannot combine series of different lengths, it yields a multi-index series instead. You can see this by adding a print statement to the defined function: the state==A group has length 2 and the B group has length 3.

def f(x):
    print(x)
    return pd.Series(x['city'].values, index=range(len(x)))

s1 = df1.groupby('state').apply(f)

print(s1)
#   city state
# 0    v     A
# 1    w     A
#   city state
# 0    v     A
# 1    w     A
#   city state
# 2    x     B
# 3    y     B
# 4    z     B
# state   
# A      0    v
#        1    w
# B      0    x
#        1    y
#        2    z
# dtype: object

However, you can manipulate the multi-index series outcome by resetting index and thereby adjusting its hierarchical levels:

df = df1.groupby('state').apply(f).reset_index()
print(df)

#   state  level_1  0
# 0     A        0  v
# 1     A        1  w
# 2     B        0  x
# 3     B        1  y
# 4     B        2  z

But more relevant to your needs is unstack(), which pivots a level of the index labels, yielding a data frame. Consider fillna() to fill the None outcome.

df = df1.groupby('state').apply(f).unstack()
print(df)

#        0  1     2
# state            
# A      v  w  None
# B      x  y     z
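
As a rough sketch of the fillna() suggestion, the following fills the missing cell left by unstack() with an empty string. The choice of '' as the filler is an assumption for illustration; depending on the pandas version the missing cell may display as None or NaN, and fillna() handles both.

```python
import pandas as pd

df1 = pd.DataFrame({'state': list("AABBB"), 'city': list("vwxyz")})

def f(x):
    # Same function as above, without the print statement
    return pd.Series(x['city'].values, index=range(len(x)))

# Pivot the inner index level into columns; group A has no third entry
wide = df1.groupby('state').apply(f).unstack()

# Replace the missing cell with an empty string (filler value is an assumption)
filled = wide.fillna('')
print(filled)
```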

Answer 2

Answered by user3582076

Instead of using index=range(len(x)) in your function f, you can use index=x.index to prevent this undesired behavior.
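
A small sketch of this suggestion, reusing the two frames from the question. The function name g is hypothetical. Because each group keeps its own original row labels, the per-group indexes differ even when the groups have equal length, so pandas has nothing to pivot on; this reading is based on the behavior observed with these two frames, not on a documented guarantee.

```python
import pandas as pd

df1 = pd.DataFrame({'state': list("AABBB"), 'city': list("vwxyz")})
df2 = pd.DataFrame({'state': list("AAABBB"), 'city': list("uvwxyz")})

def g(x):
    # Keep the group's original row labels instead of range(len(x))
    return pd.Series(x['city'].values, index=x.index)

r1 = df1.groupby('state').apply(g)  # unequal group sizes
r2 = df2.groupby('state').apply(g)  # equal group sizes
print(type(r1).__name__, type(r2).__name__)
```

The inner index level now holds the original row labels (0 to 5 for df2) rather than positions within each group.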
