In Pandas, after groupby the grouped column is gone
Source: http://stackoverflow.com/questions/41658498/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by O. San
I have the following dataframe named ttm:
   usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0         12             1             60             3   1728
1         11             1            240             3   1331
3          5             1              5             3    125
4          6             1             16             2    216
2         10             3            270             3   1000
5          8             3             18             2    512
When I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):
   clienthostid  LoginDaysSum
0             1             4
1             3             2
But when I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
I get:
0 1.0
1 1.5
- Why did the labels go? I still need the grouping column 'clienthostid', and I also need the results of the apply to be under a label.
- Sometimes when I do a groupby some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
- In the example I gave, when I did count the results showed under the label 'LoginDaysSum'. Is there a way to add a new label for the results instead?
Thank you,
Answered by jezrael
To get a DataFrame back after groupby, there are 2 possible solutions:
- the parameter as_index=False, which works nicely with the count, sum and mean functions
- reset_index, which creates new columns from the levels of the index and is the more general solution
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
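To compare the two variants side by side, here is a minimal self-contained sketch. The frame is rebuilt from the table in the question. The names by_flag and by_reset are only illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the question's frame (values copied from the table in the question).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Variant 1: keep the group key as a column via as_index=False.
by_flag = ttm.groupby("clienthostid", as_index=False, sort=False)["LoginDaysSum"].count()

# Variant 2: let the key become the index, then move it back with reset_index.
by_reset = ttm.groupby("clienthostid", sort=False)["LoginDaysSum"].count().reset_index()

print(by_flag)
print(by_reset)
```

Both variants should give a two-column DataFrame with clienthostid and LoginDaysSum.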
For the second case, remove as_index=False and add reset_index instead:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class 'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
clienthostid ratio
0 1 1.0
1 3 1.5
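The same name= argument of Series.reset_index also seems to cover the third question (a new label for the count result). A minimal sketch, with the frame rebuilt from the question. The variable names ratio and counts are illustrative assumptions.

```python
import pandas as pd

# Only the two columns needed here, copied from the question's table.
ttm = pd.DataFrame(
    {
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
    },
    index=[0, 1, 3, 4, 2, 5],
)

grouped = ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]

# apply returns a Series indexed by clienthostid; reset_index(name=...)
# turns it into a DataFrame and labels the value column at the same time.
ratio = grouped.apply(lambda s: s.iloc[0] / s.iloc[1]).reset_index(name="ratio")

# The same trick relabels the result of a built-in aggregation such as count.
counts = grouped.count().reset_index(name="ratio")

print(ratio)
print(counts)
```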
Why are some columns gone?
I think the cause may be the automatic exclusion of nuisance columns:
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             1             60             3   1728
1      11aa             1            240             3   1331
3       5aa             1              5             3    125
4       6aa             1             16             2    216
2      10aa             3            270             3   1000
5       8aa             3             18             2    512
#removed str column userid
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid
1                       321            11   3400
3                       288             5   1512
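Whether non-numeric columns are dropped automatically may depend on the pandas version, so a sketch that does not rely on that behaviour is to say explicitly which columns to aggregate. The list numeric_cols is an illustrative assumption. numeric_only=True is an existing argument of the groupby sum method.

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Same conversion as in the answer: make usersidid a string column.
ttm["usersidid"] = ttm["usersidid"].astype(str) + "aa"

# Option 1: select the columns to aggregate explicitly.
numeric_cols = ["eventSumTotal", "LoginDaysSum", "score"]
totals = ttm.groupby("clienthostid", sort=False)[numeric_cols].sum()

# Option 2: ask sum to consider numeric columns only.
totals_numeric_only = ttm.groupby("clienthostid", sort=False).sum(numeric_only=True)

print(totals)
print(totals_numeric_only)
```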
Answered by piRSquared
count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things you specify determine what the output looks like.
# For a built in method, when
# you don't want the group column
# as the index, pandas keeps it in
# as a column.
#                             |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
clienthostid LoginDaysSum
0 1 4
1 3 2
# For a built in method, when
# you do want the group column
# as the index, then...
#                             |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
#                                                       |-----||||-----|
#                                                       the single brackets tell
#                                                       pandas to operate on a series
#                                                       in this case, count the series
clienthostid
1 4
3 2
Name: LoginDaysSum, dtype: int64
ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
#                                                       |------||||------|
#                                                       the double brackets tell pandas
#                                                       to operate on the dataframe
#                                                       specified by these columns and will
#                                                       return a dataframe
LoginDaysSum
clienthostid
1 4
3 2
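The single versus double bracket difference can be checked directly by looking at the type of each result. A minimal sketch on the question's data. The names as_series and as_frame are illustrative.

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
    },
    index=[0, 1, 3, 4, 2, 5],
)

g = ttm.groupby("clienthostid", sort=False)

# Single brackets select one column, so the aggregation yields a Series.
as_series = g["LoginDaysSum"].count()

# Double brackets select a list of columns, so the aggregation yields a DataFrame.
as_frame = g[["LoginDaysSum"]].count()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

In both cases the group key lives in the index, so a reset_index on either result brings clienthostid back as a column.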
When you use apply, pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply you want returned exactly what you say to return, so it will just throw the group column away. Also, you have single brackets around your column, which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets, because after the reset_index you'll have a dataframe again.
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
clienthostid LoginDaysSum
0 1 1.0
1 3 1.5
Answered by the_RR
Reading the groupby documentation, I found out that the automatic exclusion of columns after groupby is usually caused by the presence of null values in the excluded columns.
Try filling the nulls with some value, like this:
df.fillna('')
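One detail worth noting: fillna returns a new object by default, so the result has to be assigned before grouping. A minimal sketch on a made-up frame. The columns key and value and the fill value 0 are assumptions for illustration, and the right fill value depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, np.nan, 3.0]})

# fillna does not modify df; keep the returned frame.
filled = df.fillna(0)

print(df["value"].isna().sum())      # missing values left in the original
print(filled["value"].isna().sum())  # missing values left in the filled copy

totals = filled.groupby("key")["value"].sum().reset_index()
print(totals)
```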