Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/41658498/

Date: 2020-09-14 02:46:38  Source: igfitidea

In Pandas, after groupby the grouped column is gone

python, pandas

Asked by O. San

I have the following dataframe named ttm:

    usersidid   clienthostid    eventSumTotal   LoginDaysSum    score
0       12          1               60              3           1728
1       11          1               240             3           1331
3       5           1               5               3           125
4       6           1               16              2           216
2       10          3               270             3           1000
5       8           3               18              2           512

When I run

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):

   clienthostid  LoginDaysSum
0             1             4
1             3             2

But when I do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

I get:

0    1.0
1    1.5

  1. Why did the labels disappear? I still need the grouped 'clienthostid' column, and I also need the results of the apply to be under a label.
  2. Sometimes when I do a groupby, some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
  3. In the example I gave, when I used count the results appeared under the label 'LoginDaysSum'. Is there a way to put the results under a new label instead?

Thank you,

Answered by jezrael

To get a DataFrame back after groupby, there are two possible solutions:

  1. The parameter as_index=False, which works nicely with the count, sum, and mean functions

  2. reset_index, which creates new columns from the levels of the index; this is the more general solution

df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
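
The two approaches above can be sketched as a self-contained script. The frame is rebuilt from the table in the question, and the names df_a, df_b and counted are assumptions chosen for illustration. The last step uses named aggregation (available in pandas 0.25 and later) as one possible way to put the count under a label such as 'ratio'; it is not part of the original answer.

```python
import pandas as pd

# Rebuild the question's frame; the index mirrors the row labels shown there.
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Solution 1: keep the group key as a column with as_index=False.
df_a = ttm.groupby(["clienthostid"], as_index=False, sort=False)["LoginDaysSum"].count()

# Solution 2: let the key become the index, then move it back into a column.
df_b = ttm.groupby(["clienthostid"], sort=False)["LoginDaysSum"].count().reset_index()

# Named aggregation: the keyword ("ratio") becomes the output column label.
counted = (
    ttm.groupby("clienthostid", sort=False)
    .agg(ratio=("LoginDaysSum", "count"))
    .reset_index()
)

print(df_a)
print(df_b)
print(counted)
```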


For the second case (apply), remove as_index=False and add reset_index instead:

#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

print (type(a))
<class 'pandas.core.series.Series'>

print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')


df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
   clienthostid  ratio
0             1    1.0
1             3    1.5
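
One caveat with the lambda above: x.iloc[1] raises an IndexError for any group with fewer than two rows. A minimal sketch of a guarded version follows; the helper name first_over_second and the extra single-row client (clienthostid 7) are hypothetical and added only to show the guard.

```python
import pandas as pd

# Only the two columns needed here; the last row (client 7) is a made-up
# single-row group that is not in the original data.
ttm = pd.DataFrame(
    {
        "clienthostid": [1, 1, 1, 1, 3, 3, 7],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2, 4],
    }
)


def first_over_second(x):
    """Return first value / second value, or NaN if the group is too short."""
    if len(x) < 2:
        return float("nan")
    return x.iloc[0] / x.iloc[1]


# apply returns a Series indexed by clienthostid; reset_index(name=...) turns
# the index back into a column and labels the values 'ratio'.
df1 = (
    ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]
    .apply(first_over_second)
    .reset_index(name="ratio")
)
print(df1)
```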

Why are some columns gone?

I think the problem may be the automatic exclusion of nuisance columns:

#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             1             60             3   1728
1      11aa             1            240             3   1331
3       5aa             1              5             3    125
4       6aa             1             16             2    216
2      10aa             3            270             3   1000
5       8aa             3             18             2    512

#the str column usersidid is removed from the result
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid                                    
1                       321            11   3400
3                       288             5   1512
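
The silent dropping shown above depends on the pandas version; to my knowledge, releases from 2.0 onward no longer exclude such columns automatically. A hedged sketch of making the exclusion explicit with the numeric_only parameter of sum, which should give the same totals regardless of version (the variable name totals is an assumption):

```python
import pandas as pd

# Same data as above, with usersidid already converted to strings.
ttm = pd.DataFrame(
    {
        "usersidid": ["12aa", "11aa", "5aa", "6aa", "10aa", "8aa"],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    }
)

# numeric_only=True asks for non-numeric columns to be left out explicitly,
# instead of relying on version-dependent default behaviour.
totals = ttm.groupby("clienthostid", sort=False).sum(numeric_only=True)
print(totals)
```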

What is the difference between size and count in pandas?

Answered by piRSquared

count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things are specified that determine what the output looks like.

#                         For a built in method, when
#                         you don't want the group column
#                         as the index, pandas keeps it in
#                         as a column.
#                             |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

   clienthostid  LoginDaysSum
0             1             4
1             3             2


#                         For a built in method, when
#                         you do want the group column
#                         as the index, then...
#                             |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
#                                                       |-----||||-----|
#                                                 the single brackets tells
#                                                 pandas to operate on a series
#                                                 in this case, count the series

clienthostid
1    4
3    2
Name: LoginDaysSum, dtype: int64


ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
#                                                       |------||||------|
#                                             the double brackets tells pandas
#                                                to operate on the dataframe
#                                              specified by these columns and will
#                                                return a dataframe

              LoginDaysSum
clienthostid              
1                        4
3                        2
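
The single-bracket versus double-bracket distinction can be checked directly by inspecting the result types. This is a small sketch on a two-column copy of the question's data; the names g, as_series and as_frame are assumptions.

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
    }
)

g = ttm.groupby(["clienthostid"], as_index=True, sort=False)

# Single brackets select one column, so the result is a Series.
as_series = g["LoginDaysSum"].count()

# Double brackets select a list of columns, so the result is a DataFrame.
as_frame = g[["LoginDaysSum"]].count()

print(type(as_series).__name__)
print(type(as_frame).__name__)

# In both cases the group key lives in the index, not in a column.
print(as_series.index.name, as_frame.index.name)
```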


When you use apply, pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply, you want returned exactly what you say to return, so it just throws the group column away. Also, you have single brackets around your column, which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets, because after the reset_index you'll have a dataframe again.

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64


ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

   clienthostid  LoginDaysSum
0             1           1.0
1             3           1.5

Answered by the_RR

Reading the groupby documentation, I found out that the automatic exclusion of columns after groupby is usually caused by the presence of null values in the excluded columns.

Try filling the nulls with some value.

Like this:

df.fillna('')
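
Note that fillna returns a new object by default rather than modifying df, so the result has to be assigned before grouping. A minimal sketch with a made-up frame (the column names key, label and value are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one missing value in a text column.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "label": ["x", None, "y"],
        "value": [1, 2, 3],
    }
)

# fillna does not change df in place; keep the returned copy.
filled = df.fillna("")

print(df["label"].isna().sum())      # nulls remaining in the original
print(filled["label"].isna().sum())  # nulls remaining in the filled copy
```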