In pandas, the grouping column disappears after groupby

Question

Asked by O. San

I have the following dataframe named ttm:

    usersidid   clienthostid    eventSumTotal   LoginDaysSum    score
0       12          1               60              3           1728
1       11          1               240             3           1331
3       5           1               5               3           125
4       6           1               16              2           216
2       10          3               270             3           1000
5       8           3               18              2           512

When I run

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

I get what I expected (though I would have liked the results to be under a new label named 'ratio'):

       clienthostid  LoginDaysSum
0             1          4
1             3          2

But when I run

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

I get:

0    1.0
1    1.5

Where did the labels go? I still need the grouping column 'clienthostid', and I also need the results of the apply to be under a label.
Sometimes when I do a groupby some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
In the example above, count put the results under the label 'LoginDaysSum'. Is there a way to give the results a new label instead?

Thank you,

Answer 1

Answered by jezrael

There are two possible ways to get a DataFrame back after groupby:

the parameter as_index=False, which works well with the count, sum and mean functions
reset_index, which creates new columns from the levels of the index and is the more general solution

df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2

df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2

For the second case, remove as_index=False and add reset_index instead:

#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

print (type(a))
<class 'pandas.core.series.Series'>

print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')


df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
   clienthostid  ratio
0             1    1.0
1             3    1.5
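
The question also asked how to put the count under a new label. A minimal, self-contained sketch of both cases follows: it rebuilds ttm from the values shown in the question, repeats the reset_index(name='ratio') approach, and uses named aggregation for the count. The label n_rows is a hypothetical name chosen for illustration, not something from the original post.

```python
import pandas as pd

# Rebuild the question's frame (same values and row order as shown above).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# apply on a single column returns a Series indexed by the group key;
# reset_index(name=...) turns the key into a column and names the values.
ratio = (
    ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]
    .apply(lambda x: x.iloc[0] / x.iloc[1])
    .reset_index(name="ratio")
)

# Named aggregation relabels a built-in result in one step.
# "n_rows" is a hypothetical label chosen for this sketch.
counts = ttm.groupby("clienthostid", as_index=False, sort=False).agg(
    n_rows=("LoginDaysSum", "count")
)

print(ratio)
print(counts)
```

Named aggregation only covers aggregations that pandas can express as a function per column; for an arbitrary lambda passed to apply, reset_index(name=...) remains the simpler route.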

Why are some columns gone?

I think the cause may be the automatic exclusion of nuisance columns:

#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             1             60             3   1728
1      11aa             1            240             3   1331
3       5aa             1              5             3    125
4       6aa             1             16             2    216
2      10aa             3            270             3   1000
5       8aa             3             18             2    512

#the str column usersidid is removed from the result
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid                                    
1                       321            11   3400
3                       288             5   1512
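
The output above reflects the pandas version this answer was written with. To my knowledge, pandas 2.0 and later no longer drop such columns silently, so it may be safer to state the intent explicitly. The sketch below uses a small hypothetical frame (not the full ttm) and shows two ways to do that: numeric_only=True, or selecting the columns before aggregating.

```python
import pandas as pd

# Small hypothetical frame with one string column and one numeric column.
df = pd.DataFrame(
    {
        "usersidid": ["12aa", "11aa", "10aa"],
        "clienthostid": [1, 1, 3],
        "score": [1728, 1331, 1000],
    }
)

# numeric_only=True asks for only numeric columns to be aggregated,
# so the string column is left out explicitly rather than implicitly.
numeric = df.groupby("clienthostid", sort=False).sum(numeric_only=True)

# Selecting the columns up front is another way to make the choice explicit.
selected = df.groupby("clienthostid", sort=False)[["score"]].sum()

print(numeric)
print(selected)
```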

See also: What is the difference between size and count in pandas?

Answer 2

Answered by piRSquared

count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things you specify also determine what the output looks like.

#                         For a built in method, when
#                         you don't want the group column
#                         as the index, pandas keeps it in
#                         as a column.
#                             |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

   clienthostid  LoginDaysSum
0             1             4
1             3             2

#                         For a built in method, when
#                         you do want the group column
#                         as the index, then...
#                             |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
#                                                       |-----||||-----|
#                                                 the single brackets tell
#                                                 pandas to operate on a series
#                                                 in this case, count the series

clienthostid
1    4
3    2
Name: LoginDaysSum, dtype: int64

ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
#                                                       |------||||------|
#                                             the double brackets tell pandas
#                                                to operate on the dataframe
#                                              specified by these columns and will
#                                                return a dataframe

              LoginDaysSum
clienthostid              
1                        4
3                        2
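
The bracket distinction above can be checked directly by inspecting the result types. This is a minimal sketch on a small hypothetical frame, not the original ttm:

```python
import pandas as pd

# Small hypothetical frame with the same column names as the question.
df = pd.DataFrame({"clienthostid": [1, 1, 3], "LoginDaysSum": [3, 2, 3]})
g = df.groupby("clienthostid", sort=False)

as_series = g["LoginDaysSum"].count()    # single brackets -> Series
as_frame = g[["LoginDaysSum"]].count()   # double brackets -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

In both cases the group key ends up in the index, because as_index defaults to True.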

When you use apply, pandas no longer knows what to do with the group column if you pass as_index=False. It has to trust that apply returns exactly what you told it to return, so it just throws the group column away. You also have single brackets around your column, which means the operation runs on a Series. Instead, use as_index=True to keep the grouping column in the index, then follow up with reset_index to move it from the index back into the DataFrame. At that point it no longer matters that you used single brackets, because after reset_index you have a DataFrame again.

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

   clienthostid  LoginDaysSum
0             1           1.0
1             3           1.5

Answer 3

Answered by the_RR

Reading the groupby documentation, I found that automatic exclusion of columns after groupby is usually caused by the presence of null values in the excluded columns.

Try filling the nulls with some value.

Like this:

df.fillna('')
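
One related behavior that is easy to verify: by default, groupby drops rows whose grouping key is null. The sketch below does not test the claim about whole columns; it only shows, on a small hypothetical frame, how filling the key or passing dropna=False (assuming pandas 1.1 or later) keeps those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where one grouping key is missing.
df = pd.DataFrame({"key": ["a", np.nan, "a"], "val": [1, 2, 3]})

# Default behavior: the row with the NaN key is left out of every group.
default = df.groupby("key")["val"].sum()

# dropna=False keeps NaN as its own group.
kept = df.groupby("key", dropna=False)["val"].sum()

# Filling the key first gives the missing rows a regular label.
filled = df.fillna({"key": "missing"}).groupby("key")["val"].sum()

print(default)
print(kept)
print(filled)
```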
