Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original question on StackOverflow: http://stackoverflow.com/questions/26701849/

Time: 2020-09-13 22:37:33  Source: igfitidea

Pandas, groupby and finding maximum in groups, returning value and count

python, numpy, pandas

Asked by bjelli

I have a pandas DataFrame with log data:

        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web

And I want to find the service on every host that gives the most errors:

        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2

The only solution I found was grouping by host and service, and then iterating over the level 0 of the index.

Can anyone suggest a better, shorter version, without the iteration?

# Count log lines per (host, service) pair.
df = df_logfile.groupby(['host','service']).size().to_frame('no')

# One row per host; service and count are filled in by the loop below.
df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = None
df_count['no'] = np.nan

for h, data in df.groupby(level=0):
    i = data['no'].idxmax()     # (host, service) label with the highest count
    service = i[1]
    no = data.loc[i, 'no']
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no

Full code: https://gist.github.com/bjelline/d8066de66e305887b714

Accepted answer by unutbu

Given df, the next step is to group by the host value alone and aggregate with idxmax. This gives you the index label that corresponds to the greatest service count within each host. You can then use df.loc[...] to select the rows in df that correspond to those greatest counts:

import numpy as np
import pandas as pd

df_logfile = pd.DataFrame({ 
    'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net', 
              'other.net', 'other.net'],
    'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] })

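# Count rows per (host, service); the result is indexed by both columns.
df = df_logfile.groupby(['host','service']).size().to_frame('no')
# For each host (index level 0), find the (host, service) label with the highest count.
mask = df.groupby(level=0).agg('idxmax')
# Select those rows, then turn the index back into ordinary columns.
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))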

This yields the DataFrame:

        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
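
The same group-then-idxmax idea can also be wrapped in a small function that works on a Series of counts instead of a one-column DataFrame. This is only a sketch: the helper name top_service_per_host is made up for illustration and is not part of pandas or of the answer above. It assumes the input has 'host' and 'service' columns, as in the question.

```python
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com',
             'other.net', 'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web'],
})


def top_service_per_host(logs):
    """Return one row per host with its most frequent service and the count.

    Hypothetical helper for illustration; assumes 'host' and 'service' columns.
    """
    # Series of row counts indexed by (host, service).
    counts = logs.groupby(['host', 'service']).size()
    # For each host, the (host, service) label holding the largest count.
    winners = counts.groupby(level='host').idxmax()
    # Keep only the winning labels and move the index back into columns.
    return counts.loc[winners.tolist()].reset_index(name='no')


df_count = top_service_per_host(df_logfile)
print(df_count)
```

If two services tie on a host, idxmax reports only one of them, so a host never appears twice in the result. Whether that is acceptable depends on how ties should be treated in the log analysis.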