Original question: http://stackoverflow.com/questions/26701849/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas, groupby and finding maximum in groups, returning value and count
Asked by bjelli
I have a pandas DataFrame with log data:
        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web
And I want to find the service on every host that gives the most errors:
        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2
The only solution I found was grouping by host and service, and then iterating over the level 0 of the index.
Can anyone suggest a better, shorter version, without the iteration?
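# Count the log entries for each (host, service) pair
df = df_logfile.groupby(['host','service']).agg({'service':np.size})

# Prepare the result frame: one row per host, to be filled in below
df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = np.nan
df_count['no'] = np.nan

# Iterate over the hosts (level 0 of the index) and pick the largest count
for h,data in df.groupby(level=0):
    i = data.idxmax()[0]      # (host, service) label of the largest count
    service = i[1]
    no = data.xs(i)[0]        # the count itself
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no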
Full code: https://gist.github.com/bjelline/d8066de66e305887b714
Accepted answer by unutbu
Given df (the counts per host and service), the next step is to group by the host value alone and aggregate with idxmax. This gives you, for each host, the index label that corresponds to the greatest service count. You can then use df.loc[...] to select the rows in df which correspond to those greatest counts:
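import pandas as pd

df_logfile = pd.DataFrame({
    'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
              'other.net', 'other.net'],
    'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] })

# Count the rows per (host, service) pair into a column named 'no'.
# Named aggregation needs pandas 0.25+; the original dict form,
# agg({'no':'count'}), was removed in pandas 1.0.
df = df_logfile.groupby(['host','service'])['service'].agg(no='count')

# For each host, find the (host, service) label with the largest count
mask = df.groupby(level=0).agg('idxmax')

# Select those rows and turn the index levels back into columns
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))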
yields the DataFrame
        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
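For comparison, here is a minimal sketch of the same group-then-idxmax idea that works on a Series of counts instead of a one-column DataFrame. It is not part of the original answer; the names counts and top_labels are chosen here purely for illustration, and the sample data is the same as above.

```python
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com',
             'other.net', 'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web'],
})

# Count rows per (host, service) pair; the result is a Series with a MultiIndex.
counts = df_logfile.groupby(['host', 'service']).size()

# For each host, idxmax returns the (host, service) label of the largest count.
top_labels = counts.groupby(level='host').idxmax()

# Select those labels and turn the index levels back into columns,
# naming the count column 'no' as in the question.
df_count = counts.loc[top_labels.tolist()].reset_index(name='no')
print(df_count)
```

Note that idxmax returns the first label when two services on a host have the same count, so ties are resolved by index order rather than reported.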

