pandas: group by, find the maximum within each group, and return the value and its count

Question

Asked by bjelli

I have a pandas DataFrame with log data:

        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web

And I want to find the service on every host that gives the most errors:

        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2

The only solution I found was grouping by host and service, and then iterating over the level 0 of the index.

Can anyone suggest a better, shorter version that avoids the iteration?

# Count the rows for each (host, service) pair.
df = df_logfile.groupby(['host', 'service']).size().to_frame('no')

# One result row per host, filled in by the loop below.
df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = None
df_count['no'] = np.nan

for h, data in df.groupby(level=0):
    i = data['no'].idxmax()      # (host, service) label of the largest count
    service = i[1]
    no = data.loc[i, 'no']
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[df_count['host'] == h, 'no'] = no

full code https://gist.github.com/bjelline/d8066de66e305887b714

Answer 1

Accepted answer by unutbu

Given df, the next step is to group by the host value alone and aggregate with idxmax. This gives you, for each host, the index label that corresponds to the greatest service count. You can then use df.loc[...] to select the rows in df that correspond to those greatest counts:

import numpy as np
import pandas as pd

df_logfile = pd.DataFrame({ 
    'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net', 
              'other.net', 'other.net'],
    'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] })

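# Count the rows for each (host, service) pair; the count goes in column 'no'.
# (Passing a renaming dict such as {'no': 'count'} to agg on a single column
# is rejected by current pandas releases, so size() is used instead.)
df = df_logfile.groupby(['host', 'service']).size().to_frame('no')
# For each host, find the (host, service) index label with the largest count.
mask = df.groupby(level=0).agg('idxmax')
# Select those rows and turn the index levels back into ordinary columns.
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))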

This yields the DataFrame:

        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
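
For comparison, here is a minimal sketch of a different route to the same table that does not use idxmax. It is not part of the original answer. It assumes the same df_logfile sample data as above; the names counts and top are made up for illustration. The idea is to count each (host, service) pair, sort by count in descending order, and keep the first row seen for each host.

```python
import pandas as pd

# Same sample data as in the answer above.
df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

# Count the rows for each (host, service) pair as a flat DataFrame.
counts = (
    df_logfile.groupby(['host', 'service'])
    .size()
    .reset_index(name='no')
)

# Sort so the largest count comes first, then keep the first row per host.
# mergesort is a stable sort, so rows with equal counts keep their
# original relative order.
top = (
    counts.sort_values('no', ascending=False, kind='mergesort')
    .drop_duplicates('host')
    .sort_values('host')
    .reset_index(drop=True)
)
print(top)
```

Like the idxmax version, this keeps a single service per host. If two services tie for the maximum on a host, only one of them is kept.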
