Python pandas idxmax for multiple indexes in a dataframe
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/27914360/
Asked by Alex Petralia
I have a series that looks like this:
            delivery
2007-04-26  706           23
2007-04-27  705           10
            706         1089
            708           83
            710           13
            712           51
            802            4
            806            1
            812            3
2007-04-29  706           39
            708            4
            712            1
2007-04-30  705            3
            706         1016
            707            2
...
2014-11-04  1412          53
            1501           1
            1502           1
            1512           1
2014-11-05  1411          47
            1412        1334
            1501          40
            1502         433
            1504         126
            1506         100
            1508           7
            1510           6
            1512          51
            1604           1
            1612           5
Length: 26255, dtype: int64
where the query is: df.groupby([df.index.date, 'delivery']).size()
For each day, I need to pull out the delivery number which has the most volume. I feel like it would be something like:
df.groupby([df.index.date, 'delivery']).size().idxmax(axis=1)
However, this just returns the idxmax for the entire dataframe. Instead, I need the second-level idxmax (not the date but the delivery number) for each day, i.e. the result should be a vector with one entry per day.
Any ideas on how to accomplish this?
Answered by knightofni
Your example code doesn't work because idxmax is applied to the result of the groupby operation, so it runs over the whole dataframe rather than once per day.
I'm not sure how to use idxmax on multilevel indexes, so here's a simple workaround.
Setting up data:
import pandas as pd
d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89]}
df = pd.DataFrame.from_dict(d, orient='columns').set_index('Date')
print(df)
Output:
            DeliveryCount  DeliveryNb
Date
2007-04-26             23         706
2007-04-27             10         705
2007-04-27           1089         708
2007-04-27             82         450
2007-04-27             34         283
2007-04-28            100          45
2007-04-28             11          89
Creating a custom function:
The trick is to use the reset_index() method, so that each group gets a plain integer index and idxmax returns a position usable with iloc.
def func(df):
    # Number the group's rows 0..n-1, then find the position of the
    # row with the largest DeliveryCount
    idx = df.reset_index()['DeliveryCount'].idxmax()
    # Return the delivery number stored at that position
    return df['DeliveryNb'].iloc[idx]
Applying it:
g = df.groupby(df.index)
g.apply(func)
Result:
Date
2007-04-26 706
2007-04-27 708
2007-04-28 45
dtype: int64
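As a possible alternative to the custom function, the same lookup can be sketched without apply. This is only a sketch under one assumption that differs from the setup above: Date is kept as an ordinary column, so the row index is unique and the labels returned by idxmax can be passed straight to .loc. The names best_rows and result are illustrative.

```python
import pandas as pd

d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89]}

# Assumption: Date stays a column, so the default integer index is unique
df = pd.DataFrame(d)

# Row label of the largest DeliveryCount within each Date
best_rows = df.groupby('Date')['DeliveryCount'].idxmax()

# Look those rows up and keep a Date -> DeliveryNb mapping
result = df.loc[best_rows].set_index('Date')['DeliveryNb']
print(result)
```

With a non-unique Date index, idxmax would return a date label that matches several rows, which is why this sketch keeps the default integer index.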
Answered by unutbu
Suppose you have this series:
            delivery
2001-01-02  0           2
            1           3
            6           2
            7           2
            9           3
2001-01-03  3           2
            6           1
            7           1
            8           3
            9           1
dtype: int64
If you want one delivery per date with the maximum value, you could use idxmax:
dates = series.index.get_level_values(0)
series.loc[series.groupby(dates).idxmax()]
yields
            delivery
2001-01-02  1           3
2001-01-03  8           3
dtype: int64
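If only the delivery number per date is wanted (the vector the question asks for), one way to get it is to take the second element of each (date, delivery) label that idxmax returns. The sketch below rebuilds the example series by hand; plain strings stand in for the dates, which is an assumption made for brevity, and top_delivery is an illustrative name.

```python
import pandas as pd

# Hand-built copy of the example series; string dates are an assumption
index = pd.MultiIndex.from_tuples(
    [("2001-01-02", 0), ("2001-01-02", 1), ("2001-01-02", 6),
     ("2001-01-02", 7), ("2001-01-02", 9),
     ("2001-01-03", 3), ("2001-01-03", 6), ("2001-01-03", 7),
     ("2001-01-03", 8), ("2001-01-03", 9)],
    names=[None, "delivery"],
)
series = pd.Series([2, 3, 2, 2, 3, 2, 1, 1, 3, 1], index=index)

# One (date, delivery) label per date: the first row holding the daily maximum
labels = series.groupby(level=0).idxmax()

# Keep only the second-level part of each label
top_delivery = labels.map(lambda label: label[1])
print(top_delivery)
```

When several deliveries tie for the maximum on a date, idxmax keeps only the first one.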
If you want all deliveries per date with the maximum value, use transform to generate a boolean mask:
mask = series.groupby(dates).transform(lambda x: x==x.max()).astype('bool')
series.loc[mask]
yields
            delivery
2001-01-02  1           3
            9           3
2001-01-03  8           3
dtype: int64
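The mask can also be sketched without a lambda by broadcasting each date's maximum back onto its rows with transform("max") and comparing. As before, the series is rebuilt by hand with string dates (an assumption), and daily_max and ties are illustrative names.

```python
import pandas as pd

# Hand-built copy of the example series; string dates are an assumption
index = pd.MultiIndex.from_tuples(
    [("2001-01-02", 0), ("2001-01-02", 1), ("2001-01-02", 6),
     ("2001-01-02", 7), ("2001-01-02", 9),
     ("2001-01-03", 3), ("2001-01-03", 6), ("2001-01-03", 7),
     ("2001-01-03", 8), ("2001-01-03", 9)],
    names=[None, "delivery"],
)
series = pd.Series([2, 3, 2, 2, 3, 2, 1, 1, 3, 1], index=index)

# Each row receives the maximum of its own date
daily_max = series.groupby(level=0).transform("max")

# True wherever a row equals its date's maximum, so ties are kept
mask = series == daily_max
ties = series.loc[mask]
print(ties)
```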
This is the code I used to generate series:
import pandas as pd
import numpy as np
np.random.seed(1)
N = 20
rng = pd.date_range('2001-01-02', periods=N//2, freq='4H')
rng = np.random.choice(rng, N, replace=True)
rng.sort()
df = pd.DataFrame(np.random.randint(10, size=(N,)), columns=['delivery'], index=rng)
series = df.groupby([df.index.date, 'delivery']).size()
Answered by Amir F
If you have the following dataframe (you can always reset the index if needed with df = df.reset_index()):
   Date  Del_Count  Del_Nb
0   1/1         14      19  <
1               11      17
2   2/2         25      29  <
3               21      27
4               22      28
5   3/3         34      36
6               37      37
7               31      39  <
To find the row with the maximum Del_Nb for each Date and extract the corresponding Del_Count, you can use:
df = df.loc[df.groupby(['Date'], sort=False)['Del_Nb'].idxmax()][['Date','Del_Count','Del_Nb']]
Which would yield:
   Date  Del_Count  Del_Nb
0   1/1         14      19
2   2/2         25      29
7   3/3         31      39
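A self-contained sketch of this approach, written with label-based .loc indexing so that it also runs on current pandas versions. It assumes that the blank Date cells in the table above are only a display effect and that every row actually carries its date. The names best_rows and result are illustrative.

```python
import pandas as pd

# Assumption: every row has its Date filled in
df = pd.DataFrame({
    "Date": ["1/1", "1/1", "2/2", "2/2", "2/2", "3/3", "3/3", "3/3"],
    "Del_Count": [14, 11, 25, 21, 22, 34, 37, 31],
    "Del_Nb": [19, 17, 29, 27, 28, 36, 37, 39],
})

# Row label of the largest Del_Nb within each Date, in order of appearance
best_rows = df.groupby("Date", sort=False)["Del_Nb"].idxmax()

# Select those rows and the columns of interest in one .loc call
result = df.loc[best_rows, ["Date", "Del_Count", "Del_Nb"]]
print(result)
```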

