Python pandas idxmax for multiple indexes in a dataframe

Source: Stack Overflow, http://stackoverflow.com/questions/27914360/ (via igfitidea, 2020-09-13 22:50:18). Provided under the CC BY-SA 4.0 license: you are free to use and share it, but you must attribute it to the original authors.

Tags: python, pandas, multi-index

Asked by Alex Petralia

I have a series that looks like this:

            delivery
2007-04-26  706           23
2007-04-27  705           10
            706         1089
            708           83
            710           13
            712           51
            802            4
            806            1
            812            3
2007-04-29  706           39
            708            4
            712            1
2007-04-30  705            3
            706         1016
            707            2
...
2014-11-04  1412          53
            1501           1
            1502           1
            1512           1
2014-11-05  1411          47
            1412        1334
            1501          40
            1502         433
            1504         126
            1506         100
            1508           7
            1510           6
            1512          51
            1604           1
            1612           5
Length: 26255, dtype: int64

where the query is: df.groupby([df.index.date, 'delivery']).size()

For each day, I need to pull out the delivery number which has the most volume. I feel like it would be something like:

df.groupby([df.index.date, 'delivery']).size().idxmax(axis=1)

However, this just returns the idxmax for the entire dataframe. Instead, I need the second-level idxmax (not the date but the delivery number) for each day, so the result should be a vector with one entry per day.

Any ideas on how to accomplish this?

Answered by knightofni

Your example code doesn't work because idxmax is executed after the groupby operation, so it runs on the whole result rather than within each date.
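
To make this concrete, here is a minimal sketch on a small hypothetical series shaped like the one in the question (the dates and counts below are made up for illustration). Calling idxmax on the already-aggregated counts yields a single (date, delivery) label for the whole series, not one label per day.

```python
import pandas as pd

# Hypothetical data shaped like the question's series: (date, delivery) -> count.
index = pd.MultiIndex.from_tuples(
    [
        ("2007-04-26", 706),
        ("2007-04-27", 705),
        ("2007-04-27", 706),
        ("2007-04-29", 706),
        ("2007-04-29", 708),
    ],
    names=["date", "delivery"],
)
counts = pd.Series([23, 10, 1089, 39, 4], index=index)

# idxmax on the whole series returns one label: the (date, delivery)
# pair holding the global maximum.
overall = counts.idxmax()
print(overall)
```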

I'm not sure how to use idxmax on multilevel indexes, so here's a simple workaround.

Setting up the data:

import pandas as pd

# Keys are listed so the column order matches the output below.
d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89]}

df = pd.DataFrame.from_dict(d, orient='columns').set_index('Date')
print(df)

Output:

            DeliveryCount  DeliveryNb
Date                                 
2007-04-26             23         706
2007-04-27             10         705
2007-04-27           1089         708
2007-04-27             82         450
2007-04-27             34         283
2007-04-28            100          45
2007-04-28             11          89

Creating a custom function:

The trick is to use the reset_index() method, so that idxmax returns the integer position of the maximum within each group.

def func(df):
    idx = df.reset_index()['DeliveryCount'].idxmax()
    return df['DeliveryNb'].iloc[idx]

Applying it:

g = df.groupby(df.index)
g.apply(func)

Result:

Date
2007-04-26    706
2007-04-27    708
2007-04-28     45
dtype: int64
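
As a possible variant (not part of the original answer), the same result can be sketched without a custom function. This assumes the same data as above; the names flat, best_rows, and result are chosen here for illustration. Resetting the index first gives every row a unique integer label, so the grouped idxmax can be fed straight into loc.

```python
import pandas as pd

d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89]}
df = pd.DataFrame(d).set_index('Date')

# Move 'Date' back into a column so each row has a unique integer label.
flat = df.reset_index()

# For each date, the row label holding the largest DeliveryCount.
best_rows = flat.groupby('Date')['DeliveryCount'].idxmax()

# Look those rows up and keep the delivery number, indexed by date.
result = flat.loc[best_rows].set_index('Date')['DeliveryNb']
print(result)
```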

Answered by unutbu

Suppose you have this series:

            delivery
2001-01-02  0           2
            1           3
            6           2
            7           2
            9           3
2001-01-03  3           2
            6           1
            7           1
            8           3
            9           1
dtype: int64

If you want one delivery per date with the maximum value, you could use idxmax:

dates = series.index.get_level_values(0)
series.loc[series.groupby(dates).idxmax()]

yields

            delivery
2001-01-02  1           3
2001-01-03  8           3
dtype: int64
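
The question asks for just the delivery number per day rather than the matching rows. A minimal sketch, assuming the example series above rebuilt by hand (string dates stand in for the date objects, and top_delivery is a name chosen here): each value returned by the grouped idxmax is a (date, delivery) tuple, so the second element can be extracted with map.

```python
import pandas as pd

# Hand-built copy of the example series: (date, delivery) -> count.
index = pd.MultiIndex.from_tuples(
    [("2001-01-02", d) for d in (0, 1, 6, 7, 9)]
    + [("2001-01-03", d) for d in (3, 6, 7, 8, 9)],
    names=[None, "delivery"],
)
series = pd.Series([2, 3, 2, 2, 3, 2, 1, 1, 3, 1], index=index)

dates = series.index.get_level_values(0)

# One (date, delivery) tuple per date: the label of the first maximum.
winners = series.groupby(dates).idxmax()

# Keep only the second level of each label, i.e. the delivery number.
top_delivery = winners.map(lambda label: label[1])
print(top_delivery)
```

Note that idxmax keeps only the first maximum per date, so ties (such as deliveries 1 and 9 on 2001-01-02) are reduced to a single entry.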

If you want all deliveries per date with the maximum value, use transform to generate a boolean mask:

mask = series.groupby(dates).transform(lambda x: x==x.max()).astype('bool')
series.loc[mask]

yields

            delivery
2001-01-02  1           3
            9           3
2001-01-03  8           3
dtype: int64
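
A possible variant of the mask, not from the original answer: transform also accepts the string "max", which broadcasts each date's maximum back to every row so the comparison can be done outside the lambda. The sketch below assumes a hand-built copy of the example series, with string dates in place of date objects.

```python
import pandas as pd

# Hand-built copy of the example series: (date, delivery) -> count.
index = pd.MultiIndex.from_tuples(
    [("2001-01-02", d) for d in (0, 1, 6, 7, 9)]
    + [("2001-01-03", d) for d in (3, 6, 7, 8, 9)],
    names=[None, "delivery"],
)
series = pd.Series([2, 3, 2, 2, 3, 2, 1, 1, 3, 1], index=index)

dates = series.index.get_level_values(0)

# Per-date maximum repeated on every row, then compared element-wise.
mask = series == series.groupby(dates).transform("max")

# All rows that tie for the maximum on their date.
tied = series.loc[mask]
print(tied)
```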


This is the code I used to generate series:

import pandas as pd
import numpy as np

np.random.seed(1)
N = 20
rng = pd.date_range('2001-01-02', periods=N//2, freq='4H')
rng = np.random.choice(rng, N, replace=True)
rng.sort()
df = pd.DataFrame(np.random.randint(10, size=(N,)), columns=['delivery'], index=rng)
series = df.groupby([df.index.date, 'delivery']).size()

Answered by Amir F

If you have the following dataframe (you can always reset the index if needed with df = df.reset_index()):

  Date  Del_Count  Del_Nb
0  1/1      14      19   <
1           11      17
2  2/2      25      29   <
3           21      27
4           22      28
5  3/3      34      36
6           37      37
7           31      39   <

To find the max per Date and extract the relevant Del_Count you can use:

df = df.loc[df.groupby(['Date'], sort=False)['Del_Nb'].idxmax()][['Date','Del_Count','Del_Nb']]

which would yield:

 Date  Del_Count  Del_Nb
0  1/1         14      19
2  2/2         25      29
7  3/3         31      39
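
For reference, a self-contained sketch of this approach. It assumes that the blank Date cells in the table above repeat the date from the row above them (that is how the frame is rebuilt here), and best is a name chosen for illustration.

```python
import pandas as pd

# Rebuilt from the table above; blank Date cells are assumed to repeat
# the previous date.
df = pd.DataFrame({
    'Date': ['1/1', '1/1', '2/2', '2/2', '2/2', '3/3', '3/3', '3/3'],
    'Del_Count': [14, 11, 25, 21, 22, 34, 37, 31],
    'Del_Nb': [19, 17, 29, 27, 28, 36, 37, 39],
})

# Row label of the largest Del_Nb within each date, in original date order.
best_labels = df.groupby('Date', sort=False)['Del_Nb'].idxmax()

# Select those rows; the index is unique, so loc returns one row per date.
best = df.loc[best_labels, ['Date', 'Del_Count', 'Del_Nb']]
print(best)
```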