Python pandas idxmax for multiple indexes in a dataframe

Question

Asked by Alex Petralia

I have a series that looks like this:

            delivery
2007-04-26  706           23
2007-04-27  705           10
            706         1089
            708           83
            710           13
            712           51
            802            4
            806            1
            812            3
2007-04-29  706           39
            708            4
            712            1
2007-04-30  705            3
            706         1016
            707            2
...
2014-11-04  1412          53
            1501           1
            1502           1
            1512           1
2014-11-05  1411          47
            1412        1334
            1501          40
            1502         433
            1504         126
            1506         100
            1508           7
            1510           6
            1512          51
            1604           1
            1612           5
Length: 26255, dtype: int64

where the query is: df.groupby([df.index.date, 'delivery']).size()

For each day, I need to pull out the delivery number which has the most volume. I feel like it would be something like:

df.groupby([df.index.date, 'delivery']).size().idxmax(axis=1)

However, this just returns the idxmax for the entire dataframe. Instead, I need the second-level idxmax (the delivery number, not the date) for each day, i.e. the result should be a vector with one entry per day.

Any ideas on how to accomplish this?

Answer 1

Answered by knightofni

Your example code doesn't work because idxmax is executed after the groupby operation has finished, so it runs on the whole result rather than once per group.
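
To illustrate, here is a minimal sketch on a made-up four-row series (the values and the names index, counts and overall are illustrative assumptions, not the asker's data). Calling idxmax directly on the grouped counts returns one (date, delivery) label for the whole series:

```python
import pandas as pd

# Hypothetical miniature of the grouped counts: (date, delivery) -> count.
index = pd.MultiIndex.from_tuples(
    [("2007-04-26", 706), ("2007-04-27", 705),
     ("2007-04-27", 706), ("2007-04-29", 706)],
    names=["date", "delivery"],
)
counts = pd.Series([23, 10, 1089, 39], index=index)

# A plain idxmax sees one flat Series, so it returns a single
# (date, delivery) label instead of one label per date.
overall = counts.idxmax()
print(overall)
```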

I'm not sure how to use idxmax on multilevel indexes, so here's a simple workaround.

Setting up the data:

import pandas as pd

d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11]}

df = pd.DataFrame.from_dict(d, orient='columns').set_index('Date')
print(df)

Output:

            DeliveryCount  DeliveryNb
Date                                 
2007-04-26             23         706
2007-04-27             10         705
2007-04-27           1089         708
2007-04-27             82         450
2007-04-27             34         283
2007-04-28            100          45
2007-04-28             11          89

Creating a custom function:

The trick is to call reset_index() on each group, so that idxmax returns an integer position within the group rather than a (repeated) date label.

def func(df):
    idx = df.reset_index()['DeliveryCount'].idxmax()
    return df['DeliveryNb'].iloc[idx]

Applying it:

g = df.groupby(df.index)
g.apply(func)

Result:

Date
2007-04-26    706
2007-04-27    708
2007-04-28     45
dtype: int64
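
As a variation on the same idea, the positional lookup can also be sketched without reset_index(), by taking argmax on the group's raw values. This is only an illustration under the same sample data; the helper name top_delivery is hypothetical and not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": ["2007-04-26", "2007-04-27", "2007-04-27", "2007-04-27",
                 "2007-04-27", "2007-04-28", "2007-04-28"],
        "DeliveryNb": [706, 705, 708, 450, 283, 45, 89],
        "DeliveryCount": [23, 10, 1089, 82, 34, 100, 11],
    }
).set_index("Date")


def top_delivery(group):
    # argmax on the raw values gives the position (not the label)
    # of the largest count inside this group.
    pos = group["DeliveryCount"].to_numpy().argmax()
    return group["DeliveryNb"].iloc[pos]


# Group on the Date index level and pick one delivery number per date.
result = df.groupby(level="Date").apply(top_delivery)
print(result)
```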

Answer 2

Answered by unutbu

Suppose you have this series:

            delivery
2001-01-02  0           2
            1           3
            6           2
            7           2
            9           3
2001-01-03  3           2
            6           1
            7           1
            8           3
            9           1
dtype: int64

If you want one delivery per date with the maximum value, you could use idxmax:

dates = series.index.get_level_values(0)
series.loc[series.groupby(dates).idxmax()]

yields

            delivery
2001-01-02  1           3
2001-01-03  8           3
dtype: int64
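
If only the delivery number is wanted (rather than the selected rows), one possible sketch is to keep the second element of each label that idxmax returns. The series below is rebuilt by hand from the values shown above; using strings for the dates and naming the levels date and delivery are assumptions made to keep the example self-contained:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [
        ["2001-01-02"] * 5 + ["2001-01-03"] * 5,
        [0, 1, 6, 7, 9, 3, 6, 7, 8, 9],
    ],
    names=["date", "delivery"],
)
series = pd.Series([2, 3, 2, 2, 3, 2, 1, 1, 3, 1], index=index)

# Per date, idxmax returns the full (date, delivery) label
# of the first row holding the maximum.
labels = series.groupby(level="date").idxmax()

# Keep only the second element of each label: the delivery number.
best_delivery = labels.map(lambda label: label[1])
print(best_delivery)
```

Note that idxmax reports only the first maximum per date, so ties are silently dropped here.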

If you want all deliveries per date with the maximum value, use transform to generate a boolean mask:

mask = series.groupby(dates).transform(lambda x: x==x.max()).astype('bool')
series.loc[mask]

yields

            delivery
2001-01-02  1           3
            9           3
2001-01-03  8           3
dtype: int64
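
The same tie-preserving selection can be sketched with the built-in "max" aggregation instead of a lambda. This is an equivalent formulation under the hand-built series assumed below (string dates, level names date and delivery), not the answer's original code:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [
        ["2001-01-02"] * 5 + ["2001-01-03"] * 5,
        [0, 1, 6, 7, 9, 3, 6, 7, 8, 9],
    ],
    names=["date", "delivery"],
)
series = pd.Series([2, 3, 2, 2, 3, 2, 1, 1, 3, 1], index=index)

# Broadcast each date's maximum back onto every row of that date.
per_date_max = series.groupby(level="date").transform("max")

# Rows equal to their date's maximum are kept, so ties survive.
ties = series[series == per_date_max]
print(ties)
```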

This is the code I used to generate series:

import pandas as pd
import numpy as np

np.random.seed(1)
N = 20
rng = pd.date_range('2001-01-02', periods=N//2, freq='4H')
rng = np.random.choice(rng, N, replace=True)
rng.sort()
df = pd.DataFrame(np.random.randint(10, size=(N,)), columns=['delivery'], index=rng)
series = df.groupby([df.index.date, 'delivery']).size()

Answer 3

Answered by Amir F

If you have the following dataframe (you can always reset the index if needed with df = df.reset_index()):

  Date  Del_Count  Del_Nb
0  1/1      14      19   <
1           11      17
2  2/2      25      29   <
3           21      27
4           22      28
5  3/3      34      36
6           37      37
7           31      39   <

To find the row with the maximum Del_Nb for each Date and extract the corresponding Del_Count, you can use:

df = df.loc[df.groupby(['Date'], sort=False)['Del_Nb'].idxmax()][['Date','Del_Count','Del_Nb']]

Which would yield:

 Date  Del_Count  Del_Nb
0  1/1         14      19
2  2/2         25      29
7  3/3         31      39
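
For reference, a self-contained sketch of this approach follows. It assumes the blank Date cells in the table above repeat the date printed above them, and the names top_rows and result are illustrative:

```python
import pandas as pd

# Assumption: blank Date cells in the table repeat the date above them.
df = pd.DataFrame(
    {
        "Date": ["1/1", "1/1", "2/2", "2/2", "2/2", "3/3", "3/3", "3/3"],
        "Del_Count": [14, 11, 25, 21, 22, 34, 37, 31],
        "Del_Nb": [19, 17, 29, 27, 28, 36, 37, 39],
    }
)

# idxmax returns the row label of the largest Del_Nb within each Date.
top_rows = df.groupby("Date", sort=False)["Del_Nb"].idxmax()

# Select those rows and the columns of interest in one .loc call.
result = df.loc[top_rows, ["Date", "Del_Count", "Del_Nb"]]
print(result)
```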
