Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24943902/

Time: 2020-09-13 22:17:39  Source: igfitidea

Python pandas: select 2nd smallest value in groupby

python pandas

Asked by midtownguru

I have an example DataFrame like the following:

import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': [1, 2, 2, 2, 3, 3], 'date': np.array(['2000-01-01', '2002-01-01', '2010-01-01', '2003-01-01', '2004-01-01', '2008-01-01'], dtype='datetime64[D]')})

I am trying to get the 2nd earliest date in each ID group, so I wrote the following function:

def f(x):
    if len(x)==1:
        return x[0]
    else:
        x.sort()
        return x[1]

And then I wrote:

df.groupby('ID').date.apply(lambda x:f(x))

The result is an error.

Could you find a way to make this work?

Accepted answer by Jeff

This requires pandas 0.14.1. It will be quite efficient, especially if you have large groups (as it doesn't require fully sorting them).

In [32]: df.groupby('ID')['date'].nsmallest(2)
Out[32]: 
ID   
1   0   2000-01-01
2   1   2002-01-01
    3   2003-01-01
3   4   2004-01-01
    5   2008-01-01
dtype: datetime64[ns]

In [33]: df.groupby('ID')['date'].nsmallest(2).groupby(level='ID').last()
Out[33]: 
ID
1    2000-01-01
2    2003-01-01
3    2008-01-01
dtype: datetime64[ns]
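
For readers who want to run the accepted approach as a script rather than in an IPython session, here is a minimal sketch. It rebuilds the question's DataFrame with pd.to_datetime (a substitution for the original np.array call) and assumes a pandas version where SeriesGroupBy.nsmallest is available. The variable names two_smallest and second are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 2, 3, 3],
    'date': pd.to_datetime(['2000-01-01', '2002-01-01', '2010-01-01',
                            '2003-01-01', '2004-01-01', '2008-01-01']),
})

# Keep at most the two earliest dates per ID. The result is indexed by
# (ID, original row label) and is in ascending order within each ID.
two_smallest = df.groupby('ID')['date'].nsmallest(2)

# The last of those values is the 2nd smallest, or the only value
# when the group has a single row (as for ID 1).
second = two_smallest.groupby(level='ID').last()
print(second)
```

Note that a single-row group falls back to its only value, which matches the behaviour of the function in the question.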

Answered by chrisb

Take a look at the indexing docs: in general, pandas defaults to indexing by label rather than by location, which is why you get a KeyError.

In your particular case you could use .iloc for location-based indexing.

In [266]: def f(x):
     ...:     if len(x)==1:
     ...:         return x.iloc[0]
     ...:     else:
     ...:         x = x.sort_values()  # Series.sort() was removed in later pandas
     ...:         return x.iloc[1]
     ...:     

In [267]: df.groupby('ID').date.apply(f)
Out[267]: 
ID
1    2000-01-01
2    2003-01-01
3    2008-01-01
Name: date, dtype: datetime64[ns]
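
The label versus location distinction can be seen on a single group in isolation. The sketch below builds a Series shaped like the ID == 2 group from the question (index labels 1, 2, 3); the names s, raised, first_by_position and second_smallest are made up for illustration.

```python
import pandas as pd

# One group's dates, carrying the original row labels 1, 2, 3 (no label 0).
s = pd.Series(pd.to_datetime(['2002-01-01', '2010-01-01', '2003-01-01']),
              index=[1, 2, 3])

try:
    s[0]              # label lookup on an integer index: label 0 is absent
    raised = False
except KeyError:
    raised = True

first_by_position = s.iloc[0]               # positional lookup: first element
second_smallest = s.sort_values().iloc[1]   # sort a copy, then take position 1

print(raised, first_by_position, second_smallest)
```

Because sort_values returns a new Series, the group passed in by groupby is left unmodified.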

Answered by scottlittle

You may not want to return the first and only value as the second value, as the accepted answer does (i.e., 2000-01-01 is not the second value, but the only value). If so, you can rank the rows within each group and then select the first, second, third, etc. smallest value more generically:

df['rank'] = df.sort_values('date').groupby('ID').cumcount()+1

For the second smallest value:

df[df['rank'] == 2]

This returns:

ID  date        rank
2   2003-01-01  2
3   2008-01-01  2
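
The ranking idea above could be wrapped in a small function so that n becomes a parameter. This is only a sketch: nth_smallest and its key/value arguments are hypothetical names, not part of pandas, and ties are broken by sort order since cumcount simply numbers the rows.

```python
import pandas as pd

def nth_smallest(df, n, key='ID', value='date'):
    """Return the rows holding the n-th smallest `value` in each `key` group.

    Hypothetical helper; groups with fewer than n rows are omitted.
    """
    # Number the rows 1, 2, 3, ... within each group in ascending order.
    rank = df.sort_values(value).groupby(key).cumcount() + 1
    ranked = df.assign(rank=rank)  # assign aligns the ranks on the index
    return ranked[ranked['rank'] == n]

df = pd.DataFrame({
    'ID': [1, 2, 2, 2, 3, 3],
    'date': pd.to_datetime(['2000-01-01', '2002-01-01', '2010-01-01',
                            '2003-01-01', '2004-01-01', '2008-01-01']),
})

second = nth_smallest(df, 2)
third = nth_smallest(df, 3)
print(second)
print(third)
```

Unlike the accepted answer, ID 1 does not appear in the result for n = 2, because that group has only one row.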