pandas: selecting the second smallest value in a groupby

Question

Asked by midtownguru

I have an example DataFrame like the following:

import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,2,2,2,3,3], 'date':np.array(['2000-01-01','2002-01-01','2010-01-01','2003-01-01','2004-01-01','2008-01-01'],dtype='datetime64[D]')})

I am trying to get the second-earliest date in each ID group, so I wrote the following function:

def f(x):
    if len(x)==1:
        return x[0]
    else:
        x.sort()
        return x[1]

And then I wrote:

df.groupby('ID').date.apply(lambda x:f(x))

The result is an error.

Could you find a way to make this work?

Answer 1

Accepted answer by Jeff

This requires pandas 0.14.1 or later. It is quite efficient, especially if you have large groups, because it does not need to fully sort them.

In [32]: df.groupby('ID')['date'].nsmallest(2)
Out[32]: 
ID   
1   0   2000-01-01
2   1   2002-01-01
    3   2003-01-01
3   4   2004-01-01
    5   2008-01-01
dtype: datetime64[ns]

In [33]: df.groupby('ID')['date'].nsmallest(2).groupby(level='ID').last()
Out[33]: 
ID
1    2000-01-01
2    2003-01-01
3    2008-01-01
dtype: datetime64[ns]
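
For readers who want to run the two steps above as a script, here is a minimal sketch. It assumes a reasonably recent pandas release and rebuilds the example frame with pd.to_datetime instead of the question's np.array call. The last step, which drops groups with fewer than two rows, is an optional addition that this answer does not include; the names sizes and strict_second are illustrative only.

```python
import pandas as pd

# Same example data as in the question, built with pd.to_datetime.
df = pd.DataFrame({
    "ID": [1, 2, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2000-01-01", "2002-01-01", "2010-01-01",
        "2003-01-01", "2004-01-01", "2008-01-01",
    ]),
})

# Step 1: keep the two earliest dates of every ID (ascending within each group).
two_smallest = df.groupby("ID")["date"].nsmallest(2)

# Step 2: the last of those is the second-earliest date, or the only
# date when the group has a single row.
second_earliest = two_smallest.groupby(level="ID").last()

# Optional (not part of the answer): keep only groups that really
# have a second value.
sizes = df.groupby("ID").size()
strict_second = second_earliest[sizes >= 2]

print(second_earliest)
print(strict_second)
```

With the optional filter applied, ID 1 is left out because its group holds only one date.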

Answer 2

Answered by chrisb

Take a look at the indexing docs. In general, pandas defaults to indexing by label rather than by location, which is why you get a KeyError.

In your particular case you could use .iloc for location-based indexing.

In [266]: def f(x):
     ...:     if len(x)==1:
     ...:         return x.iloc[0]
     ...:     else:
     ...:         x = x.sort_values()  # Series.sort() was removed in later pandas versions
     ...:         return x.iloc[1]
     ...:     

In [267]: df.groupby('ID').date.apply(f)
Out[267]: 
ID
1    2000-01-01
2    2003-01-01
3    2008-01-01
Name: date, dtype: datetime64[ns]
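
To see the label-versus-location difference in isolation, the following sketch looks only at the rows for ID 2 from the question's data. Those rows carry the index labels 1, 2 and 3, so there is no label 0 to look up, whereas .iloc[0] always refers to the first position. The variable names are illustrative, and the frame is rebuilt with pd.to_datetime for convenience.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2000-01-01", "2002-01-01", "2010-01-01",
        "2003-01-01", "2004-01-01", "2008-01-01",
    ]),
})

# The dates belonging to ID 2; their index labels are 1, 2 and 3.
group = df.loc[df["ID"] == 2, "date"]

# group[0] is a label lookup on an integer index, and label 0 is absent.
raised_key_error = False
try:
    group[0]
except KeyError:
    raised_key_error = True

# Positional access works regardless of the index labels.
first_by_position = group.iloc[0]

# Sorting first makes position 1 the second-earliest date.
second = group.sort_values().iloc[1]

print(raised_key_error, first_by_position, second)
```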

Answer 3

Answered by scottlittle

You may not want to return the first and only value as the second value, as the accepted answer does (i.e., 2000-01-01 is not the second value, but the only value). If so, you can rank the rows within each group and then select the first, second, third, etc. smallest value more generically:

df['rank'] = df.sort_values('date').groupby('ID').cumcount()+1

For the second smallest value:

df[df['rank'] == 2]

This returns:

ID  date        rank
2   2003-01-01  2
3   2008-01-01  2
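
The same idea can be wrapped in a small helper so that n is a parameter. This is only a sketch: nth_smallest_per_group is a hypothetical name rather than a pandas function, and it assumes the column names ID and date from the question. Unlike the one-liner above, it does not add a rank column to the original frame.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2000-01-01", "2002-01-01", "2010-01-01",
        "2003-01-01", "2004-01-01", "2008-01-01",
    ]),
})

def nth_smallest_per_group(frame, n):
    """Return the rows holding the n-th earliest date of each ID (n starts at 1).

    Hypothetical helper; groups with fewer than n rows are omitted.
    """
    ordered = frame.sort_values("date")
    # cumcount numbers the rows of each group 0, 1, 2, ... in sorted order.
    position = ordered.groupby("ID").cumcount() + 1
    return ordered[position == n]

second = nth_smallest_per_group(df, 2)
third = nth_smallest_per_group(df, 3)
print(second)
print(third)
```

ID 1 never appears in the result for n = 2 because its group has only one row, and only ID 2 has a third value.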
