Python Pandas: get the top n records within each group

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/20069009/

Time: 2020-08-18 19:25:44  Source: igfitidea

Pandas get topmost n records within each group

Tags: python, pandas, greatest-n-per-group, window-functions, top-n

Asked by Roman Pekar

Suppose I have a pandas DataFrame like this:

>>> import pandas as pd
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it by numbering the records within each group after a groupby:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there a more efficient/elegant way to do this? And is there also a more elegant way to number the records within each group (like the SQL window function row_number())?

Accepted answer by dorvak

Did you try df.groupby('id').head(2)?

Output generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to sort first, depending on your data.)
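The sorting caveat above can be sketched as follows. This is a minimal illustration, assuming "top" means the largest value within each id; the variable name top2 is only for this example.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Sort so that the largest values come first within each id,
# then keep the first 2 rows of every group.
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2))
print(top2)
```

On recent pandas versions the result keeps the original row labels (2, 1, 6, 5, 7, 8 here); append .reset_index(drop=True) if a fresh index is wanted.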

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the MultiIndex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
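The question also asks for something like SQL's row_number(). The answer above does not cover that part; one possible sketch uses GroupBy.cumcount(), which numbers rows within each group starting from 0. The column name row_number is an assumption chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() numbers rows 0, 1, 2, ... within each group, following the
# frame's current row order; add 1 for 1-based numbering like row_number().
df['row_number'] = df.groupby('id').cumcount() + 1

# Keeping rows numbered 1 or 2 selects the same rows as head(2).
top2 = df[df['row_number'] <= 2]
print(top2)
```

Because the numbering follows the current row order, sort the frame first if the numbering should reflect a particular ordering.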

Answer by LondonRob

Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64

There's a slight weirdness in that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
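Dropping the inner index level can be sketched like this, using the example data from the question. The names top2 and flat are only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Two-level index: the group key (id) and the original row label.
top2 = df.groupby('id')['value'].nlargest(2)

# Drop the second level (the original row labels), keeping only id.
flat = top2.reset_index(level=1, drop=True)
print(flat)
```

The result is a Series indexed by id only, so repeated id labels appear once per retained row.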

(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too, but for now it only works with Series and SeriesGroupBy.)

Answer by Chaffee Chen

Sometimes sorting the whole dataset ahead of time is very time-consuming. We can group first and then take the top k within each group:

topk = 2  # assumed value for illustration: number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
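A possible variation on the same idea, not taken from the answers above: let nlargest run per group on the value column only, then use the original row labels it returns to pull the full rows. This sketch assumes the DataFrame index is unique; topk and result are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

topk = 2  # number of rows to keep per group (illustrative value)

# Per-group top-k on the value column; the second index level holds
# the original row labels of the selected rows.
labels = df.groupby('id')['value'].nlargest(topk).index.get_level_values(1)

# Select the complete rows by label (assumes the index is unique).
result = df.loc[labels]
print(result)
```

This avoids calling apply with a Python lambda, at the cost of relying on unique row labels.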