Python Pandas: get the top n records within each group

Question

Asked by Roman Pekar

Suppose I have a pandas DataFrame like this:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it by numbering the records within each group after the group by:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there a more efficient or elegant way to do this? Also, is there a more elegant way to number the records within each group (like the SQL window function row_number())?

Answer 1

Accepted answer by dorvak

Did you try df.groupby('id').head(2)?

Output generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to sort first, depending on your data.)

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the MultiIndex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
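The note about sorting and the question's request for a row_number()-style numbering can be illustrated with a minimal sketch. It assumes a pandas version that provides sort_values and GroupBy.cumcount; the names top2, row_number and first2 are made up for this example and do not come from the answers.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Sort so the largest values come first within each id,
# then keep the first 2 rows of every group.
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2)
          .reset_index(drop=True))

# cumcount() numbers the rows within each group starting at 0,
# similar in spirit to SQL's row_number() (which starts at 1).
row_number = df.groupby('id').cumcount()

# Keeping rows numbered 0 and 1 selects the first 2 rows per id
# in the original order, without sorting.
first2 = df[row_number < 2]
```

Here top2 holds the two largest values per id, while first2 reproduces the "first two rows per id" result shown in the question.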

Answer 2

Answered by LondonRob

Since 0.14.1, you can call nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64

There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you're not interested in it, you can call .reset_index(level=1, drop=True) to get rid of it altogether.

(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too, but for now it only works with Series and SeriesGroupBy.)
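As a rough sketch of the two options described above (dropping the original index, or making use of it), the following uses the sample data from the question. The variable names top, by_id and rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Top 2 values per id; the result carries a MultiIndex of
# (id, original row label).
top = df.groupby('id')['value'].nlargest(2)

# Option 1: drop the inner level so the result is indexed by id only.
by_id = top.reset_index(level=1, drop=True)

# Option 2: use the original row labels to pull the full rows from df.
rows = df.loc[top.index.get_level_values(1)]
```

The second option is one case where keeping the original index is useful: it lets you recover every column of the selected rows, not just the value column.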

Answer 3

Answered by Chaffee Chen

Sometimes sorting the whole dataset up front is very time consuming. We can group first and then take the top k rows within each group:

topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
