Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42181022/

Time: 2020-09-14 02:57:39  Source: igfitidea

Selecting the first row of a sorted group from pandas data frame

Tags: python, pandas, numpy, dataframe, group-by

Asked by user1330974

Suppose I have a pandas dataframe like the one below:

campaignname    category_type    amount
A               cat_A_0            2.0
A               cat_A_1            1.0
A               cat_A_2            3.0
A               cat_A_2            3.0
A               cat_A_2            4.0
B               cat_B_0            3.0
C               cat_C_0            1.0
C               cat_C_1            2.0
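
To follow along, the table above can be rebuilt with a small constructor call. This is a minimal sketch: only the variable name df comes from the question, and the column dtypes (strings and floats) are assumed from how the table is printed.

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    "campaignname": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "category_type": ["cat_A_0", "cat_A_1", "cat_A_2", "cat_A_2", "cat_A_2",
                      "cat_B_0", "cat_C_0", "cat_C_1"],
    "amount": [2.0, 1.0, 3.0, 3.0, 4.0, 3.0, 1.0, 2.0],
})
print(df)
```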

I am using the following code to group the above dataframe (say it is assigned to the variable df) by different columns:

for name, gp in df.groupby('campaignname'):
    sorted_gp = gp.groupby(['campaignname', 'category_type']).sum().sort_values(['amount'], ascending=False)
    # I'd like to know how to select this in a cleaner/more concise way
    first_row = [sorted_gp.iloc[0].name[0], sorted_gp.iloc[0].name[1], sorted_gp.iloc[0].values.tolist()[0]]

The purpose of the above code is to first group the raw data by the campaignname column; then, within each resulting group, I'd like to group again by both campaignname and category_type, and finally sort by the amount column to pick the first row that comes up (the one with the highest amount in each group). Specifically, for the above example, I'd like to get results like this:

first_row = ['A', 'cat_A_2', 4.0] # for the first group
first_row = ['B', 'cat_B_0', 3.0] # for the second group
first_row = ['C', 'cat_C_1', 2.0] # for the third group

etc.
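
One detail worth checking: the loop above calls sum() per (campaignname, category_type) before sorting, so on this sample it would report 10.0 for cat_A_2 (3.0 + 3.0 + 4.0) rather than 4.0. If the per-category total is what is wanted, a loop-free sketch could look like the following. The names totals and top_by_total are illustrative and not from the original post.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "campaignname": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "category_type": ["cat_A_0", "cat_A_1", "cat_A_2", "cat_A_2", "cat_A_2",
                      "cat_B_0", "cat_C_0", "cat_C_1"],
    "amount": [2.0, 1.0, 3.0, 3.0, 4.0, 3.0, 1.0, 2.0],
})

# Sum amount per (campaignname, category_type), keeping the keys as columns.
totals = df.groupby(["campaignname", "category_type"], as_index=False)["amount"].sum()

# Sort by total (largest first), then keep the first row of each campaign.
top_by_total = (
    totals.sort_values("amount", ascending=False)
          .groupby("campaignname")
          .head(1)
          .sort_values("campaignname")
)
print(top_by_total.to_numpy().tolist())
```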

As you can see, I'm using a rather (in my opinion) 'ugly' way to retrieve the first row of each sorted group, but since I'm new to pandas, I don't know a better/cleaner way to accomplish this. If anyone could let me know a way to select the first row in a sorted group from a pandas dataframe, I'd greatly appreciate it. Thank you in advance for your answers/suggestions!

Answered by MaxU

IIUC you can do it this way:

In [83]: df.groupby('campaignname', as_index=False) \
           .apply(lambda x: x.nlargest(1, columns=['amount'])) \
           .reset_index(level=1, drop=True)
Out[83]:
  campaignname category_type  amount
0            A       cat_A_2     4.0
1            B       cat_B_0     3.0
2            C       cat_C_1     2.0

or:

In [76]: df.sort_values('amount', ascending=False).groupby('campaignname').head(1)
Out[76]:
  campaignname category_type  amount
4            A       cat_A_2     4.0
5            B       cat_B_0     3.0
7            C       cat_C_1     2.0
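
To get the plain lists the question asks for (for example ['A', 'cat_A_2', 4.0]), the result of the sort-then-head approach can be converted row by row. This is a minimal sketch assuming the sample data from the question; the names top and rows are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "campaignname": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "category_type": ["cat_A_0", "cat_A_1", "cat_A_2", "cat_A_2", "cat_A_2",
                      "cat_B_0", "cat_C_0", "cat_C_1"],
    "amount": [2.0, 1.0, 3.0, 3.0, 4.0, 3.0, 1.0, 2.0],
})

# Sort so the largest amount comes first, then take one row per campaign.
top = df.sort_values("amount", ascending=False).groupby("campaignname").head(1)

# Order by campaign and turn each row into a plain Python list.
rows = top.sort_values("campaignname").to_numpy().tolist()
for first_row in rows:
    print(first_row)
```

If two rows in the same campaign tie for the largest amount, which of them ends up first depends on the sort algorithm; passing kind="mergesort" to sort_values makes the sort stable, so ties keep their original order.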

Answered by piRSquared

My preferred way to do this is with idxmax, which returns the index label of the maximum value in each group. I then use those labels to select the matching rows from df:

df.loc[df.groupby('campaignname').amount.idxmax()]

  campaignname category_type  amount
4            A       cat_A_2     4.0
5            B       cat_B_0     3.0
7            C       cat_C_1     2.0
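
The same idea can be split into two steps to make each part visible. This is a minimal sketch assuming the sample data from the question; the names idx and result are illustrative, and bracket access (["amount"]) is used in place of attribute access (.amount), which selects the same column.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "campaignname": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "category_type": ["cat_A_0", "cat_A_1", "cat_A_2", "cat_A_2", "cat_A_2",
                      "cat_B_0", "cat_C_0", "cat_C_1"],
    "amount": [2.0, 1.0, 3.0, 3.0, 4.0, 3.0, 1.0, 2.0],
})

# Index label of the row holding the largest amount within each campaign.
idx = df.groupby("campaignname")["amount"].idxmax()

# Select those rows by label; this relies on df having unique index labels.
result = df.loc[idx]
print(result)
```

When several rows in a group share the maximum, idxmax returns the label of the first occurrence, so exactly one row per campaign is selected.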