Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/41974374/

Time: 2020-09-14 02:53:42  Source: igfitidea

Pandas rank by multiple columns

Tags: python, python-3.x, pandas, rank

Asked by Anoop

I am trying to rank a pandas DataFrame based on two columns. I can rank it based on one column, but how can I rank it based on two columns: 'SaleCount' first, then 'TotalRevenue'?

import pandas as pd

df = pd.DataFrame({'TotalRevenue':[300,9000,1000,750,500,2000,0,600,50,500],
    'Date':['2016-12-02' for i in range(10)],
    'SaleCount':[10,100,30,35,20,100,0,30,2,20],
    'shops':['S3','S2','S1','S5','S4','S8','S6','S7','S9','S10']})

df['Rank'] = df.SaleCount.rank(method='dense',ascending = False).astype(int)

#df['Rank'] = df.TotalRevenue.rank(method='dense',ascending = False).astype(int)
df.sort_values(['Rank'], inplace=True)

print(df)

current output:

    Date        SaleCount   TotalRevenue    shops   Rank
1   2016-12-02  100          9000            S2      1
5   2016-12-02  100          2000            S8      1
3   2016-12-02  35           750             S5      2
2   2016-12-02  30           1000            S1      3
7   2016-12-02  30           600             S7      3
9   2016-12-02  20           500             S10     4
4   2016-12-02  20           500             S4      4
0   2016-12-02  10           300             S3      5
8   2016-12-02  2            50              S9      6
6   2016-12-02  0            0               S6      7

I'm trying to generate an output like this:

    Date        SaleCount   TotalRevenue    shops   Rank
1   2016-12-02  100          9000            S2      1
5   2016-12-02  100          2000            S8      2
3   2016-12-02  35           750             S5      3
2   2016-12-02  30           1000            S1      4
7   2016-12-02  30           600             S7      5
9   2016-12-02  20           500             S10     6
4   2016-12-02  20           500             S4      6
0   2016-12-02  10           300             S3      7
8   2016-12-02  2            50              S9      8
6   2016-12-02  0            0               S6      9

Answered by piRSquared

pd.factorize will generate a unique integer code for each unique element of an iterable. We only need to sort in the order we'd like, then factorize. To handle multiple columns, we convert the sorted rows to tuples.

cols = ['SaleCount', 'TotalRevenue']
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
f, i = pd.factorize(tups)
factorized = pd.Series(f + 1, tups.index)

df.assign(Rank=factorized)

         Date  SaleCount  TotalRevenue shops  Rank
1  2016-12-02        100          9000    S2     1
5  2016-12-02        100          2000    S8     2
3  2016-12-02         35           750    S5     3
2  2016-12-02         30          1000    S1     4
7  2016-12-02         30           600    S7     5
4  2016-12-02         20           500    S4     6
9  2016-12-02         20           500   S10     6
0  2016-12-02         10           300    S3     7
8  2016-12-02          2            50    S9     8
6  2016-12-02          0             0    S6     9
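
As a rough, self-contained sketch of the steps above (the Date column is left out for brevity, and the names codes, uniques and rank_by_shop are illustrative, not from the answer), the sort-then-factorize idea can be checked end to end like this:

```python
import pandas as pd

df = pd.DataFrame({
    'TotalRevenue': [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
    'SaleCount': [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    'shops': ['S3', 'S2', 'S1', 'S5', 'S4', 'S8', 'S6', 'S7', 'S9', 'S10'],
})

cols = ['SaleCount', 'TotalRevenue']
# Sort by both keys (largest first), then turn each row into a hashable tuple
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, axis=1)
# factorize numbers the distinct tuples 0, 1, 2, ... in order of first appearance
codes, uniques = pd.factorize(tups)
# Re-attach the sorted index so the ranks align with the original rows
df['Rank'] = pd.Series(codes + 1, index=tups.index)

# Illustrative helper: look up the rank of each shop
rank_by_shop = df.set_index('shops')['Rank'].to_dict()
print(rank_by_shop)
```

Because the rows are sorted before factorizing, equal (SaleCount, TotalRevenue) pairs sit next to each other and receive the same code, which is what makes the result a dense rank.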

Answered by Nickil Maveli

Another way would be to type-cast both columns of interest to str and combine them by concatenation. Convert the result back to numerical values so that the rows can be ordered by magnitude.

With method='dense', duplicated values share the same rank (here: 6).

Since you want to rank these in descending order, specifying ascending=False in Series.rank() gives the desired result.

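col1 = df["SaleCount"].astype(str)
# Zero-pad the second key to a common width; without padding,
# "30" + "1000" (301000) would outrank "35" + "750" (35750).
width = int(df["TotalRevenue"].astype(str).str.len().max())
col2 = df["TotalRevenue"].astype(str).str.zfill(width)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')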
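
One thing to watch with string concatenation is the width of the second key. The following is a minimal sketch on two rows taken from the question's data (the names unpadded, padded and width are illustrative), showing that the combined number only preserves the intended order when the second column is padded to a common width. It assumes nonnegative integers.

```python
import pandas as pd

# Two rows from the question: (35, 750) should rank above (30, 1000)
df = pd.DataFrame({
    'SaleCount': [35, 30],
    'TotalRevenue': [750, 1000],
})

col1 = df['SaleCount'].astype(str)
col2 = df['TotalRevenue'].astype(str)

# Unpadded: "35" + "750" and "30" + "1000" have different lengths
unpadded = (col1 + col2).astype(int)

# Padded: every TotalRevenue string gets the same number of digits
width = int(col2.str.len().max())
padded = (col1 + col2.str.zfill(width)).astype(int)

print(unpadded.tolist())  # [35750, 301000]  -> second row wrongly larger
print(padded.tolist())    # [350750, 301000] -> first row larger, as intended
```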

Answered by B. M.

The generic way to do that is to group the desired fields in a tuple, whatever their types.

df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
             .rank(method='dense',ascending=False).astype(int)

df.sort_values("Rank")

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9
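
To make the tuple idea concrete, here is a small sketch that builds the dense rank by hand instead of calling Series.rank on tuples. It is not the answer's code; keys, ordered and rank_of are illustrative names. It relies only on the fact that Python compares tuples element by element, left to right.

```python
import pandas as pd

df = pd.DataFrame({
    'TotalRevenue': [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
    'SaleCount': [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    'shops': ['S3', 'S2', 'S1', 'S5', 'S4', 'S8', 'S6', 'S7', 'S9', 'S10'],
})

# One (SaleCount, TotalRevenue) tuple per row
keys = list(zip(df['SaleCount'], df['TotalRevenue']))
# Distinct keys, largest first; position in this list is the dense rank
ordered = sorted(set(keys), reverse=True)
rank_of = {key: pos for pos, key in enumerate(ordered, start=1)}
df['Rank'] = [rank_of[key] for key in keys]

print(df.sort_values('Rank'))
```

Since the keys are compared as tuples, this works for any mix of column types that support ordering, at the cost of a Python-level loop.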

Answered by ALollz

sort_values + GroupBy.ngroup

This will give the dense ranking.

The DataFrame should be sorted by the columns in the desired order prior to the groupby. Specifying sort=False within the groupby then respects this sorting, so that groups are labeled in the order they appear within the sorted DataFrame.

cols = ['SaleCount', 'TotalRevenue']
df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1


Output:

print(df.sort_values('Rank'))

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9
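
The role of sort=False can be seen in a short comparison. This is a sketch only, with the Date column omitted and RankAscending as a hypothetical column name added for illustration. With the default sort=True, groups are numbered by ascending key regardless of how the rows were sorted beforehand.

```python
import pandas as pd

df = pd.DataFrame({
    'TotalRevenue': [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
    'SaleCount': [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    'shops': ['S3', 'S2', 'S1', 'S5', 'S4', 'S8', 'S6', 'S7', 'S9', 'S10'],
})
cols = ['SaleCount', 'TotalRevenue']

# Sort first, then group without re-sorting: groups are numbered
# in the order they appear in the sorted frame (largest keys first)
desc = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1

# Default sort=True: groups are numbered by ascending key instead
asc = df.groupby(cols).ngroup() + 1

df['Rank'] = desc           # assignment aligns on the index
df['RankAscending'] = asc   # hypothetical column, for comparison only

print(df.sort_values('Rank'))
```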

Answered by smci

(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)

However, here's a shortcut if you know that TotalRevenue is constrained to some range, e.g. 0 to MAX_REVENUE=100,000; directly manipulate them as nonnegative integers:

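MAX_REVENUE = 100_000  # assumed inclusive upper bound on TotalRevenue
# Multiply by MAX_REVENUE + 1 so a revenue equal to the bound cannot collide with the next SaleCount
df['Rank'] = (df['SaleCount']*(MAX_REVENUE + 1) + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)

df.sort_values('Rank')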
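
A self-contained sketch of this arithmetic shortcut follows. The bound MAX_REVENUE = 100_000 comes from the example range above. Multiplying by MAX_REVENUE + 1 rather than MAX_REVENUE is an assumption made here so that a revenue exactly equal to the bound cannot collide with the next SaleCount value, and the name key is illustrative.

```python
import pandas as pd

MAX_REVENUE = 100_000  # assumed inclusive upper bound on TotalRevenue

df = pd.DataFrame({
    'TotalRevenue': [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
    'SaleCount': [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    'shops': ['S3', 'S2', 'S1', 'S5', 'S4', 'S8', 'S6', 'S7', 'S9', 'S10'],
})

# The shortcut is only valid if every revenue lies inside the assumed range
assert df['TotalRevenue'].between(0, MAX_REVENUE).all()

# SaleCount occupies the "high digits", TotalRevenue the "low digits"
key = df['SaleCount'] * (MAX_REVENUE + 1) + df['TotalRevenue']
df['Rank'] = key.rank(method='dense', ascending=False).astype(int)

print(df.sort_values('Rank'))
```

This keeps everything in vectorized integer arithmetic, but it depends on the bound being correct. A revenue above MAX_REVENUE would silently spill into the SaleCount part of the key.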