Pandas rank by multiple columns
Original question: http://stackoverflow.com/questions/41974374/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Anoop
I am trying to rank a pandas data frame based on two columns. I can rank it based on one column, but how can I rank it based on two columns: first 'SaleCount', then 'TotalRevenue'?
import pandas as pd
df = pd.DataFrame({'TotalRevenue':[300,9000,1000,750,500,2000,0,600,50,500],
'Date':['2016-12-02' for i in range(10)],
'SaleCount':[10,100,30,35,20,100,0,30,2,20],
'shops':['S3','S2','S1','S5','S4','S8','S6','S7','S9','S10']})
df['Rank'] = df.SaleCount.rank(method='dense',ascending = False).astype(int)
#df['Rank'] = df.TotalRevenue.rank(method='dense',ascending = False).astype(int)
df.sort_values(['Rank'], inplace=True)
print(df)
Current output:
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 1
3 2016-12-02 35 750 S5 2
2 2016-12-02 30 1000 S1 3
7 2016-12-02 30 600 S7 3
9 2016-12-02 20 500 S10 4
4 2016-12-02 20 500 S4 4
0 2016-12-02 10 300 S3 5
8 2016-12-02 2 50 S9 6
6 2016-12-02 0 0 S6 7
I'm trying to generate an output like this:
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 2
3 2016-12-02 35 750 S5 3
2 2016-12-02 30 1000 S1 4
7 2016-12-02 30 600 S7 5
9 2016-12-02 20 500 S10 6
4 2016-12-02 20 500 S4 6
0 2016-12-02 10 300 S3 7
8 2016-12-02 2 50 S9 8
6 2016-12-02 0 0 S6 9
Answered by piRSquared
pd.factorize assigns an integer code to each distinct element of an iterable, in order of first appearance. We only need to sort in the order we'd like, then factorize. To handle multiple columns, we convert each row of the sorted result to a tuple.
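cols = ['SaleCount', 'TotalRevenue']
# Sort by both columns (highest first), then turn each row into a tuple
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, axis=1)
# factorize numbers the tuples in order of first appearance, starting at 0
f, i = pd.factorize(tups)
# Shift to 1-based ranks and keep the original index for alignment
factorized = pd.Series(f + 1, index=tups.index)
df.assign(Rank=factorized)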
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 2
3 2016-12-02 35 750 S5 3
2 2016-12-02 30 1000 S1 4
7 2016-12-02 30 600 S7 5
4 2016-12-02 20 500 S4 6
9 2016-12-02 20 500 S10 6
0 2016-12-02 10 300 S3 7
8 2016-12-02 2 50 S9 8
6 2016-12-02 0 0 S6 9
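To see what pd.factorize returns on its own, here is a minimal sketch on a single toy column (the values are made up for illustration and are already sorted). Because codes follow first appearance, adding 1 to them gives a dense rank.

```python
import pandas as pd

# Toy values, already sorted in the order ranks should be handed out.
sorted_counts = pd.Series([100, 100, 35, 30, 30, 20])

# factorize returns (codes, uniques); codes start at 0 and follow first appearance.
codes, uniques = pd.factorize(sorted_counts)

dense_rank = codes + 1
print(dense_rank.tolist())   # [1, 1, 2, 3, 3, 4]
print(uniques.tolist())      # [100, 35, 30, 20]
```

Equal values always receive the same code, which is why ties end up sharing a rank.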
Answered by Nickil Maveli
Another way would be to cast both columns of interest to str and combine them by concatenation, then convert the result back to numerical values so that rows can be compared by magnitude. For the concatenated numbers to compare correctly, the second column must first be zero-padded to a common width; otherwise 35 and 750 ("35750") would sort below 30 and 1000 ("301000").
With method='dense', duplicated values share the same rank. (Here: 6)
Since you want to rank in descending order, specifying ascending=False in Series.rank() gives the desired result.
col1 = df["SaleCount"].astype(str)
col2 = df["TotalRevenue"].astype(str)
# Zero-pad so every revenue has the same number of digits
col2 = col2.str.zfill(int(col2.str.len().max()))
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')
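One caveat worth checking before relying on concatenation: the combined numbers only compare correctly when the second column has the same number of digits in every row. The sketch below uses two rows from the question's data to show the misordering, with zero-padding via Series.str.zfill as one possible fix. The variable names are illustrative, and the approach assumes nonnegative integers.

```python
import pandas as pd

df = pd.DataFrame({'SaleCount': [35, 30], 'TotalRevenue': [750, 1000]})

left = df['SaleCount'].astype(str)
right = df['TotalRevenue'].astype(str)

# Naive concatenation: "35750" vs "301000", so the row with fewer sales wins.
naive = (left + right).astype(int)
print(naive.tolist())    # [35750, 301000]

# Pad the second column to a common width so the digits line up.
width = int(right.str.len().max())
padded = (left + right.str.zfill(width)).astype(int)
print(padded.tolist())   # [350750, 301000]
```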
Answered by B. M.
The generic way to do that is to group the desired fields in a tuple, whatever their types.
df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
.rank(method='dense',ascending=False).astype(int)
df.sort_values("Rank")
TotalRevenue Date SaleCount shops Rank
1 9000 2016-12-02 100 S2 1
5 2000 2016-12-02 100 S8 2
3 750 2016-12-02 35 S5 3
2 1000 2016-12-02 30 S1 4
7 600 2016-12-02 30 S7 5
4 500 2016-12-02 20 S4 6
9 500 2016-12-02 20 S10 6
0 300 2016-12-02 10 S3 7
8 50 2016-12-02 2 S9 8
6 0 2016-12-02 0 S6 9
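The tuple approach relies on Python comparing tuples lexicographically: first elements are compared first, and later elements only break ties. Here is a minimal plain-Python sketch of the same dense ranking on a few rows from the question. The helper name dense_rank_desc is made up for illustration and is not part of pandas.

```python
def dense_rank_desc(rows):
    """Dense rank (1 = largest) of each tuple in rows."""
    # Distinct tuples, largest first; tuples compare element by element.
    ordered = sorted(set(rows), reverse=True)
    position = {row: i + 1 for i, row in enumerate(ordered)}
    return [position[row] for row in rows]

# (SaleCount, TotalRevenue) pairs taken from the question's data.
rows = [(10, 300), (100, 9000), (30, 1000), (35, 750),
        (20, 500), (100, 2000), (20, 500)]
print(dense_rank_desc(rows))   # [6, 1, 4, 3, 5, 2, 5]
```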
Answered by ALollz
sort_values + GroupBy.ngroup
This will give the dense ranking.
Columns should be sorted in the desired order prior to the groupby. Specifying sort=False within the groupby then respects this sorting, so that groups are labeled in the order they appear within the sorted DataFrame.
cols = ['SaleCount', 'TotalRevenue']
df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1
Output:
print(df.sort_values('Rank'))
TotalRevenue Date SaleCount shops Rank
1 9000 2016-12-02 100 S2 1
5 2000 2016-12-02 100 S8 2
3 750 2016-12-02 35 S5 3
2 1000 2016-12-02 30 S1 4
7 600 2016-12-02 30 S7 5
4 500 2016-12-02 20 S4 6
9 500 2016-12-02 20 S10 6
0 300 2016-12-02 10 S3 7
8 50 2016-12-02 2 S9 8
6 0 2016-12-02 0 S6 9
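To illustrate why sort=False matters here, the following sketch runs the same grouping with and without it on a five-row subset of the question's data. The names ordered, kept and resorted are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({'SaleCount': [10, 100, 30, 100, 30],
                   'TotalRevenue': [300, 9000, 1000, 2000, 600]})
cols = ['SaleCount', 'TotalRevenue']
ordered = df.sort_values(cols, ascending=False)

# sort=False: groups are numbered in order of appearance in `ordered`.
kept = ordered.groupby(cols, sort=False).ngroup() + 1
# Default sort=True: groups are numbered by ascending key, undoing our ordering.
resorted = ordered.groupby(cols).ngroup() + 1

print(kept.sort_index().tolist())      # [5, 1, 3, 2, 4]
print(resorted.sort_index().tolist())  # [1, 5, 3, 4, 2]
```

With the default setting the largest pair ends up with the highest number, which is the reverse of the ranking wanted in the question.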
Answered by smci
(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)
However, here's a shortcut if you know that TotalRevenue is constrained to some range, e.g. 0 to MAX_REVENUE = 100,000: directly manipulate them as nonnegative integers:
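MAX_REVENUE = 100_000  # assumed inclusive upper bound on TotalRevenue
# Multiply by MAX_REVENUE + 1 so a revenue equal to the bound cannot spill into the next SaleCount
df['Rank'] = (df['SaleCount']*(MAX_REVENUE + 1) + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')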
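The arithmetic behind this shortcut is a positional encoding: if every TotalRevenue lies between 0 and MAX_REVENUE inclusive, then multiplying SaleCount by a base larger than MAX_REVENUE and adding TotalRevenue orders rows by SaleCount first and TotalRevenue second. Here is a small sketch on made-up rows, where base is a name introduced for illustration and the bound is an assumption about the data.

```python
import pandas as pd

MAX_REVENUE = 100_000          # assumed inclusive upper bound on TotalRevenue
base = MAX_REVENUE + 1         # strictly larger than any revenue value

df = pd.DataFrame({'SaleCount': [1, 0, 1],
                   'TotalRevenue': [0, 100_000, 500]})

# Combined key: SaleCount occupies the "high digits", TotalRevenue the low ones.
key = df['SaleCount'] * base + df['TotalRevenue']
print(key.tolist())    # [100001, 100000, 100501]

rank = key.rank(method='dense', ascending=False).astype(int)
print(rank.tolist())   # [2, 3, 1]
```

With a base of exactly MAX_REVENUE, the first two rows would both map to 100000 and tie, even though their SaleCount values differ.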