Pandas - rank() 函数的替代方法,为列提供唯一的序数等级
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/39707080/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas - Alternative to rank() function that gives unique ordinal ranks for a column
提问by Fluous
At this moment I am writing a Python script that aggregates data from multiple Excel sheets. The module I choose to use is Pandas, because of its speed and ease of use with Excel files. The question is only related to the use of Pandas and me trying to create a additional column that contains unique, integer-only, ordinalranks within a group.
目前我正在编写一个 Python 脚本,用于聚合来自多个 Excel 工作表的数据。我选择使用的模块是 Pandas,因为它的速度和 Excel 文件的易用性。该问题仅与 Pandas 的使用有关,我试图创建一个额外的列,该列包含组内唯一的、仅限整数的、有序的等级。
My Python and Pandas knowledge is limited as I am just a beginner.
我的 Python 和 Pandas 知识有限,因为我只是一个初学者。
The Goal
目标
I am trying to achieve the following data structure. Where the top 10 adwords ads are ranked vertically on the basis of their position in Google. In order to do this I need to create a column in the original data (see Table 2 & 3) with a integer-only ranking that contains no duplicate values.
我正在尝试实现以下数据结构。排名前 10 的 AdWords 广告根据其在 Google 中的位置进行垂直排名。为了做到这一点,我需要在原始数据中创建一列(参见表 2 和 3),其中包含不包含重复值的仅整数排名。
Table 1: Data structure I am trying to achieve
表 1:我试图实现的数据结构
device , weeks , rank_1 , rank_2 , rank_3 , rank_4 , rank_5
mobile , wk 1 , string , string , string , string , string
mobile , wk 2 , string , string , string , string , string
computer, wk 1 , string , string , string , string , string
computer, wk 2 , string , string , string , string , string
The Problem
问题
The exact problem I run into is not being able to efficiently rank the rows with pandas. I have tried a number of things, but I cannot seem to get it ranked in this way.
我遇到的确切问题是无法使用 Pandas 有效地对行进行排名。我尝试了很多东西,但我似乎无法以这种方式对其进行排名。
Table 2: Data structure I have
表 2:我拥有的数据结构
weeks device , website , ranking , adtext
wk 1 mobile , url1 , *2.1 , string
wk 1 mobile , url2 , *2.1 , string
wk 1 mobile , url3 , 1.0 , string
wk 1 mobile , url4 , 2.9 , string
wk 1 desktop , *url5 , 2.1 , string
wk 1 desktop , url2 , *1.5 , string
wk 1 desktop , url3 , *1.5 , string
wk 1 desktop , url4 , 2.9 , string
wk 2 mobile , url1 , 2.0 , string
wk 2 mobile , *url6 , 2.1 , string
wk 2 mobile , url3 , 1.0 , string
wk 2 mobile , url4 , 2.9 , string
wk 2 desktop , *url5 , 2.1 , string
wk 2 desktop , url2 , *2.9 , string
wk 2 desktop , url3 , 1.0 , string
wk 2 desktop , url4 , *2.9 , string
Table 3: The table I cannot seem to create
表 3:我似乎无法创建的表
weeks device , website , ranking , adtext , ranking
wk 1 mobile , url1 , *2.1 , string , 2
wk 1 mobile , url2 , *2.1 , string , 3
wk 1 mobile , url3 , 1.0 , string , 1
wk 1 mobile , url4 , 2.9 , string , 4
wk 1 desktop , *url5 , 2.1 , string , 3
wk 1 desktop , url2 , *1.5 , string , 1
wk 1 desktop , url3 , *1.5 , string , 2
wk 1 desktop , url4 , 2.9 , string , 4
wk 2 mobile , url1 , 2.0 , string , 2
wk 2 mobile , *url6 , 2.1 , string , 3
wk 2 mobile , url3 , 1.0 , string , 1
wk 2 mobile , url4 , 2.9 , string , 4
wk 2 desktop , *url5 , 2.1 , string , 2
wk 2 desktop , url2 , *2.9 , string , 3
wk 2 desktop , url3 , 1.0 , string , 1
wk 2 desktop , url4 , *2.9 , string , 4
The standard .rank(ascending=True), gives averages on duplicate values. But since I use these ranks to organize them vertically this does not work out.
标准 .rank(ascending=True) 给出重复值的平均值。但是因为我使用这些等级来垂直组织它们,所以这是行不通的。
df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])
df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( ascending=True)
The .rank(method="dense", ascending=True) maintains duplicate values and also does not solve my problem
.rank(method="dense", Ascending=True) 维护重复值,也没有解决我的问题
df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])
df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="dense", ascending=True)
The .rank(method="first", ascending=True) throws a ValueError
.rank(method="first", Ascending=True) 抛出 ValueError
df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])
df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="first", ascending=True)
ADDENDUM: If I would find a way to add the rankings in a column, I would then use pivot to transpose the table in the following way.
附录:如果我想找到一种在列中添加排名的方法,那么我将使用数据透视按以下方式转置表格。
df = pd.pivot_table(df, index = ['device', 'weeks'], columns='website', values='adtext', aggfunc=lambda x: ' '.join(x))
My question to you
我对你的问题
I was hoping any of you could help me find a solution for this problem. This could either an efficient ranking script or something else to help me reach the final data structure.
我希望你们中的任何人都可以帮助我找到解决此问题的方法。这可以是一个有效的排名脚本或其他帮助我达到最终数据结构的东西。
Thank you!
谢谢!
Sebastiaan
塞巴斯蒂安
EDIT: Unfortunately, I think I was not clear in my original post. I am looking for a ordinal ranking that only gives integers and has no duplicate values. This means that when there is a duplicate value it will randomly give one a higher ranking than the other.
编辑:不幸的是,我认为我在原始帖子中并不清楚。我正在寻找一个仅给出整数且没有重复值的序数排名。这意味着当存在重复值时,它会随机给出一个比另一个更高的排名。
So what I would like to do is generate a ranking that labels each row with an ordinal value per group. The groups are based on the week number and device. The reason I want to create a new column with this ranking is so that I can make top 10s per week and device.
所以我想做的是生成一个排名,用每个组的序数值标记每一行。这些组基于周数和设备。我想用这个排名创建一个新列的原因是,我可以每周和设备排名前 10 名。
Also Steven G asked me for an example to play around with. I have provided that here.
史蒂文 G 还让我举了一个例子来玩。我已经在这里提供了。
Example data can be pasted directly into python
示例数据可以直接粘贴到python中
! IMPORTANT: The names are different in this sample. The dataframe is called placeholder, the column names are as follows: 'week', 'website', 'share', 'rank_google', 'device'.
!重要提示:此示例中的名称不同。数据框称为占位符,列名称如下:'week'、'website'、'share'、'rank_google'、'device'。
data = {u'week': [u'WK 1', u'WK 2', u'WK 3', u'WK 4', u'WK 2', u'WK 2', u'WK 1',
u'WK 3', u'WK 4', u'WK 3', u'WK 3', u'WK 4', u'WK 2', u'WK 4', u'WK 1', u'WK 1',
u'WK3', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 4', u'WK 4',
u'WK 4', u'WK 4', u'WK 2', u'WK 3', u'WK 4', u'WK 3', u'WK 4', u'WK 3', u'WK 2',
u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 3',
u'WK 1', u'WK 4', u'WK 1', u'WK 4', u'WK 2', u'WK 2', u'WK 2', u'WK 4', u'WK 4',
u'WK 4', u'WK 1', u'WK 3', u'WK 4', u'WK 4', u'WK 1', u'WK 4', u'WK 3', u'WK 2',
u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 1'],
u'website': [u'site1.nl', u'website2.de', u'site1.nl', u'site1.nl', u'anothersite.com',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at',
u'anothersite.com', u'site2.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk',
u'sitename2.com', u'sitename2.com', u'sitename2.com', u'url3.fi', u'sitename.co.uk',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'ulr2.se',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com',
u'sitename2.com', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com',
u'facebook.com', u'alsoasite.com', u'ello.com', u'instagram.com', u'alsoasite.com', u'facebook.com',
u'facebook.com', u'singleboersen-vergleich.at', u'facebook.com', u'anothername.com', u'twitter.com',
u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'facebook.com', u'alsoasite.com',
u'alsoasite.com'],
'adtext': [u'site1.nl 3,9 | < 10\xa0%', u'website2.de 1,4 | < 10\xa0%', u'site1.nl 4,3 | < 10\xa0%',
u'site1.nl 3,8 | < 10\xa0%', u'anothersite.com 2,5 | 12,36 %', u'url2.at 1,3 | 78,68 %', u'url2.at 1,2 | 92,58 %',
u'url2.at 1,1 | 85,47 %', u'url2.at 1,2 | 79,56 %', u'anothersite.com 2,8 | < 10\xa0%', u'url2.at 1,2 | 80,48 %',
u'url2.at 1,2 | 85,63 %', u'url2.at 1,1 | 88,36 %', u'url2.at 1,3 | 87,90 %', u'url2.at 1,1 | 83,70 %',
u'anothersite.com 3,1 | < 10\xa0%', u'url2.at 1,2 | 91,00 %', u'url2.at 1,1 | 92,11 %', u'url2.at 1,2 | 81,28 %'
, u'url2.at 1,1 | 86,49 %', u'anothersite.com 2,7 | < 10\xa0%', u'url2.at 1,2 | 83,96 %', u'url2.at 1,2 | 75,48 %'
, u'anothersite.com 3,0 | < 10\xa0%', u'site2.co.uk 3,1 | 16,24 %', u'sitename2.com 2,3 | 34,85 %',
u'sitename.co.uk 3,5 | < 10\xa0%', u'sitename.co.uk 3,6 | < 10\xa0%', u'sitename2.com 2,1 | < 10\xa0%',
u'sitename2.com 2,2 | 13,55 %', u'sitename2.com 2,1 | 47,91 %', u'url3.fi 3,4 | < 10\xa0%',
u'sitename.co.uk 3,1 | 14,15 %', u'sitename2.com 2,4 | 28,77 %', u'sitename.co.uk 3,1 | 22,55 %',
u'sitename2.com 2,1 | 17,03 %', u'sitename2.com 2,1 | 24,46 %', u'ulr2.se 2,7 | < 10\xa0%',
u'sitename2.com 2,0 | 49,12 %', u'sitename.co.uk 3,0 | < 10\xa0%', u'sitename2.com 2,1 | 40,00 %',
u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 30,29 %', u'sitename2.com 2,0 |47,48 %',
u'sitename2.com 2,1 | 32,17 %', u'sitename.co.uk 3,2 | < 10\xa0%', u'sitename.co.uk 3,1 | 12,77 %',
u'sitename2.com 2,6 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'alsoasite.com 2,3 | < 10\xa0%',
u'ello.com 1,8 | < 10\xa0%',u'instagram.com 5,0 | < 10\xa0%', u'alsoasite.com 2,2 | < 10\xa0%',
u'facebook.com 3,0 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'singleboersen-vergleich.at 2,6 | < 10\xa0%',
u'facebook.com 3,4 | < 10\xa0%', u'anothername.com 1,9 | <10\xa0%', u'twitter.com 4,4 | < 10\xa0%',
u'alsoasite.com 1,1 | 12,35 %', u'alsoasite.com 1,1 | 11,22 %', u'alsoasite.com 2,0 | < 10\xa0%',
u'alsoasite.com 1,1| 10,86 %', u'facebook.com 3,4 | < 10\xa0%', u'alsoasite.com 1,1 | 10,82 %',
u'alsoasite.com 1,1 | < 10\xa0%'],
u'share': [u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,36 %', u'78,68 %',
u'92,58 %', u'85,47 %', u'79,56 %', u'< 10\xa0%', u'80,48 %', u'85,63 %', u'88,36 %',
u'87,90 %', u'83,70 %', u'< 10\xa0%', u'91,00 %', u'92,11 %', u'81,28 %', u'86,49 %',
u'< 10\xa0%', u'83,96 %', u'75,48 %', u'< 10\xa0%', u'16,24 %', u'34,85 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'13,55 %', u'47,91 %', u'< 10\xa0%', u'14,15 %', u'28,77 %',
u'22,55 %', u'17,03 %', u'24,46 %', u'< 10\xa0%', u'49,12 %', u'< 10\xa0%', u'40,00 %',
u'< 10\xa0%', u'30,29 %', u'47,48 %', u'32,17 %', u'< 10\xa0%', u'12,77 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,35 %', u'11,22 %', u'< 10\xa0%',
u'10,86 %', u'< 10\xa0%', u'10,82 %', u'< 10\xa0%'],
u'rank_google': [u'3,9', u'1,4', u'4,3', u'3,8', u'2,5', u'1,3', u'1,2', u'1,1', u'1,2', u'2,8',
u'1,2', u'1,2', u'1,1', u'1,3', u'1,1', u'3,1', u'1,2', u'1,1', u'1,2', u'1,1', u'2,7', u'1,2',
u'1,2', u'3,0', u'3,1', u'2,3', u'3,5', u'3,6', u'2,1', u'2,2', u'2,1', u'3,4', u'3,1', u'2,4',
u'3,1', u'2,1', u'2,1', u'2,7', u'2,0', u'3,0', u'2,1', u'2,1', u'2,2', u'2,0', u'2,1', u'3,2',
u'3,1', u'2,6', u'3,2', u'2,3', u'1,8', u'5,0', u'2,2', u'3,0', u'3,2', u'2,6', u'3,4', u'1,9',
u'4,4', u'1,1', u'1,1', u'2,0', u'1,1', u'3,4', u'1,1', u'1,1'],
u'device': [u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Computer', u'Tablet', u'Tablet',
u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Mobile',
u'Tablet', u'Mobile', u'Computer', u'Computer', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Tablet',
u'Mobile', u'Mobile', u'Mobile', u'Tablet', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet',
u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Tablet',
u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Mobile']}
placeholder = pd.DataFrame(data)
Error I receive when I use the rank() function with method='first'
当我使用带有 method='first' 的 rank() 函数时收到的错误
C:\Users\username\code\report-creator>python recomp-report-04.py
Traceback (most recent call last):
File "recomp-report-04.py", line 71, in <module>
placeholder['ranking'] = placeholder.groupby(['week', 'device'])['rank_googl
e'].rank(method='first').astype(int)
File "<string>", line 35, in rank
File "C:\Users\sthuis\AppData\Local\Continuum\Anaconda2\lib\site-packages\pand
as\core\groupby.py", line 561, in wrapper
raise ValueError
ValueError
My solution
我的解决方案
Effectively, the answer is given by @Nickil Maveli. A huge thank you! Nevertheless, I thought it might be smart to outline how I finally incorporated the solution.
实际上,答案由@Nickil Maveli 给出。非常感谢!尽管如此,我认为概述我最终如何合并解决方案可能是明智的。
Rank(method='first') is a good way to get an ordinal ranking. But since I was working with numbers that were formatted in the European way, pandas interpreted them as strings and could not rank them this way. I came to this conclusion by the reaction of Nickil Maveli and trying to rank each group individually. I did that through the following code.
Rank(method='first') 是获得序数排名的好方法。但是由于我使用的是以欧洲方式格式化的数字,pandas 将它们解释为字符串并且无法以这种方式对它们进行排序。我通过 Nickil Maveli 的反应得出了这个结论,并试图对每个组进行单独排名。我通过以下代码做到了这一点。
for name, group in df.sort_values(by='rank_google').groupby(['weeks', 'device']):
df['new_rank'] = group['ranking'].rank(method='first').astype(int)
This gave me the following error:
这给了我以下错误:
ValueError: first not supported for non-numeric data
So this helped me realize that I should convert the column to floats. This is how I did it.
所以这帮助我意识到我应该将列转换为浮点数。我就是这样做的。
# Converting the ranking column to a float
df['ranking'] = df['ranking'].apply(lambda x: float(unicode(x.replace(',','.'))))
# Creating a new column with a rank
df['new_rank'] = df.groupby(['weeks', 'device'])['ranking'].rank(method='first').astype(int)
# Dropping all ranks after the 10
df = df.sort_values('new_rank').groupby(['weeks', 'device']).head(n=10)
# Pivotting the column
df = pd.pivot_table(df, index = ['device', 'weeks'], columns='new_rank', values='adtext', aggfunc=lambda x: ' '.join(x))
# Naming the columns with 'top' + number
df.columns = ['top ' + str(i) for i in list(df.columns.values)]
So this worked for me. Thank you guys!
所以这对我有用。谢谢你们!
采纳答案by Nickil Maveli
I think the way you were trying to use the method=first
to rank them after sorting were causing problems.
我认为您method=first
在排序后尝试使用 对它们进行排名的方式导致了问题。
You could simply use the rank method with first
arg on the grouped object itself giving you the desired unique ranks per group.
您可以简单地first
在分组对象本身上使用带有arg的 rank 方法,为您提供每组所需的唯一等级。
df['new_rank'] = df.groupby(['weeks','device'])['ranking'].rank(method='first').astype(int)
print (df['new_rank'])
0 2
1 3
2 1
3 4
4 3
5 1
6 2
7 4
8 2
9 3
10 1
11 4
12 2
13 3
14 1
15 4
Name: new_rank, dtype: int32
Perform pivot operation:
执行枢轴操作:
df = df.pivot_table(index=['weeks', 'device'], columns=['new_rank'],
values=['adtext'], aggfunc=lambda x: ' '.join(x))
Choose the second level of the multiindex columns which pertain to the rank numbers:
选择与排名编号相关的多索引列的第二级:
df.columns = ['rank_' + str(i) for i in df.columns.get_level_values(1)]
df
Data:(to replicate)
数据:(复制)
df = pd.DataFrame({'weeks': ['wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1',
'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2'],
'device': ['mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop',
'mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop'],
'website': ['url1', 'url2', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4',
'url1', 'url16', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4'],
'ranking': [2.1, 2.1, 1.0, 2.9, 2.1, 1.5, 1.5, 2.9,
2.0, 2.1, 1.0, 2.9, 2.1, 2.9, 1.0, 2.9],
'adtext': ['string', 'string', 'string', 'string', 'string', 'string', 'string', 'string',
'string', 'string', 'string', 'string', 'string', 'string', 'string', 'string']})
Note: method=first
assigns ranks in the order they appear in the array/series.
注意:method=first
按照它们在数组/系列中出现的顺序分配等级。