Python: a faster way to rank rows within subgroups of a pandas DataFrame

Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/26720916/

Time: 2020-08-19 00:55:20  Source: igfitidea

Faster way to rank rows in subgroups in pandas dataframe

python pandas

Asked by captain ahab

I have a pandas data frame that is composed of different subgroups.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5, 6, 7, 8],
        'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
        'value': [.01, .4, .2, .3, .11, .21, .4, .01]
    })

I want to find the rank of each id within its group, with, say, lower values being better. In the example above, in group a, id 1 would have a rank of 1 and id 2 would have a rank of 4. In group b, id 5 would have a rank of 2, id 8 would have a rank of 1, and so on.

Right now I assess the ranks by:

  1. Sort by value.

    df.sort_values('value', ascending=True, inplace=True)

  2. Create a ranker function (it assumes the rows are already sorted):

    def ranker(df):
        # Number the rows 1..n in their current (sorted) order
        df['rank'] = np.arange(len(df)) + 1
        return df

  3. Apply the ranker function on each group separately:

    df = df.groupby(['group']).apply(ranker)

This process works, but it is really slow when I run it on millions of rows of data. Does anyone have any ideas on how to make a faster ranker function?

Accepted answer by Jeff

rank is cythonized, so it should be very fast, and you can pass the same options as df.rank(). Here are the docs for rank. As you can see, tie-breaks can be done in one of five different ways via the method argument.
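
As a minimal sketch of the tie-breaking options, assuming the question's column names and a small made-up frame containing one tie (the data and the rank_* column names are illustrative, not from the original post):

```python
import pandas as pd

# Illustrative data (not from the original post): group 'a' has a tie at 0.2
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'value': [0.2, 0.2, 0.1, 0.5, 0.3],
})

# One column per tie-breaking method accepted by rank()
for method in ['average', 'min', 'max', 'first', 'dense']:
    df['rank_' + method] = (
        df.groupby('group')['value'].rank(method=method, ascending=True)
    )

print(df)
```

With ascending=True, lower values get lower ranks, which matches the "lower is better" ordering asked for in the question. The methods only differ on the tied rows.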

It's also possible you simply want the .cumcount() of the group.

    In [12]: df.groupby('group')['value'].rank(ascending=False)
    Out[12]: 
    0    4
    1    1
    2    3
    3    2
    4    3
    5    2
    6    1
    7    4
    dtype: float64
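
The .cumcount() alternative mentioned above is not shown in the answer. A minimal sketch of how it might look on the question's sample data, assuming the rows are sorted by value first so that lower values receive lower ranks:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01],
})

# Sort so lower values come first, then number the rows within each group.
# cumcount() is 0-based, so add 1 to get ranks starting at 1.
df = df.sort_values('value')
df['rank'] = df.groupby('group').cumcount() + 1

# Restore the original row order for display
df = df.sort_index()
print(df)
```

Unlike rank(), this gives tied values consecutive integers in whatever order the sort leaves them, so it is closest to the question's original ranker function.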

Answer by Quentin Febvre

Working with a big DataFrame (13 million rows), the rank method with groupby maxed out my 8GB of RAM and took a really long time. I found a workaround that is less greedy with memory, which I am putting here just in case:

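    # Sort by group, then value, so each group's rows are contiguous and ordered
    # (sort_values returns a new frame, so the result must be assigned)
    df = df.sort_values(['group', 'value'])
    tmp = df.groupby('group').size()   # number of rows per group, in sorted group order
    rank = tmp.map(range)              # 0-based positions within each group
    rank = [item for sublist in rank for item in sublist]
    df['rank'] = rank                  # ranks start at 0; add 1 for 1-based ranks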
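
This workaround only lines up if each group's rows are contiguous and ordered by value. A minimal sketch that checks it against groupby rank on the question's sample data, assuming the frame is sorted by group and then by value (the names sizes and expected are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01],
})

# Workaround: sort by group then value, then count 1..n inside each group
df = df.sort_values(['group', 'value'])
sizes = df.groupby('group').size()
df['rank'] = [i + 1 for n in sizes for i in range(n)]  # 1-based ranks

# Reference result from the cythonized groupby rank
expected = df.groupby('group')['value'].rank(method='first').astype(int)

# Both Series share the same index, so the comparison is row-aligned
print((df['rank'] == expected).all())
```

The check is only a sanity test on eight rows; it says nothing about the memory or speed difference reported for 13 million rows.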