Faster alternative to perform pandas groupby operation

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/51975512/

python, pandas, numpy, pandas-groupby

Asked by astrobiologist

I have a dataset with name (person_name), day and color (shirt_color) as columns.

Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.

E.g. input:

name  day  color
----------------
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White

I need to find the most frequently used color by each person.

E.g. result:

name    color
-------------
Jerry   Black
John    White
Tom     Blue

I am performing the following operation to get the results, which works fine but is quite slow:

most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?

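For reference, the example above can be rebuilt with a small frame like this (a minimal sketch of the sample data; the question's own snippet refers to it as data, while most answers call it df):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':  ['John'] * 5 + ['Tom'] * 4 + ['Jerry'] * 4,
    'day':   [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})
data = df  # alias used in the question's snippet
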
Answer by piRSquared

Numpy's numpy.add.at and pandas.factorize

This is intended to be fast. However, I tried to organize it to be readable as well.

# integer-encode names and colors; r and c hold the unique labels
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)

# b[name, color] counts how often each person wore each color
b = np.zeros((n, m), dtype=np.int64)
np.add.at(b, (i, j), 1)

# for each name, take the color with the highest count
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object
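
If you need the same two-column frame as in the question rather than a Series, the result can be wrapped up like this (a small sketch, assuming the i, r, j, c, b arrays from the snippet above):

most_frequent = pd.DataFrame({'name': r, 'color': c[b.argmax(1)]})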


groupby, size, and idxmax

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object


Counter

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object

Answer by DYZ

UPDATE

It must be hard to beat this (~10 times faster on the sample dataframe than any of the proposed pandas solutions and 1.5 times faster than the proposed numpy solution). The gist is to stay away from pandas and use itertools.groupby, which does a much better job with non-numerical data.

from itertools import groupby
from collections import Counter

# sort the raw rows so rows for the same name are adjacent, group on
# the first column (name), then count the last column (color) per group
pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
           for name, rows in groupby(sorted(df.values.tolist()),
                                     key=lambda row: row[0])})
# Jerry    Black
# John     White
# Tom       Blue
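
To sanity-check the speedup claim on your own machine, a quick sketch with IPython's %timeit (assuming the sample df from above; numbers vary with data shape):

%timeit df.groupby(['name', 'color']).size().unstack().idxmax(1)
%timeit pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
                   for name, rows in groupby(sorted(df.values.tolist()),
                                             key=lambda row: row[0])})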

Old Answer

Here's another method. It is actually slower than the original one, but I'll keep it here:

data.groupby('name')['color']\
    .apply(pd.Series.value_counts)\
    .unstack().idxmax(axis=1)
# name
# Jerry    Black
# John     White
# Tom       Blue

Answer by YOBEN_S

Solution from pd.Series.mode

df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1,drop=True)
Out[281]: 
name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
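
One caveat worth knowing: if a person has two equally frequent colors, pd.Series.mode returns all of them, so that name appears more than once in the result. A small sketch with a hypothetical tied frame:

tied = pd.DataFrame({'name': ['Ann', 'Ann'], 'color': ['Red', 'Blue']})
tied.groupby('name').color.apply(pd.Series.mode).reset_index(level=1, drop=True)
# name
# Ann      Blue
# Ann       Red
# Name: color, dtype: object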

Answer by André C. Andersen

How about doing two groupings with transform(max)?

# count days per (name, color) pair, then keep the rows where that
# count equals the name's maximum count
df = df.groupby(["name", "color"], as_index=False, sort=False).count()
idx = df.groupby("name", sort=False).transform(max)["day"] == df["day"]
df = df[idx][["name", "color"]].reset_index(drop=True)

Output:

    name  color
0   John  White
1    Tom   Blue
2  Jerry  Black

Answer by Deepak Saini

Similar to @piRSquared's pd.factorize and np.add.at answer.

We encode the strings in the columns using:

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)

But then, instead of doing this:

np.add.at(b, (i, j), 1)
max_columns_after_add_at = b.argmax(1)

We get max_columns_after_add_at using a jitted function, to do the add-at and find the maximum in the same loop:

import numpy as np
import numba as nb

@nb.jit(nopython=True, cache=True)
def add_at(x, rows, cols, val):
    # track, per row of x, the largest count seen so far and its column
    max_vals = np.zeros((x.shape[0],), np.int64)
    max_inds = np.zeros((x.shape[0],), np.int64)
    for i in range(len(rows)):
        r = rows[i]
        c = cols[i]
        x[r, c] += val
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds
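
The call itself then replaces both np.add.at and the separate argmax (assuming the i, j and b arrays factorized above):

max_columns_after_add_at = add_at(b, i, j, 1)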

And then we build the result series at the end:

ans = pd.Series(c[max_columns_after_add_at], r)

So, the only difference is how we do the argmax(axis=1) that follows np.add.at().

Timing analysis

import numpy as np
import numba as nb
m = 100000
n = 100000
rows = np.random.randint(low = 0, high = m, size=10000)
cols = np.random.randint(low = 0, high = n, size=10000)

So this:

%%time
x = np.zeros((m,n))
np.add.at(x, (rows, cols), 1)
maxs = x.argmax(1)

gives:

CPU times: user 12.4 s, sys: 38 s, total: 50.4 s
Wall time: 50.5 s

And this:

%%time
x = np.zeros((m,n))
maxs2 = add_at(x, rows, cols, 1)

gives:

CPU times: user 108 ms, sys: 39.4 s, total: 39.5 s
Wall time: 38.4 s
