Faster alternative to perform pandas groupby operation

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/51975512/

python, pandas, numpy, pandas-groupby

Asked by astrobiologist

I have a dataset with name (person_name), day and color (shirt_color) as columns.

Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.

E.g. input:

name  day  color
----------------
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White

I need to find the most frequently used color by each person.

E.g. result:

name    color
-------------
Jerry   Black
John    White
Tom     Blue

I am performing the following operation to get the results, which works fine but is quite slow:

most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?

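For reference, the example above can be rebuilt with a small frame like this (a minimal sketch of the sample data; the question's own snippet refers to it as data, while most answers call it df):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':  ['John'] * 5 + ['Tom'] * 4 + ['Jerry'] * 4,
    'day':   [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})
data = df  # alias used in the question's snippet
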
Answer by piRSquared

Numpy's numpy.add.at and pandas.factorize

This is intended to be fast. However, I tried to organize it to be readable as well.

# integer-encode names and colors; r and c hold the unique labels
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)

# b[name, color] counts how often each person wore each color
b = np.zeros((n, m), dtype=np.int64)
np.add.at(b, (i, j), 1)

# for each name, take the color with the highest count
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object
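
If you need the same two-column frame as in the question rather than a Series, the result can be wrapped up like this (a small sketch, assuming the i, r, j, c, b arrays from the snippet above):

most_frequent = pd.DataFrame({'name': r, 'color': c[b.argmax(1)]})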


groupby, size, and idxmax

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object


Counter

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object

Answer by DYZ

UPDATE

It must be hard to beat this (~10 times faster on the sample dataframe than any of the proposed pandas solutions and 1.5 times faster than the proposed numpy solution). The gist is to stay away from pandas and use itertools.groupby, which does a much better job with non-numerical data.

from itertools import groupby
from collections import Counter

# sort the raw rows so rows for the same name are adjacent, group on
# the first column (name), then count the last column (color) per group
pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
           for name, rows in groupby(sorted(df.values.tolist()),
                                     key=lambda row: row[0])})
# Jerry    Black
# John     White
# Tom       Blue
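
To sanity-check the speedup claim on your own machine, a quick sketch with IPython's %timeit (assuming the sample df from above; numbers vary with data shape):

%timeit df.groupby(['name', 'color']).size().unstack().idxmax(1)
%timeit pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
                   for name, rows in groupby(sorted(df.values.tolist()),
                                             key=lambda row: row[0])})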

Old Answer

Here's another method. It is actually slower than the original one, but I'll keep it here:

data.groupby('name')['color']\
    .apply(pd.Series.value_counts)\
    .unstack().idxmax(axis=1)
# name
# Jerry    Black
# John     White
# Tom       Blue

Answer by YOBEN_S

Solution from pd.Series.mode

df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1,drop=True)
Out[281]: 
name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
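
One caveat worth knowing: if a person has two equally frequent colors, pd.Series.mode returns all of them, so that name appears more than once in the result. A small sketch with a hypothetical tied frame:

tied = pd.DataFrame({'name': ['Ann', 'Ann'], 'color': ['Red', 'Blue']})
tied.groupby('name').color.apply(pd.Series.mode).reset_index(level=1, drop=True)
# name
# Ann      Blue
# Ann       Red
# Name: color, dtype: object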

Answer by André C. Andersen

How about doing two groupings with transform(max)?

# count days per (name, color) pair, then keep the rows where that
# count equals the name's maximum count
df = df.groupby(["name", "color"], as_index=False, sort=False).count()
idx = df.groupby("name", sort=False).transform(max)["day"] == df["day"]
df = df[idx][["name", "color"]].reset_index(drop=True)

Output:

    name  color
0   John  White
1    Tom   Blue
2  Jerry  Black

Answer by Deepak Saini

Similar to @piRSquared's pd.factorize and np.add.at answer.

We encode the strings in the columns using:

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)

But then, instead of doing this:

np.add.at(b, (i, j), 1)
max_columns_after_add_at = b.argmax(1)

We get max_columns_after_add_at using a jitted function, to do the add-at and find the maximum in the same loop:

import numpy as np
import numba as nb

@nb.jit(nopython=True, cache=True)
def add_at(x, rows, cols, val):
    # track, per row of x, the largest count seen so far and its column
    max_vals = np.zeros((x.shape[0],), np.int64)
    max_inds = np.zeros((x.shape[0],), np.int64)
    for i in range(len(rows)):
        r = rows[i]
        c = cols[i]
        x[r, c] += val
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds
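
The call itself then replaces both np.add.at and the separate argmax (assuming the i, j and b arrays factorized above):

max_columns_after_add_at = add_at(b, i, j, 1)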

And then we build the result series at the end:

ans = pd.Series(c[max_columns_after_add_at], r)

So, the only difference is how we do the argmax(axis=1) that follows np.add.at().

Timing analysis

import numpy as np
import numba as nb
m = 100000
n = 100000
rows = np.random.randint(low = 0, high = m, size=10000)
cols = np.random.randint(low = 0, high = n, size=10000)

So this:

%%time
x = np.zeros((m,n))
np.add.at(x, (rows, cols), 1)
maxs = x.argmax(1)

gives:

CPU times: user 12.4 s, sys: 38 s, total: 50.4 s
Wall time: 50.5 s

And this:

%%time
x = np.zeros((m,n))
maxs2 = add_at(x, rows, cols, 1)

gives:

CPU times: user 108 ms, sys: 39.4 s, total: 39.5 s
Wall time: 38.4 s
