Faster alternative to perform pandas groupby operation
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/51975512/
Asked by astrobiologist
I have a dataset with name (person_name), day and color (shirt_color) as columns.
Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.
E.g. input:
name   day  color
-----------------
John    1   White
John    2   White
John    3   Blue
John    4   Blue
John    5   White
Tom     2   White
Tom     3   Blue
Tom     4   Blue
Tom     5   Black
Jerry   1   Black
Jerry   2   Black
Jerry   4   Black
Jerry   5   White
I need to find the most frequently used color by each person.
E.g. result:
name   color
------------
Jerry  Black
John   White
Tom    Blue
I am performing the following operation to get the results, which works fine but is quite slow:
most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])
Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?
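For anyone trying the answers below locally, the sample frame and the slow baseline from the question can be reconstructed like this (a minimal sketch; the variable names match the question's code):

```python
import pandas as pd

# Rebuild the example data from the question
data = pd.DataFrame({
    'name':  ['John'] * 5 + ['Tom'] * 4 + ['Jerry'] * 4,
    'day':   [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})

# The slow baseline: one mode() call per group
most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])
```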
Answered by piRSquared
Numpy's numpy.add.at and pandas.factorize
This is intended to be fast. However, I tried to organize it to be readable as well.
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)
np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)
John White
Tom Blue
Jerry Black
dtype: object
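Since np.add.at is known to be comparatively slow, the same factorization can also feed np.bincount to build the count matrix in one vectorized call. This is a hedged variant, not part of the original answer; the sample data here is for illustration only:

```python
import numpy as np
import pandas as pd

# Illustrative sample data (same colors per name as the question)
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'John', 'John',
              'Tom', 'Tom', 'Tom', 'Tom',
              'Jerry', 'Jerry', 'Jerry', 'Jerry'],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})

i, r = pd.factorize(df.name)   # integer codes and unique names
j, c = pd.factorize(df.color)  # integer codes and unique colors
n, m = len(r), len(c)

# Count (name, color) pairs with one flat bincount, then reshape to n x m
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
result = pd.Series(c[b.argmax(1)], r)
```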
groupby, size, and idxmax
df.groupby(['name', 'color']).size().unstack().idxmax(1)
name
Jerry Black
John White
Tom Blue
dtype: object
Counter

¯\_(ツ)_/¯
from collections import Counter
df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])
name
Jerry Black
John White
Tom Blue
Name: color, dtype: object
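The Counter idea can also skip pandas' groupby entirely with a single dictionary pass over the columns. This is a sketch under the same column assumptions, not from the original answer:

```python
from collections import Counter

import pandas as pd

# Illustrative sample data (no ties, so most_common(1) is unambiguous)
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'Tom', 'Tom', 'Tom', 'Jerry', 'Jerry'],
    'color': ['White', 'White', 'Blue', 'Blue', 'Blue', 'Black', 'Black', 'Black'],
})

# One pass: accumulate a per-name Counter of colors
counts = {}
for name, color in zip(df['name'], df['color']):
    counts.setdefault(name, Counter())[color] += 1

# Pick each name's most common color
result = {name: cnt.most_common(1)[0][0] for name, cnt in counts.items()}
```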
Answered by DYZ
UPDATE
It must be hard to beat this (~10 times faster on the sample dataframe than any proposed pandas solution and 1.5× faster than the proposed numpy solution). The gist is to stay away from pandas and use itertools.groupby, which does a much better job when it concerns non-numerical data.
from itertools import groupby
from collections import Counter
pd.Series({x: Counter(z[-1] for z in y).most_common(1)[0][0]
           for x, y in groupby(sorted(df.values.tolist()),
                               key=lambda x: x[0])})
# Jerry Black
# John White
# Tom Blue
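Note that itertools.groupby only groups consecutive equal keys, which is why the sorted() call above is essential. A self-contained sketch (assuming, as above, column order name/day/color so that z[-1] is the color):

```python
from collections import Counter
from itertools import groupby

import pandas as pd

# Illustrative sample data
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'Tom', 'Tom', 'Tom', 'Jerry', 'Jerry'],
    'day':   [1, 2, 3, 2, 3, 4, 1, 2],
    'color': ['White', 'White', 'Blue', 'Blue', 'Blue', 'Black', 'Black', 'Black'],
})

# Sort rows so equal names are adjacent, then group and count colors
rows = sorted(df.values.tolist())
result = pd.Series({name: Counter(z[-1] for z in grp).most_common(1)[0][0]
                    for name, grp in groupby(rows, key=lambda row: row[0])})
```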
Old Answer
Here's another method. It is actually slower than the original one, but I'll keep it here:
data.groupby('name')['color']\
.apply(pd.Series.value_counts)\
.unstack().idxmax(axis=1)
# name
# Jerry Black
# John White
# Tom Blue
Answered by YOBEN_S
Solution from pd.Series.mode
df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1,drop=True)
Out[281]:
name
Jerry Black
John White
Tom Blue
Name: color, dtype: object
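One caveat with this approach: pd.Series.mode returns all tied values, so a name with two equally frequent colors yields multiple rows after the apply. Taking only the first mode per group avoids that; a hedged tweak, not from the original answer (the name "Ann" and her colors are hypothetical, chosen to force a tie):

```python
import pandas as pd

# Illustrative data: Ann has a tie between Red and Green
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'Ann', 'Ann'],
    'color': ['White', 'White', 'Blue', 'Red', 'Green'],
})

# mode() keeps every tied value in sorted order; [0] takes the first one
result = df.groupby('name').color.apply(lambda s: s.mode()[0])
```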
Answered by André C. Andersen
How about doing two groupings with transform(max)?
df = df.groupby(["name", "color"], as_index=False, sort=False).count()
idx = df.groupby("name", sort=False).transform(max)["day"] == df["day"]
df = df[idx][["name", "color"]].reset_index(drop=True)
Output:
name color
0 John White
1 Tom Blue
2 Jerry Black
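A runnable sketch of this two-step approach, with illustrative sample data. After the first count(), the 'day' column holds the per-(name, color) counts; the transform then compares each row's count against its name's maximum (the string 'max' is used here, which is equivalent to passing the builtin max):

```python
import pandas as pd

# Illustrative sample data
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'Tom', 'Tom', 'Tom'],
    'day':   [1, 2, 3, 2, 3, 4],
    'color': ['White', 'White', 'Blue', 'Blue', 'Blue', 'Black'],
})

# Step 1: count rows per (name, color); 'day' now holds the counts
counts = df.groupby(['name', 'color'], as_index=False, sort=False).count()

# Step 2: keep rows whose count equals the per-name maximum
idx = counts.groupby('name', sort=False)['day'].transform('max') == counts['day']
result = counts[idx][['name', 'color']].reset_index(drop=True)
```

Note that, like the mode-based answer, ties produce one row per winning color.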
Answered by Deepak Saini
Similar to @piRSquared's pd.factorize and np.add.at answer.
We encode the strings in the columns using:
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)
But then, instead of doing this:
np.add.at(b, (i, j), 1)
max_columns_after_add_at = b.argmax(1)
We get max_columns_after_add_at using a JIT-compiled function, doing the add-at and finding the maximum in the same loop:
@nb.jit(nopython=True, cache=True)
def add_at(x, rows, cols, val):
    max_vals = np.zeros((x.shape[0],), np.int64)
    max_inds = np.zeros((x.shape[0],), np.int64)
    for i in range(len(rows)):
        r = rows[i]
        c = cols[i]
        x[r, c] += val
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds
And then get the dataframe in the end,
ans = pd.Series(c[max_columns_after_add_at], r)
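Putting the pieces together as one sketch. The @nb.jit decorator is dropped here so the example runs without numba installed; with numba available, decorating add_at exactly as above restores the speedup without changing the logic:

```python
import numpy as np
import pandas as pd

def add_at(x, rows, cols, val):
    # Accumulate counts and track each row's running argmax in one loop
    max_vals = np.zeros(x.shape[0], np.int64)
    max_inds = np.zeros(x.shape[0], np.int64)
    for k in range(len(rows)):
        r, c = rows[k], cols[k]
        x[r, c] += val
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds

# Illustrative sample data
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'Tom', 'Tom', 'Tom'],
    'color': ['White', 'White', 'Blue', 'Blue', 'Blue', 'Black'],
})

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
b = np.zeros((len(r), len(c)), dtype=np.int64)
ans = pd.Series(c[add_at(b, i, j, 1)], r)
```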
So, the difference is in how we do the argmax(axis=1) after np.add.at().
Timing analysis
import numpy as np
import numba as nb
m = 100000
n = 100000
rows = np.random.randint(low = 0, high = m, size=10000)
cols = np.random.randint(low = 0, high = n, size=10000)
So this:
%%time
x = np.zeros((m,n))
np.add.at(x, (rows, cols), 1)
maxs = x.argmax(1)
gives:
CPU times: user 12.4 s, sys: 38 s, total: 50.4 s
Wall time: 50.5 s
And this:
%%time
x = np.zeros((m,n))
maxs2 = add_at(x, rows, cols, 1)
gives:
CPU times: user 108 ms, sys: 39.4 s, total: 39.5 s
Wall time: 38.4 s