Original source: http://stackoverflow.com/questions/52673285/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Performance of Pandas apply vs np.vectorize to create new column from existing columns
Asked by stackoverflowuser2010
I am using Pandas dataframes and want to create a new column as a function of existing columns. I have not seen a good discussion of the speed difference between df.apply() and np.vectorize(), so I thought I would ask here.
The Pandas apply() function is slow. From what I measured (shown below in some experiments), using np.vectorize() is 25x faster (or more) than using the DataFrame function apply(), at least on my 2016 MacBook Pro. Is this an expected result, and why?
For example, suppose I have the following dataframe with N rows:
N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
# A B
# 0 78 50
# 1 23 91
# 2 55 62
# 3 82 64
# 4 99 80
Suppose further that I want to create a new column as a function of the two columns A and B. In the example below, I'll use a simple function divide(). To apply the function, I can use either df.apply() or np.vectorize():
def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b
df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
df['result2'] = np.vectorize(divide)(df['A'], df['B'])
df.head()
# A B result result2
# 0 78 50 1.560000 1.560000
# 1 23 91 0.252747 0.252747
# 2 55 62 0.887097 0.887097
# 3 82 64 1.281250 1.281250
# 4 99 80 1.237500 1.237500
If I increase N to real-world sizes like 1 million or more, then I observe that np.vectorize() is 25x faster or more than df.apply().
Below is some complete benchmarking code:
import pandas as pd
import numpy as np
import time

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

for N in [1000, 10000, 100000, 1000000, 10000000]:
    print('')
    A_list = np.random.randint(1, 100, N)
    B_list = np.random.randint(1, 100, N)
    df = pd.DataFrame({'A': A_list, 'B': B_list})

    # Time df.apply. The timestamps are truncated with int(), so the
    # resolution is whole seconds.
    start_epoch_sec = int(time.time())
    df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
    end_epoch_sec = int(time.time())
    result_apply = end_epoch_sec - start_epoch_sec

    # Time np.vectorize the same way.
    start_epoch_sec = int(time.time())
    df['result2'] = np.vectorize(divide)(df['A'], df['B'])
    end_epoch_sec = int(time.time())
    result_vectorize = end_epoch_sec - start_epoch_sec

    print('N=%d, df.apply: %d sec, np.vectorize: %d sec' %
          (N, result_apply, result_vectorize))

    # Make sure results from df.apply and np.vectorize match.
    assert(df['result'].equals(df['result2']))
The results are shown below:
N=1000, df.apply: 0 sec, np.vectorize: 0 sec
N=10000, df.apply: 1 sec, np.vectorize: 0 sec
N=100000, df.apply: 2 sec, np.vectorize: 0 sec
N=1000000, df.apply: 24 sec, np.vectorize: 1 sec
N=10000000, df.apply: 262 sec, np.vectorize: 4 sec
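The timings above are in whole seconds, which is why the small-N rows read as 0 sec. A minimal sketch of the same comparison with sub-second resolution follows, assuming time.perf_counter() as the clock. The helper name time_call is hypothetical (not part of the question), and N is kept small so the sketch runs quickly:

```python
import time

import numpy as np
import pandas as pd


def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b


def time_call(func):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


N = 10000  # small on purpose; increase to reproduce the larger rows
rng = np.random.RandomState(0)
df = pd.DataFrame({'A': rng.randint(1, 100, N), 'B': rng.randint(1, 100, N)})

res_apply, sec_apply = time_call(
    lambda: df.apply(lambda row: divide(row['A'], row['B']), axis=1))
res_vec, sec_vec = time_call(
    lambda: np.vectorize(divide)(df['A'], df['B']))

print('N=%d, df.apply: %.4f sec, np.vectorize: %.4f sec'
      % (N, sec_apply, sec_vec))
```

A single run like this is noisy; repeating each call several times and taking the minimum (as %timeit or the timeit module does) gives steadier numbers.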
If np.vectorize() is in general always faster than df.apply(), then why is np.vectorize() not mentioned more? I only ever see StackOverflow posts related to df.apply(), such as:
pandas create new column based on values from other columns
How do I use Pandas 'apply' function to multiple columns?
Answered by jpp
I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays. [1] The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks. [2]
Python-level loops
Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.
# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0
np.random.seed(0)
N = 10**5
%timeit list(map(divide, df['A'], df['B'])) # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B']) # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])] # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)] # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True) # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1) # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()] # 11.6 s
Some takeaways:
- The tuple-based methods (the first 4) are many times faster than the pd.Series-based methods (the last 3).
- The np.vectorize, list comprehension + zip and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
- There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
pd.DataFrame.apply: just another loop
To see exactly the objects Pandas passes around, you can amend your function trivially:
def foo(row):
    print(type(row))
    assert False  # because you only need to see this once
df.apply(lambda row: foo(row), axis=1)
Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.
Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
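The same check can be run without the deliberate assert False by recording the types instead of printing them. This is only a sketch: make_recorder is a hypothetical helper, and the three-row dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [78, 23, 55], 'B': [50, 91, 62]})


def make_recorder(seen):
    """Hypothetical helper: build a row function that records the type it receives."""
    def record(row):
        seen.add(type(row))
        return 0.0  # dummy value so apply has something to collect
    return record


types_default = set()
types_raw = set()

# Default behaviour: each row is wrapped in a pd.Series before the call.
df.apply(make_recorder(types_default), axis=1)

# raw=True: each row is handed over as a plain NumPy array.
df.apply(make_recorder(types_raw), axis=1, raw=True)

print(types_default)
print(types_raw)
```

Sets are used rather than lists because the number of times the function is called per row is an implementation detail of Pandas.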
np.vectorize: fake vectorisation
The docs for np.vectorize have the following note:
"The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy."
The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.
In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds a chunky overhead. There's no JIT-compilation which you see with numba (see below). It's just a convenience.
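The parallel between map, np.vectorize and np.frompyfunc can be checked directly. The sketch below uses made-up four-element arrays and only shows that all three routes call the same Python function element by element and agree on the values; it says nothing about speed.

```python
import numpy as np


def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b


a = np.array([78, 23, 55, 0])
b = np.array([50, 91, 0, 64])

# Plain Python-level loop via map.
via_map = list(map(divide, a, b))

# np.vectorize: infers a float output dtype from the first result.
via_vectorize = np.vectorize(divide)(a, b)

# np.frompyfunc: takes the number of inputs (2) and outputs (1) and
# returns an object-dtype array, converted to float for comparison.
raw_frompyfunc = np.frompyfunc(divide, 2, 1)(a, b)
via_frompyfunc = raw_frompyfunc.astype(float)

print(via_map)
print(via_vectorize)
print(raw_frompyfunc.dtype)
```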
True vectorisation: what you should use
Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:
%timeit np.where(df['B'] == 0, 0, df['A'] / df['B']) # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0) # 1.96 ms
Yes, that's ~40x faster than the fastest of the above loopy solutions. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.
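As a small sanity check, the np.where version can be compared against the Python-level loop on data that does contain a zero divisor. The four-row dataframe below is made up for illustration. Note that np.where computes both branches for every row, so the division by zero is still evaluated (yielding inf) and then discarded by the condition.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [78, 23, 55, 82], 'B': [50, 0, 62, 64]})


def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b


# Python-level loop, as in the timings above.
looped = [divide(a, b) for a, b in zip(df['A'], df['B'])]

# Vectorised: whole-column division, then pick 0 where B is zero.
vectorised = np.where(df['B'] == 0, 0, df['A'] / df['B'])

print(looped)
print(vectorised)
```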
numba.njit: greater efficiency
When loops are considered viable they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.
Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.
import numpy as np
from numba import njit

@njit
def divide(a, b):
    # Allocate the output once, then fill it in a compiled loop.
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res
%timeit divide(df['A'].values, df['B'].values) # 717 μs
Using @njit(parallel=True) may provide a further boost for larger arrays.
[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.
[2] There are at least 2 reasons why NumPy operations are efficient versus Python:
- Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
- NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
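The per-object overhead can be seen with a rough size comparison. This is only an illustration, assuming CPython: sys.getsizeof reports the size of one Python int object, while itemsize is the number of bytes each element occupies in the NumPy array's buffer.

```python
import sys

import numpy as np

# One Python int is a full object (type pointer, reference count, value).
py_int_bytes = sys.getsizeof(1)

# One int64 element in a NumPy array is just 8 bytes in a contiguous buffer.
arr = np.arange(1000, dtype=np.int64)
np_int_bytes = arr.itemsize

print(py_int_bytes, np_int_bytes)
print(arr.flags['C_CONTIGUOUS'], arr.nbytes)
```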
Answered by PMende
The more complex your functions get (i.e., the less numpy can move to its own internals), the more you will see that the performance won't be that different. For example:
name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=100000))
def parse_name(name):
    if name.lower().startswith('a'):
        return 'A'
    elif name.lower().startswith('e'):
        return 'E'
    elif name.lower().startswith('i'):
        return 'I'
    elif name.lower().startswith('o'):
        return 'O'
    elif name.lower().startswith('u'):
        return 'U'
    return name
parse_name_vec = np.vectorize(parse_name)
Doing some timings:
Using apply
%timeit name_series.apply(parse_name)
Results:
76.2 ms ± 626 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using np.vectorize
%timeit parse_name_vec(name_series)
Results:
77.3 ms ± 216 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
NumPy tries to turn Python functions into NumPy ufunc objects when you call np.vectorize. How it does this, I don't actually know; you'd have to dig more into the internals of NumPy than I'm willing to at the moment. That said, it seems to do a better job on simple numerical functions than on this string-based function.
Cranking the size up to 1,000,000:
name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=1000000))
apply
%timeit name_series.apply(parse_name)
Results:
769 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.vectorize
%timeit parse_name_vec(name_series)
Results:
794 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A better (vectorized) way with np.select:
cases = [
    name_series.str.lower().str.startswith('a'), name_series.str.lower().str.startswith('e'),
    name_series.str.lower().str.startswith('i'), name_series.str.lower().str.startswith('o'),
    name_series.str.lower().str.startswith('u')
]
replacements = 'A E I O U'.split()
Timings:
%timeit np.select(cases, replacements, default=name_series)
Results:
67.2 ms ± 683 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
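For completeness, here is a small end-to-end sketch of the np.select approach, checked against the apply version. The six names are made up for illustration. Lower-casing the series once and converting the inputs with to_numpy() are my own variations, not part of the timed code above.

```python
import numpy as np
import pandas as pd

name_series = pd.Series(['adam', 'chang', 'Eliza', 'odom', 'ursula', 'ian'])


def parse_name(name):
    # Map names starting with a vowel to that vowel in upper case.
    for vowel in 'aeiou':
        if name.lower().startswith(vowel):
            return vowel.upper()
    return name


# Lower-case once, then build one boolean mask per vowel.
lowered = name_series.str.lower()
vowels = 'a e i o u'.split()
cases = [lowered.str.startswith(v).to_numpy(dtype=bool) for v in vowels]
replacements = [v.upper() for v in vowels]

# np.select takes the replacement of the first matching mask for each
# element and falls back to the original name otherwise.
selected = np.select(cases, replacements,
                     default=name_series.to_numpy(dtype=object))

print(list(selected))
```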