Performance of Pandas apply vs np.vectorize to create a new column from existing columns

Question

Asked by stackoverflowuser2010

I am using Pandas dataframes and want to create a new column as a function of existing columns. I have not seen a good discussion of the speed difference between df.apply() and np.vectorize(), so I thought I would ask here.

The Pandas apply() function is slow. From what I measured (shown below in some experiments), using np.vectorize() is 25x faster (or more) than using the DataFrame function apply(), at least on my 2016 MacBook Pro. Is this an expected result, and why?

For example, suppose I have the following dataframe with N rows:

import numpy as np
import pandas as pd

N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
#     A   B
# 0  78  50
# 1  23  91
# 2  55  62
# 3  82  64
# 4  99  80

Suppose further that I want to create a new column as a function of the two columns A and B. In the example below, I'll use a simple function divide(). To apply the function, I can use either df.apply() or np.vectorize():

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)

df['result2'] = np.vectorize(divide)(df['A'], df['B'])

df.head()
#     A   B    result   result2
# 0  78  50  1.560000  1.560000
# 1  23  91  0.252747  0.252747
# 2  55  62  0.887097  0.887097
# 3  82  64  1.281250  1.281250
# 4  99  80  1.237500  1.237500

If I increase N to real-world sizes like 1 million or more, then I observe that np.vectorize() is 25x faster or more than df.apply().

Below is some complete benchmarking code:

import pandas as pd
import numpy as np
import time

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

for N in [1000, 10000, 100000, 1000000, 10000000]:

    print('')
    A_list = np.random.randint(1, 100, N)
    B_list = np.random.randint(1, 100, N)
    df = pd.DataFrame({'A': A_list, 'B': B_list})

    # Time df.apply (whole seconds only, since time.time() is truncated to int).
    start_epoch_sec = int(time.time())
    df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
    end_epoch_sec = int(time.time())
    result_apply = end_epoch_sec - start_epoch_sec

    # Time np.vectorize the same way.
    start_epoch_sec = int(time.time())
    df['result2'] = np.vectorize(divide)(df['A'], df['B'])
    end_epoch_sec = int(time.time())
    result_vectorize = end_epoch_sec - start_epoch_sec

    print('N=%d, df.apply: %d sec, np.vectorize: %d sec' %
          (N, result_apply, result_vectorize))

    # Make sure results from df.apply and np.vectorize match.
    assert(df['result'].equals(df['result2']))

The results are shown below:

N=1000, df.apply: 0 sec, np.vectorize: 0 sec

N=10000, df.apply: 1 sec, np.vectorize: 0 sec

N=100000, df.apply: 2 sec, np.vectorize: 0 sec

N=1000000, df.apply: 24 sec, np.vectorize: 1 sec

N=10000000, df.apply: 262 sec, np.vectorize: 4 sec
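
Because int(time.time()) truncates to whole seconds, the smaller runs above all read as 0 or 1 sec. A minimal sketch of a finer-grained harness follows, assuming Python 3 and time.perf_counter; the helper name time_call is hypothetical, and N is kept small here so the sketch runs quickly:

```python
import time

import numpy as np
import pandas as pd


def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b


def time_call(func):
    """Return (elapsed seconds as a float, result) for a single call of func."""
    start = time.perf_counter()
    result = func()
    return time.perf_counter() - start, result


N = 10000  # deliberately small; raise it to reproduce the larger runs
rng = np.random.RandomState(0)
df = pd.DataFrame({'A': rng.randint(1, 100, N), 'B': rng.randint(1, 100, N)})

t_apply, res_apply = time_call(
    lambda: df.apply(lambda row: divide(row['A'], row['B']), axis=1))
t_vec, res_vec = time_call(
    lambda: np.vectorize(divide)(df['A'], df['B']))

print('N=%d, df.apply: %.3f sec, np.vectorize: %.3f sec' % (N, t_apply, t_vec))

# Both approaches should produce the same values.
assert np.allclose(res_apply.to_numpy(), res_vec)
```

A single call per method is still noisy; repeating each measurement (for example with the timeit module) would give steadier numbers.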

If np.vectorize() is in general always faster than df.apply(), then why is np.vectorize() not mentioned more? I only ever see StackOverflow posts related to df.apply(), such as:

pandas create new column based on values from other columns

How do I use Pandas 'apply' function to multiple columns?

How to apply a function to two columns of Pandas dataframe

Answer 1

Answered by jpp

I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays.¹ The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.²

Python-level loops

Now we can look at some timings. Below are all Python-level loops which produce either pd.Series, np.ndarray or list objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.

# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0

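np.random.seed(0)
N = 10**5
# Assumption: df is built as in the question (the original answer does not show this setup).
df = pd.DataFrame({'A': np.random.randint(1, 100, N),
                   'B': np.random.randint(1, 100, N)})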

%timeit list(map(divide, df['A'], df['B']))                                   # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B'])                                # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])]                      # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]     # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True)                  # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1)              # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]  # 11.6 s

Some takeaways:

The tuple-based methods (the first 4) are far more efficient than the pd.Series-based methods (the last 3): in the timings above, the slowest of the first four (112 ms) is still roughly 7x faster than the fastest of the last three (760 ms).
np.vectorize, list comprehension + zip, and map, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
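
To make the takeaways above concrete, here is a small sketch (on a toy dataframe, not the benchmark data) of what each loop flavour actually hands to the function on every iteration:

```python
import pandas as pd

df = pd.DataFrame({'A': [78, 23, 55], 'B': [50, 91, 62]})

# zip over two columns yields plain pairs of scalars, with no per-row container.
a, b = next(zip(df['A'], df['B']))
print(type(a).__name__, type(b).__name__)

# itertuples(index=False) yields one lightweight tuple-like record per row.
record = next(df[['A', 'B']].itertuples(index=False))
print(isinstance(record, tuple))

# iterrows yields (index, pd.Series) pairs, so a new Series is built for every row.
_, row = next(df[['A', 'B']].iterrows())
print(type(row).__name__)
```

The per-row Series construction in the last case is the overhead the slower methods pay.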

`pd.DataFrame.apply`: just another loop

To see exactly the objects Pandas passes around, you can amend your function trivially:

def foo(row):
    print(type(row))
    assert False  # because you only need to see this once
df.apply(lambda row: foo(row), axis=1)

Output: <class 'pandas.core.series.Series'>. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.

Do the same exercise again with raw=True and you'll see <class 'numpy.ndarray'>. All this is described in the docs, but seeing it is more convincing.
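
The same check can be run without raising an exception. This is a minimal sketch on a toy dataframe; the helper record_type and the seen dictionary are hypothetical names used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [78, 23, 55], 'B': [50, 91, 62]})

# Collect the type names of the objects that apply passes to the function.
seen = {'default': set(), 'raw': set()}


def record_type(row, key):
    seen[key].add(type(row).__name__)
    return 0


df.apply(lambda row: record_type(row, 'default'), axis=1)
df.apply(lambda row: record_type(row, 'raw'), axis=1, raw=True)

print(seen)
```

With the default settings each row arrives as a Series; with raw=True each row arrives as an ndarray.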

`np.vectorize`: fake vectorisation

The docs for np.vectorize have the following note:

The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

The "broadcasting rules" are irrelevant here, since the input arrays have the same dimensions. The parallel to map is instructive, since the map version above has almost identical performance. The source code shows what's happening: np.vectorize converts your input function into a Universal function ("ufunc") via np.frompyfunc. There is some optimisation, e.g. caching, which can lead to some performance improvement.
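
One way to see that the Python function is still called once per element is to count the calls. The sketch below is illustrative only: the counter is a hypothetical addition, and otypes is passed explicitly so that np.vectorize does not need a trial call to infer the output type.

```python
import numpy as np

calls = [0]  # mutable counter so the function can update it


def divide(a, b):
    calls[0] += 1
    if b == 0:
        return 0.0
    return float(a) / b


a = np.arange(1, 1001)
b = np.arange(1000, 0, -1)

# np.vectorize: a Python-level call for every element pair.
out = np.vectorize(divide, otypes=[float])(a, b)
calls_vectorize = calls[0]
print(calls_vectorize)

# np.frompyfunc builds the underlying ufunc directly; its result is an object array.
raw = np.frompyfunc(divide, 2, 1)(a, b)
print(raw.dtype)
```

No compiled loop over numeric values is involved in either case, which is why the timings match map and the list comprehension.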

In short, np.vectorize does what a Python-level loop should do, but pd.DataFrame.apply adds a chunky overhead. There's no JIT-compilation which you see with numba (see below). It's just a convenience.

True vectorisation: what you should use

Why aren't the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:

%timeit np.where(df['B'] == 0, 0, df['A'] / df['B'])       # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0)  # 1.96 ms

Yes, that's ~40x faster than the fastest of the above loopy solutions. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. numba below, if performance is critical and this is part of your bottleneck.
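
As a sanity check, the two vectorised expressions can be compared against the original divide() on a toy dataframe that actually contains zeros in B (the benchmark data, drawn from 1 to 99, never does). A minimal sketch:

```python
import numpy as np
import pandas as pd


def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b


df = pd.DataFrame({'A': [78, 23, 55, 7], 'B': [50, 0, 62, 0]})

# Reference result from the Python-level loop.
looped = [divide(a, b) for a, b in zip(df['A'], df['B'])]

# Both branches are computed for every row; A / B gives inf where B == 0,
# and np.where then picks 0 for those rows instead.
vectorised = np.where(df['B'] == 0, 0, df['A'] / df['B'])

# replace() only maps +/-inf to 0. A 0 / 0 row would give NaN and would need
# separate handling; that cannot occur here because A is never 0.
replaced = (df['A'] / df['B']).replace([np.inf, -np.inf], 0)

print(looped)
print(vectorised)
print(replaced.tolist())
```

All three agree on this input; the np.where form is the one that mirrors the b == 0 branch of divide() directly.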

`numba.njit`: greater efficiency

When loops are considered viable they are usually optimised via numba with underlying NumPy arrays to move as much as possible to C.

Indeed, numba improves performance to microseconds. Without some cumbersome work, it will be difficult to get much more efficient than this.

from numba import njit

@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values)  # 717 μs

Using @njit(parallel=True) may provide a further boost for larger arrays.
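
A rough sketch of what the parallel variant might look like, assuming numba is installed. The function name divide_parallel is hypothetical, and the ImportError fallback is only there so the sketch still runs (as plain Python, with no speed-up) when numba is missing:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for portability): no numba, so run the loop in plain Python.
    prange = range

    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@njit(parallel=True)
def divide_parallel(a, b):
    res = np.empty(a.shape[0], dtype=np.float64)
    # prange marks the loop as safe to split across threads:
    # each iteration writes only to its own res[i].
    for i in prange(a.shape[0]):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0.0
    return res


a = np.array([78, 23, 55, 7], dtype=np.int64)
b = np.array([50, 0, 62, 4], dtype=np.int64)
print(divide_parallel(a, b))
```

On small arrays the threading overhead can outweigh the gain, so it is worth timing both versions on realistic sizes before switching.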

¹Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.

²There are at least 2 reasons why NumPy operations are efficient versus Python:

Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
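
The per-object overhead can be glimpsed with a quick size comparison. This is a minimal sketch; the exact size of a Python int depends on the interpreter build, so only the NumPy figures are fixed:

```python
import sys

import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A Python int is a full object with its own header; getsizeof reports its size in bytes.
print(sys.getsizeof(py_list[-1]))

# Each NumPy element occupies exactly itemsize bytes inside one contiguous buffer.
print(np_array.itemsize, np_array.nbytes)
print(np_array.flags['C_CONTIGUOUS'])
```

The list additionally stores a pointer to each int object, whereas the array stores the raw 8-byte values back to back.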

Answer 2

Answered by PMende

The more complex your functions get (i.e., the less numpy can move to its own internals), the more you will see that the performance won't be that different. For example:

name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=100000))

def parse_name(name):
    if name.lower().startswith('a'):
        return 'A'
    elif name.lower().startswith('e'):
        return 'E'
    elif name.lower().startswith('i'):
        return 'I'
    elif name.lower().startswith('o'):
        return 'O'
    elif name.lower().startswith('u'):
        return 'U'
    return name

parse_name_vec = np.vectorize(parse_name)

Doing some timings:

Using Apply

%timeit name_series.apply(parse_name)

Results:

76.2 ms ± 626 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using np.vectorize

%timeit parse_name_vec(name_series)

Results:

77.3 ms ± 216 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy tries to turn Python functions into numpy ufunc objects when you call np.vectorize. How it does this, I don't actually know - you'd have to dig more into the internals of numpy than I'm willing to at the moment. That said, it seems to do a better job on simple numerical functions than on this string-based function here.

Cranking the size up to 1,000,000:

name_series = pd.Series(np.random.choice(['adam', 'chang', 'eliza', 'odom'], replace=True, size=1000000))

apply

%timeit name_series.apply(parse_name)

Results:

769 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.vectorize

%timeit parse_name_vec(name_series)

Results:

794 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A better (vectorized) way with np.select:

cases = [
    name_series.str.lower().str.startswith('a'), name_series.str.lower().str.startswith('e'),
    name_series.str.lower().str.startswith('i'), name_series.str.lower().str.startswith('o'),
    name_series.str.lower().str.startswith('u')
]
replacements = 'A E I O U'.split()

Timings:

%timeit np.select(cases, replacements, default=name_series)

Results:

67.2 ms ± 683 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
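
For completeness, here is a self-contained sketch that checks the np.select approach against apply. It departs from the snippet above in a few labelled ways: parse_name is a compact restatement of the function above, the series is lower-cased once rather than once per condition, and the inputs are converted to NumPy arrays explicitly.

```python
import numpy as np
import pandas as pd


def parse_name(name):
    # Compact restatement of parse_name above: map a leading vowel to its capital.
    first = name.lower()[:1]
    if first in ('a', 'e', 'i', 'o', 'u'):
        return first.upper()
    return name


rng = np.random.RandomState(0)
name_series = pd.Series(rng.choice(['adam', 'chang', 'eliza', 'odom'], size=1000))

lowered = name_series.str.lower()  # lower-case once instead of once per condition
vowels = 'aeiou'
cases = [lowered.str.startswith(v).to_numpy(dtype=bool) for v in vowels]
replacements = [v.upper() for v in vowels]

# The first matching condition wins; rows matching none keep their original name.
selected = np.select(cases, replacements,
                     default=name_series.to_numpy(dtype=object))

expected = name_series.apply(parse_name)
assert list(selected) == list(expected)
```

Hoisting the lower() call is a small optimisation that the timing above does not include, so it may shift the measured figure somewhat.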
