Why is apply sometimes not faster than a for-loop in a pandas DataFrame?

Question

Asked by Wedoso

It seems that apply can speed up operations on a DataFrame in most cases, but when I use apply I don't see any speedup. Here is my example: I have a DataFrame with two columns

>>> df
index col1 col2
1 10 20
2 20 30
3 30 40

What I want to do is calculate a value for each row of the DataFrame by applying a function R(x) to col1 and dividing the result by the value in col2. For example, the result for the first row should be R(10)/20. So here is my function, which will be called in apply

def _f(input):
  return R(input['col1'])/input['col2']

Then I call _f in apply: df.apply(_f, axis=1)

But I find that in this case apply is much slower than a for loop, like

for i in list(df.index):
  new_df.loc[i] = R(df.loc[i,'col1'])/df.loc[i,'col2']

Can anyone explain the reason?

Answer 1

Answered by juanpa.arrivillaga

It is my understanding that .apply is not generally faster than iteration over the axis. I believe that under the hood it is merely a loop over the axis, except that in this case you also incur the overhead of a function call on each iteration.
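
One rough way to check this claim is to time both approaches on the same data. The sketch below is only an illustration: R is a hypothetical stand-in (the asker's real R is not shown), the column names follow the question, and the data size and the helper names with_apply and with_loop are assumptions. Timings vary by machine and pandas version, so no expected numbers are given.

```python
import timeit

import pandas as pd


def R(x):
    # Hypothetical stand-in for the asker's R(x); the real function is not shown.
    return x * x + 1


def _f(row):
    # Called once per row; `row` is a Series that pandas builds for that row.
    return R(row["col1"]) / row["col2"]


def with_apply(df):
    return df.apply(_f, axis=1)


def with_loop(df):
    # Plain Python loop over the two columns, with no per-row Series construction.
    values = [R(a) / b for a, b in zip(df["col1"], df["col2"])]
    return pd.Series(values, index=df.index)


df = pd.DataFrame({"col1": range(1, 2001), "col2": range(2, 2002)})

apply_seconds = timeit.timeit(lambda: with_apply(df), number=5)
loop_seconds = timeit.timeit(lambda: with_loop(df), number=5)
print(f"apply(axis=1): {apply_seconds:.4f}s, explicit loop: {loop_seconds:.4f}s")
```

Both helpers compute the same values, so any difference in the printed times comes from how the rows are visited rather than from the arithmetic.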

If we look at the source code, we can see that essentially we are iterating over the indicated axis and applying the function, collecting the individual results as Series in a dictionary, and finally calling the DataFrame constructor on the dictionary, returning a new DataFrame:

    if axis == 0:
        series_gen = (self._ixs(i, axis=1)
                      for i in range(len(self.columns)))
        res_index = self.columns
        res_columns = self.index
    elif axis == 1:
        res_index = self.index
        res_columns = self.columns
        values = self.values
        series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                        dtype=dtype)
                      for i, (arr, name) in enumerate(zip(values,
                                                          res_index)))
    else:  # pragma : no cover
        raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))

    i = None
    keys = []
    results = {}
    if ignore_failures:
        successes = []
        for i, v in enumerate(series_gen):
            try:
                results[i] = func(v)
                keys.append(v.name)
                successes.append(i)
            except Exception:
                pass
        # so will work with MultiIndex
        if len(successes) < len(res_index):
            res_index = res_index.take(successes)
    else:
        try:
            for i, v in enumerate(series_gen):
                results[i] = func(v)
                keys.append(v.name)
        except Exception as e:
            if hasattr(e, 'args'):
                # make sure i is defined
                if i is not None:
                    k = res_index[i]
                    e.args = e.args + ('occurred at index %s' %
                                       pprint_thing(k), )
            raise

    if len(results) > 0 and is_sequence(results[0]):
        if not isinstance(results[0], Series):
            index = res_columns
        else:
            index = None

        result = self._constructor(data=results, index=index)
        result.columns = res_index

        if axis == 1:
            result = result.T
        result = result._convert(datetime=True, timedelta=True, copy=False)

    else:

        result = Series(results)
        result.index = res_index

    return result

Specifically:

for i, v in enumerate(series_gen):
    results[i] = func(v)
    keys.append(v.name)

Here series_gen was constructed based on the requested axis.
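
To make the loop above concrete, here is a minimal imitation of the axis=1 path. It is a simplified sketch, not the actual pandas implementation: the function name apply_rows_sketch is made up for illustration, error handling and the DataFrame-returning branch are left out, and R is again a hypothetical stand-in.

```python
import pandas as pd


def apply_rows_sketch(frame, func):
    """Simplified imitation of the axis=1 path quoted above (not real pandas code)."""
    results = {}
    # One Series is built per row, then func is called on it in a Python loop.
    for i, (arr, name) in enumerate(zip(frame.values, frame.index)):
        row = pd.Series(arr, index=frame.columns, name=name)
        results[i] = func(row)
    # Scalar results are collected into a Series that reuses the row labels.
    out = pd.Series(results)
    out.index = frame.index
    return out


def R(x):
    # Hypothetical stand-in for the asker's R(x).
    return x * x


df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])
sketch = apply_rows_sketch(df, lambda row: R(row["col1"]) / row["col2"])
print(sketch)
```

Every row costs one Series construction plus one Python function call, which is why this path has no built-in advantage over a hand-written loop.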

To get more performance out of a function, you can follow the advice given here.

Essentially, your options are:

Write a C extension
Use numba (a JIT compiler)
Use pandas.eval to squeeze performance out of large DataFrames
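
As an illustration of the last option, the sketch below evaluates the whole calculation as one column-wise expression with pandas.eval. This only works when R can be written as an arithmetic expression; here R(x) = x ** 2 is an assumption chosen for the example, since the asker's real R is not shown. The columns are passed through local_dict so the expression does not depend on the calling scope.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])

# Assumption: R(x) = x ** 2, so R(col1) / col2 becomes a single expression
# that is evaluated over whole columns instead of row by row.
fast = pd.eval(
    "col1 ** 2 / col2",
    local_dict={"col1": df["col1"], "col2": df["col2"]},
)

# Row-wise version from the question, for comparison.
slow = df.apply(lambda row: row["col1"] ** 2 / row["col2"], axis=1)

print(fast)
```

If R cannot be expressed this way (for example, it contains arbitrary Python logic), numba or a C extension from the list above is the more likely fit.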
