Why is apply sometimes not faster than a for-loop on a pandas DataFrame?
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/38938318/
Asked by Wedoso
It seems that apply can speed up operations on a dataframe in most cases. But when I use apply, I don't see any speedup. Here is my example: I have a dataframe with two columns:
>>>df
index col1 col2
1 10 20
2 20 30
3 30 40
What I want to do is compute a value for each row of the dataframe by applying a function R(x) to col1 and dividing the result by the value in col2. For example, the result for the first row should be R(10)/20.
Here is my function, which will be called in apply:
def _f(input):
    return R(input['col1'])/input['col2']
Then I call _f in apply: df.apply(_f, axis=1)
But I find that in this case apply is much slower than a for loop such as:
for i in list(df.index):
    new_df.loc[i] = R(df.loc[i, 'col1'])/df.loc[i, 'col2']
Can anyone explain the reason?
Answered by juanpa.arrivillaga
It is my understanding that .apply is not generally faster than iteration over the axis. I believe that under the hood it is merely a loop over the axis, except that in this case you also incur the overhead of a function call on every iteration.
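To make the per-row call visible, here is a minimal sketch using the question's three-row dataframe. R is not given in the question, so the R below is a hypothetical stand-in (squaring), and the seen_types list is only there for illustration. It records what apply hands to the function on each call.

```python
import pandas as pd

# The three-row frame from the question.
df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])


def R(x):
    # Hypothetical stand-in for the question's R(x); the real function is not given.
    return x * x


seen_types = []  # illustration only: records the type passed in on each call


def _f(row):
    # With axis=1, apply passes one row at a time, wrapped in a Series.
    seen_types.append(type(row).__name__)
    return R(row["col1"]) / row["col2"]


result = df.apply(_f, axis=1)

print(result)
print("number of calls:", len(seen_types))
print("argument types:", set(seen_types))
```

Every row costs one Python-level function call plus the construction of a Series, which is why this is not expected to beat an explicit loop.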
If we look at the source code, we can see that essentially we are iterating over the indicated axis and applying the function, building up the individual results in a dictionary, and finally calling the DataFrame constructor on that dictionary to return a new DataFrame (or the Series constructor when each result is a scalar):
if axis == 0:
    series_gen = (self._ixs(i, axis=1)
                  for i in range(len(self.columns)))
    res_index = self.columns
    res_columns = self.index
elif axis == 1:
    res_index = self.index
    res_columns = self.columns
    values = self.values
    series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                    dtype=dtype)
                  for i, (arr, name) in enumerate(zip(values,
                                                      res_index)))
else:  # pragma : no cover
    raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))
i = None
keys = []
results = {}
if ignore_failures:
    successes = []
    for i, v in enumerate(series_gen):
        try:
            results[i] = func(v)
            keys.append(v.name)
            successes.append(i)
        except Exception:
            pass
    # so will work with MultiIndex
    if len(successes) < len(res_index):
        res_index = res_index.take(successes)
else:
    try:
        for i, v in enumerate(series_gen):
            results[i] = func(v)
            keys.append(v.name)
    except Exception as e:
        if hasattr(e, 'args'):
            # make sure i is defined
            if i is not None:
                k = res_index[i]
                e.args = e.args + ('occurred at index %s' %
                                   pprint_thing(k), )
        raise
if len(results) > 0 and is_sequence(results[0]):
    if not isinstance(results[0], Series):
        index = res_columns
    else:
        index = None
    result = self._constructor(data=results, index=index)
    result.columns = res_index
    if axis == 1:
        result = result.T
    result = result._convert(datetime=True, timedelta=True, copy=False)
else:
    result = Series(results)
    result.index = res_index
return result
Specifically:
for i, v in enumerate(series_gen):
    results[i] = func(v)
    keys.append(v.name)
Here series_gen was constructed based on the requested axis.
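The loop above can be imitated in a few lines of user code. This is only a simplified sketch of the axis=1, scalar-result path, not pandas' actual implementation; the name apply_rows_sketch and the stand-in R (squaring) are assumptions made for illustration.

```python
import pandas as pd


def R(x):
    # Hypothetical stand-in for the question's R(x).
    return x * x


def _f(row):
    return R(row["col1"]) / row["col2"]


def apply_rows_sketch(frame, func):
    # Simplified imitation of the axis=1 branch quoted above:
    # wrap each row in a Series, call func on it, collect results in a dict.
    results = {}
    keys = []
    for i, (arr, name) in enumerate(zip(frame.values, frame.index)):
        v = pd.Series(arr, index=frame.columns, name=name)
        results[i] = func(v)
        keys.append(v.name)
    # Scalar results: build a Series and relabel it with the row index.
    out = pd.Series(results)
    out.index = frame.index
    return out


df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])

manual = apply_rows_sketch(df, _f)
builtin = df.apply(_f, axis=1)

print(manual)
print(builtin)
```

Both versions do the same amount of Python-level work per row, so neither has a built-in speed advantage over the other.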
To get more performance out of a function, you can follow the advice given here.
Essentially, your options are:
- Write a C extension
- Use numba (a JIT compiler)
- Use pandas.eval to squeeze performance out of large DataFrames
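As a rough illustration of the last option, here is a sketch that uses pandas.eval on the question's data. It only works when R can be written as an arithmetic expression, so it assumes a hypothetical R(x) = x ** 2 + 1; an arbitrary Python function cannot be passed through pandas.eval this way. On a three-row frame the expression overhead will likely outweigh any gain, so treat this as a demonstration of the call rather than a benchmark.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])

# Assumption: R(x) = x ** 2 + 1 (hypothetical; the question does not define R).

# Row-by-row version, one Python call per row.
looped = df.apply(lambda row: (row["col1"] ** 2 + 1) / row["col2"], axis=1)

# Whole-column version: the expression is evaluated over entire columns at once.
# local_dict makes the name "df" inside the expression explicit.
evaluated = pd.eval("(df.col1 ** 2 + 1) / df.col2", local_dict={"df": df})

print(looped)
print(evaluated)
```

The two results should agree; the difference is that the second form never calls a Python function per row.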