When should I ever want to use pandas apply() in my code?
Original question: http://stackoverflow.com/questions/54432583/
Notice: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by cs95
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
- If apply is so bad, then why is it in the API?
- How and when should I make my code apply-free?
- Are there ever any situations where apply is good (better than other possible solutions)?
Accepted answer by cs95
apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
- Run any user-defined function on a DataFrame or Series
- Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
- Perform index alignment while applying the function
- Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
- Perform element-wise transformations
- Broadcast aggregated results to original rows (see the result_type argument)
- Accept positional/keyword arguments to pass to the user-defined functions
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
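As a rough illustration of a few of these capabilities, here is a minimal sketch. The toy frame and the lambda functions are assumptions made for the example, not code from the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# Column-wise (axis=0, the default): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives each row as a Series.
row_total = df.apply(lambda row: row["A"] + row["B"], axis=1)

# Extra positional arguments are forwarded to the function through args.
scaled = df.apply(lambda col, factor: col * factor, args=(2,))

print(col_range)
print(row_total)
print(scaled)
```

Every one of these has a direct vectorized equivalent (df.max() - df.min(), df["A"] + df["B"], df * 2), which is the point the rest of the answer goes on to make.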
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bounded applications.
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance-wise, there's no comparison: the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 μs ± 8.16 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 μs ± 691 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
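The %timeit lines above only work inside IPython. A minimal sketch for reproducing the comparison in a plain Python script, assuming the standard-library timeit module is acceptable and that 200 repetitions are enough for a rough picture (absolute numbers will differ by machine and pandas version):

```python
import timeit

import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# Both forms compute the per-column range and should agree exactly.
via_apply = df.apply(lambda x: x.max() - x.min())
vectorized = df.max() - df.min()
print(via_apply.equals(vectorized))

# Time each form over the same number of repetitions.
t_apply = timeit.timeit(lambda: df.apply(lambda x: x.max() - x.min()), number=200)
t_vectorized = timeit.timeit(lambda: df.max() - df.min(), number=200)
print(f"apply:      {t_apply:.4f} s for 200 runs")
print(f"vectorized: {t_vectorized:.4f} s for 200 runs")
```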
String/Regex
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
'Name': ['mickey', 'donald', 'minnie'],
'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
'Value': [20, 10, 86]})
df
Name Value Title
0 mickey 20 wonderland
1 donald 10 welcome to donald's castle
2 minnie 86 Minnie mouse clubhouse
This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.
Using apply, this would be done using
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 μs ± 16.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
For more information on when list comprehensions should be considered a good option, see my writeup: For loops with pandas - When should I care?.
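One way the NaN-tolerant variant might look is sketched below. The helper name name_in_title and the choice to treat non-string values as "no match" are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["mickey", "donald", np.nan],
    "Title": ["wonderland", "welcome to donald's castle", "Minnie mouse clubhouse"],
})

def name_in_title(name, title):
    """Hypothetical helper: case-insensitive containment check.

    Non-string values such as NaN have no .lower() method, so they
    raise AttributeError and are treated as "no match".
    """
    try:
        return name.lower() in title.lower()
    except AttributeError:
        return False

# Same zip-based list comprehension as above, with the helper doing the work.
mask = [name_in_title(n, t) for n, t in zip(df["Name"], df["Title"])]
matches = df[mask]
print(matches)
```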
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime). Read more at the docs.
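A small sketch of the two spellings side by side, using a made-up three-row "date" column as an assumed input. Both produce the same timestamps; the difference is that the first parses the whole column in one call while the second calls pd.to_datetime once per element.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019-01-01", "2019-01-02", "2019-01-03"]})

# Preferred: one vectorized call over the whole column.
vectorized = pd.to_datetime(df["date"])

# Works, but invokes pd.to_datetime separately for every value.
via_apply = df["date"].apply(pd.to_datetime)

print((vectorized == via_apply).all())
print(vectorized.dt.day.tolist())
```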
A Common Pitfall: Exploding Columns of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 μs ± 40.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
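One detail worth keeping in mind: s.tolist() discards the Series index, so with a non-default index you would likely want to pass it back in. A minimal sketch, where the string index and the column names "first" and "second" are assumptions for the example:

```python
import pandas as pd

s = pd.Series([[1, 2]] * 3, index=["x", "y", "z"])

# Build the frame from plain Python lists, then restore the original index.
expanded = pd.DataFrame(s.tolist(), index=s.index, columns=["first", "second"])
print(expanded)

# The slow spelling gives the same values (with default integer column labels).
slow = s.apply(pd.Series)
print((expanded.to_numpy() == slow.to_numpy()).all())
```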
Lastly,
"Are there any situations where apply is good?"
apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))
versus
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
And so on...
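To make the string case concrete, here is a small runnable sketch. The two-column frame and the regex pattern "an|ap" are made up for the example and stand in for the elided arguments above.

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["apple", "banana"],
    "second": ["cherry", "grape"],
})
pattern = "an|ap"  # hypothetical pattern, chosen only for illustration

# One call per column via apply; each column is handled by the vectorized .str method.
via_apply = df.apply(lambda col: col.str.contains(pattern))

# The explicit alternative: build each column's result, then concatenate.
via_concat = pd.concat([df[c].str.contains(pattern) for c in df], axis=1)

print(via_apply)
print(via_apply.equals(via_concat))
```

Here apply is called once per column rather than once per row, which is why the overhead stays small.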
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.
The graph was plotted using the perfplot library.
import numpy as np
import pandas as pd
import perfplot
perfplot.show(
setup=lambda n: pd.Series(np.random.randint(0, n, n)),
kernels=[
lambda s: s.astype(str),
lambda s: s.apply(str)
],
labels=['astype', 'apply'],
n_range=[2**k for k in range(1, 20)],
xlabel='N',
logx=True,
logy=True,
equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is integer type.
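If perfplot is not installed, a single-size spot check can be done with the standard-library timeit module. This is only a sketch under assumed settings (10,000 integers, 20 repetitions); which of the two wins may depend on the pandas version, so no result is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(10_000))

# Both conversions should yield the same strings.
via_astype = s.astype(str)
via_apply = s.apply(str)
print((via_astype == via_apply).all())

t_astype = timeit.timeit(lambda: s.astype(str), number=20)
t_apply = timeit.timeit(lambda: s.apply(str), number=20)
print(f"astype(str): {t_astype:.4f} s for 20 runs")
print(f"apply(str):  {t_apply:.4f} s for 20 runs")
```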
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then two chained operations, such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
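A quick sketch checking that the two spellings agree on the example frame. Passing group_keys=False is an assumption added here so that the apply result keeps the original row index on recent pandas versions; it is not in the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabcccddee"), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})

# Two successive groupby calls: cumulative sum per group, then shift per group.
two_calls = df.groupby("A").B.cumsum().groupby(df.A).shift()

# One groupby call, with both steps chained inside the applied function.
one_call = df.groupby("A", group_keys=False).B.apply(lambda x: x.cumsum().shift())
one_call = one_call.sort_index()

print(two_calls.fillna(-1).tolist())
print(one_call.fillna(-1).tolist())
```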
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast path for evaluating the result; otherwise it falls back to a slow implementation.
df = pd.DataFrame({
'A': [1, 2],
'B': ['x', 'y']
})
def func(x):
    print(x['A'])
    return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)
Answer by jpp
Not all applys are alike
The below chart suggests when to consider apply [1]. Green means possibly efficient; red, avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
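A minimal sketch of what raw=True changes, using an assumed 4x3 integer frame. With raw=True the function receives each row as a plain NumPy array, so values are accessed by position rather than by column label, and the per-row Series construction is skipped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

# Default: each row arrives as a Series, indexed by column label.
with_series = df.apply(lambda row: row["a"] + row["c"], axis=1)

# raw=True: each row arrives as a NumPy array, indexed by position.
with_raw = df.apply(lambda row: row[0] + row[2], axis=1, raw=True)

# For something this simple, the vectorised form is still the natural choice.
vectorised = df["a"] + df["c"]

print(with_raw.tolist())
```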
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups, apply with a custom function may still offer reasonable performance.
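As an illustration of a groupwise aggregation with no single built-in method, here is a sketch of a per-group weighted mean. The column names g, x and w and the helper weighted_mean are assumptions for the example; the work inside the helper is done by the vectorised np.average.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "w": [1.0, 3.0, 1.0, 1.0],
})

def weighted_mean(group):
    """Hypothetical helper: weighted mean of x using weights w, for one group."""
    return np.average(group["x"], weights=group["w"])

# The function is called once per group, not once per row.
result = df.groupby("g")[["x", "w"]].apply(weighted_mean)
print(result)
```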
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
[1] There are exceptions, but these are usually marginal or uncommon. A couple of examples:
- df['col'].apply(str) may slightly outperform df['col'].astype(str).
- df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.
Answer by Pete Cacioppi
For axis=1 (i.e. row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply.)
def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
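A short usage sketch for the function above, repeated here so the snippet runs on its own. The sample frame and the row function are assumptions for illustration. Note that func receives a plain dict per row rather than a Series, so only key lookups such as row["A"] are available inside it.

```python
import pandas as pd

def faster_df_apply(df, func):
    # Same logic as the function above: iterate with itertuples and
    # hand each row to func as a {column: value} dict.
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["r1", "r2", "r3"])

result = faster_df_apply(df, lambda row: row["A"] + row["B"])
expected = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(result)
print((result == expected).all())
```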
Answer by astro123
Are there ever any situations where apply is good?
Yes, sometimes.
Task: decode Unicode strings.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply, just thinking that since NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol' list comprehension, thanks to the reminder by @jpp.
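A sketch of the list-comprehension version. To keep it free of third-party dependencies, this swaps unidecode for the standard-library unicodedata module; the helper strip_accents is an assumption for illustration and only removes combining accents, which is narrower than what unidecode does.

```python
import unicodedata

import pandas as pd

def strip_accents(text):
    """Hypothetical stand-in for unidecode.unidecode: drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

s = pd.Series(["mañana", "Ceñía"])

# apply version, as in the answer above.
via_apply = s.apply(strip_accents)

# Plain list comprehension, rebuilt into a Series with the same index.
via_listcomp = pd.Series([strip_accents(x) for x in s], index=s.index)

print(via_listcomp.tolist())
```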