When should I ever want to use pandas apply() in my code?

Notice: this page is based on a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/54432583/

Time: 2020-09-14 06:18:23  Source: igfitidea

Tags: python, pandas, performance, apply

Asked by cs95

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".

I have read many articles on the topic of performance that explain that apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:

  1. If apply is so bad, then why is it in the API?
  2. How and when should I make my code apply-free?
  3. Are there ever any situations where apply is good (better than other possible solutions)?

Accepted answer by cs95

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

"If apply is so bad, then why is it in the API?"

DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument)
  • Accept positional/keyword arguments to pass to the user-defined functions
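A few of the capabilities listed above can be sketched on a toy frame. The helper name spread and its scale keyword are hypothetical, chosen only for illustration; this is a minimal sketch, not a recommended pattern.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

def spread(values, scale=1):
    """Hypothetical UDF: range of the values, optionally scaled."""
    return (values.max() - values.min()) * scale

# Column-wise (axis=0): the function receives each column as a Series.
col_spread = df.apply(spread, axis=0)

# Row-wise (axis=1), forwarding a keyword argument to the UDF.
row_spread = df.apply(spread, axis=1, scale=2)

# Broadcast each row's aggregated result back to the original shape.
broadcast = df.apply(spread, axis=1, result_type="broadcast")
```

Each call still loops in Python over the columns or rows, which is where the overhead discussed in this answer comes from.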

...Among others. For more information, see Row or Column-wise Function Application in the documentation.

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.

There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.



Let's address the next question.

"How and when should I make my code apply-free?"

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

Numeric Data

If you're working with numeric data, there is likely already a vectorized Cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply with the built-in equivalent for a simple addition operation.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance-wise, there's no comparison: the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 μs ± 8.16 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it's still about twice as slow as df.sum().

%timeit df.apply(np.sum, raw=True)
840 μs ± 691 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
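To make the effect of the raw argument concrete, the sketch below records what type of object apply hands to the function in each mode. The helper describe_input is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

def describe_input(col, seen):
    """Record the type apply passes in, then return the column sum."""
    seen.append(type(col).__name__)
    return col.sum()

seen_default, seen_raw = [], []

# Default: every column is passed as a pandas Series.
default_result = df.apply(describe_input, args=(seen_default,))

# raw=True: every column is passed as a plain NumPy array instead.
raw_result = df.apply(describe_input, raw=True, args=(seen_raw,))
```

Skipping the Series wrapper saves some per-call overhead, but the function is still invoked once per column, so the built-in df.sum() remains the better choice.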

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, seek out vectorized alternatives if possible.
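As one more illustration of swapping a row-wise apply for a vectorized call, here is a minimal sketch of a conditional column. The threshold of 3 and the rule itself are made up for the example; np.where is one of several vectorized tools that could be used here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# Row-wise apply: a Python-level loop over every row.
slow = df.apply(lambda row: row["A"] * 2 if row["A"] > 3 else row["B"], axis=1)

# Vectorized equivalent: the condition and both branches are computed on whole columns.
fast = pd.Series(np.where(df["A"] > 3, df["A"] * 2, df["B"]), index=df.index)
```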

String/Regex

Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.

With apply, this would be written as:

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool

df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 μs ± 16.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this with a custom function that you then call with arguments inside the list comprehension.
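A minimal sketch of that idea follows. The helper name name_in_title is hypothetical, and treating NaN or non-string values as "no match" is an assumption made for this example; your data may call for a different rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["mickey", "donald", np.nan],
    "Title": ["wonderland", "welcome to donald's castle", "Minnie mouse clubhouse"],
})

def name_in_title(name, title):
    """Return True if name occurs in title, ignoring case; False for NaN or non-string input."""
    try:
        return name.lower() in title.lower()
    except AttributeError:
        # Raised when either value is NaN (a float) or another non-string type.
        return False

# The custom function is called once per row inside the list comprehension.
mask = [name_in_title(n, t) for n, t in zip(df["Name"], df["Title"])]
result = df[mask]
```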

For more information on when list comprehensions should be considered a good option, see my writeup: For loops with pandas - When should I care?.

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime).

Read more at the docs.
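The two spellings in the note give the same values; only the number of calls differs. A small sketch, with a made-up date column:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019-01-02", "2019-01-04", "2019-01-06"]})

# One call that parses the whole column at once.
vectorized = pd.to_datetime(df["date"])

# One pd.to_datetime call per element, driven by a Python-level loop.
element_wise = df["date"].apply(pd.to_datetime)
```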

A Common Pitfall: Exploding Columns of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 μs ± 40.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
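One detail worth keeping in mind with the constructor approach: apply(pd.Series) keeps the original index, whereas a bare pd.DataFrame(s.tolist()) produces a fresh 0..n-1 index. A minimal sketch, using made-up labels, of carrying the index across:

```python
import pandas as pd

s = pd.Series([[1, 2], [3, 4], [5, 6]], index=["x", "y", "z"])

# Passing index=s.index keeps the original row labels, which a bare
# pd.DataFrame(s.tolist()) would replace with 0..n-1.
expanded = pd.DataFrame(s.tolist(), index=s.index)
```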


Lastly,

"Are there any situations where apply is good?"

apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames

What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object

This is an admissible case for apply:

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack the columns, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))

versus

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')

And so on...

Converting Series to str: astype versus apply

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to strings is comparable to (and sometimes faster than) using astype.

[Figure: timing plot comparing astype and apply for converting a Series to str.] The graph was plotted using the perfplot library.

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.

GroupBy operations with chained transformations

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy and then two primitive operations in sequence, such as a "lagged cumsum":

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

You'd need two successive groupby calls here:

df.groupby('A').B.cumsum().groupby(df.A).shift()

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a single call.

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce the number of groupby calls (because groupby is also quite expensive).
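A hedged note on reproducing this today: as far as I know, newer pandas releases (2.x) prepend the group keys to the index of a GroupBy.apply result by default, so the output may not line up with the original rows the way it does above. The sketch below uses transform with the same user-defined function, which keeps the original index; this is a substitute for the apply call, not the answer's own code.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabcccddee"), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})

# Two groupby calls, each using a built-in method.
two_calls = df.groupby("A")["B"].cumsum().groupby(df["A"]).shift()

# One groupby call with a UDF; transform returns a result aligned to the original index.
one_call = df.groupby("A")["B"].transform(lambda x: x.cumsum().shift())
```

Passing group_keys=False to groupby is another way to keep the original index if you want to stay with apply.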



Other Caveats

Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast path for evaluating the result; otherwise it falls back to a slow implementation.

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information).
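Whether the extra call happens depends on the pandas version; my understanding is that later releases evaluate the first row only once for DataFrame.apply as well, but treat that as an assumption. Rather than relying on printed output, you can count the invocations yourself, as in this minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
calls = []

def func(row):
    calls.append(row["A"])  # record every invocation
    return row

df.apply(func, axis=1)

# 0 means each row was visited exactly once; 1 means the first row was visited twice.
extra_calls = len(calls) - len(df)
```

This matters mostly when the function has side effects (writing files, incrementing counters, making network requests).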

Answer by jpp

Not all applys are alike

The chart below suggests when to consider apply [1]. Green means possibly efficient; red means avoid.

[Figure: chart indicating when to consider apply; green means possibly efficient, red means avoid.]

Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assuming clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).

If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
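A minimal sketch of the raw=True variant for a row-wise range calculation, with a fully vectorised NumPy reference for comparison. The frame size and the max-minus-min function are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((1000, 3)))

# Default: each row is wrapped in a Series before the function is called.
via_series = df.apply(lambda row: row.max() - row.min(), axis=1)

# raw=True: each row arrives as a plain NumPy array, skipping Series construction.
via_ndarray = df.apply(lambda row: row.max() - row.min(), axis=1, raw=True)

# Fully vectorised reference computed on the underlying array.
values = df.to_numpy()
reference = values.max(axis=1) - values.min(axis=1)
```

When the function only needs array operations, as here, raw=True gives the same values without building a Series for every row.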

GroupBy.apply: generally favoured

Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups, apply with a custom function may still offer reasonable performance.
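A weighted mean per group is one example of an aggregation that needs two columns at once. In the sketch below, the column names and the helper weighted_mean are hypothetical; the point is only that the function body is a single vectorised NumPy call per group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})

def weighted_mean(group):
    """Vectorised within each group: a single NumPy call per group."""
    return np.average(group["value"], weights=group["weight"])

# Selecting the needed columns first keeps the grouping column out of each group frame.
result = df.groupby("key")[["value", "weight"]].apply(weighted_mean)
```

With only two groups the function is called twice, so the Python-level overhead is negligible; the cost grows with the number of groups, not the number of rows.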

pd.DataFrame.apply column-wise: a mixed bag

pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms


[1] There are exceptions, but these are usually marginal or uncommon. A couple of examples:

  1. df['col'].apply(str) may slightly outperform df['col'].astype(str).
  2. df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.

Answer by Pete Cacioppi

For axis=1 (i.e. row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply.)

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
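A usage sketch for the helper, repeated here so the block runs on its own (with dict(zip(...)) in place of the dict comprehension, which behaves the same). Note that the function passed in receives a plain dict keyed by column name rather than a Series, so Series methods are not available inside it. The sample frame and labels are made up.

```python
import pandas as pd

def faster_df_apply(df, func):
    """Row-wise apply built on itertuples; func receives a dict keyed by column name."""
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        data.append(func(dict(zip(cols, row[1:]))))
        index.append(row[0])
    return pd.Series(data, index=index)

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]}, index=list("wxyz"))

# Both calls compute A + B per row; only the row container differs (dict vs Series).
via_helper = faster_df_apply(df, lambda row: row["A"] + row["B"])
via_apply = df.apply(lambda row: row["A"] + row["B"], axis=1)
```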

Answer by astro123

Are there ever any situations where apply is good? Yes, sometimes.

Task: decode Unicode strings.

import numpy as np
import pandas as pd
import unidecode

s = pd.Series(['mañana','Ceñía'])
s.head()
0    mañana
1     Ceñía


s.apply(unidecode.unidecode)
0    manana
1     Cenia

Update
I was by no means advocating for the use of apply; I was just thinking that since NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol' list comprehension, thanks to the reminder by @jpp.
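For completeness, here is what the list comprehension version might look like. To keep the sketch free of third-party dependencies it uses a small stand-in built on the standard library's unicodedata module instead of unidecode.unidecode; the helper strip_accents is hypothetical and only handles characters that decompose into a base letter plus combining marks.

```python
import unicodedata
import pandas as pd

def strip_accents(text):
    """Stand-in for unidecode.unidecode: drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

s = pd.Series(["mañana", "Ceñía"])

# A plain list comprehension in place of s.apply(...), keeping the original index.
result = pd.Series([strip_accents(x) for x in s], index=s.index)
```

With unidecode installed, the comprehension would simply call unidecode.unidecode(x) instead.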