python pandas: why is map faster?

Question

Asked by James Bond

In the pandas manual, there is this example about indexing:

In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]

Then Wes wrote:

# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]

Can anyone here explain why the map approach is faster? Is this a Python feature or a pandas feature?

Answer 1

Answered by DSM

Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:

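import random
import string

import numpy as np
import pandas as pd

def make_test(num, width):
    # Build a one-column frame of `num` random lowercase strings, each `width` letters long.
    # random.sample draws without replacement, so width must be <= 26.
    s = [''.join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})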

Let's compare the time they take to make the indexing object (whether a Series or a list) and the resulting time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast, but before it can be used as an index it needs to be internally converted to a Series or an ndarray or something, and so there's extra time added there.
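
Outside IPython, the same two-phase comparison can be sketched with the standard timeit module. This is only an illustration, not the code used for the numbers in this answer: the frame size, the seed, and the variable names (mask_series, mask_list, the t_* timings) are assumptions chosen for the example.

```python
import random
import string
import timeit

import pandas as pd

# Hypothetical test frame, built the same way as make_test above.
random.seed(0)
df = pd.DataFrame(
    {"a": [''.join(random.sample(string.ascii_lowercase, 10)) for _ in range(1000)]}
)

# Phase 1: build the indexing object (a boolean Series vs. a plain list of bools).
mask_series = df['a'].map(lambda x: x.startswith('t'))
mask_list = [x.startswith('t') for x in df['a']]

n = 20  # number of calls per measurement (arbitrary)
t_build_series = timeit.timeit(lambda: df['a'].map(lambda x: x.startswith('t')), number=n)
t_build_list = timeit.timeit(lambda: [x.startswith('t') for x in df['a']], number=n)

# Phase 2: index with the prebuilt object, so construction cost is excluded.
t_index_series = timeit.timeit(lambda: df[mask_series], number=n)
t_index_list = timeit.timeit(lambda: df[mask_list], number=n)

print(f"build Series: {t_build_series / n * 1e6:.1f} us, index with it: {t_index_series / n * 1e6:.1f} us")
print(f"build list:   {t_build_list / n * 1e6:.1f} us, index with it: {t_index_list / n * 1e6:.1f} us")
```

Timing the two phases separately makes it visible whether a cheap-to-build object pays for itself again at indexing time.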

First, for a small frame:

>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 μs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 μs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 μs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 μs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 μs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 μs per loop

and in this case the listcomp is fastest. That doesn't actually surprise me too much, to be honest, because going via a lambda is likely to be slower than using str.startswith directly, but it's really hard to guess. 10 is small enough that we're probably still measuring things like setup costs for Series; what happens in a larger frame?

>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop

And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?

>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop

and now the listcomp wins again!
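
Before comparing these variants on speed, it may be worth confirming that they select the same rows. The sketch below uses a tiny hypothetical frame with a non-default index (an assumption made for illustration; make_test produces a default RangeIndex, which is why a bare pd.Series(...) lines up in the timings above). A boolean Series is matched to the frame by index label, so the example passes index=df.index explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the labels 10..13 are chosen to show the alignment concern.
df = pd.DataFrame({"a": ["tango", "alpha", "tree", "bravo"]}, index=[10, 11, 12, 13])

flags = [x.startswith('t') for x in df['a']]

by_list = df[flags]                                # plain list of bools, positional
by_array = df[np.array(flags)]                     # boolean ndarray, positional
by_series = df[pd.Series(flags, index=df.index)]   # boolean Series, aligned on labels

print(by_list)
```

All three forms return the rows holding "tango" and "tree"; only the conversion work done before the lookup differs.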

Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.
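
One way to act on that last caveat is to check that every candidate produces the same result before timing any of them. The following is a rough sketch under assumed names (candidates, reference, best) and an arbitrary 1000-row frame, not a definitive benchmark; it takes the minimum over a few repeats to reduce noise.

```python
import timeit

import pandas as pd

# Hypothetical data: 1000 rows, half of which start with 't'.
df = pd.DataFrame({"a": ["tango", "alpha", "tree", "bravo"] * 250})

candidates = {
    "map": lambda: df[df['a'].map(lambda x: x.startswith('t'))],
    "listcomp": lambda: df[[x.startswith('t') for x in df['a']]],
    "str accessor": lambda: df[df['a'].str.startswith('t')],
}

# First make sure the candidates agree; otherwise the timings compare different work.
reference = candidates["map"]()
for name, fn in candidates.items():
    assert fn().equals(reference), f"{name} disagrees with the reference"

# Then time each one, keeping the best of several repeats.
calls = 10
best = {
    name: min(timeit.repeat(fn, number=calls, repeat=3))
    for name, fn in candidates.items()
}
for name, seconds in sorted(best.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds / calls * 1e3:.3f} ms per call")
```

The ranking printed here depends on the pandas version, the frame size, and the machine, which is the point of the answer.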
