python pandas: why map is faster?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18932254/

Time: 2020-09-13 21:10:39  Source: igfitidea

python pandas: why map is faster?

python pandas

Asked by James Bond

In the pandas manual, there is this example about indexing:

In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]

Then Wes wrote:

# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]

Can anyone here explain why the map approach is faster? Is this a Python feature or a pandas feature?

Answered by DSM

Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:

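import random
import string

import numpy as np
import pandas as pd

def make_test(num, width):
    # Build `num` random lowercase strings, each `width` characters long.
    # random.sample draws without replacement, so width must be <= 26.
    s = [''.join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    df = pd.DataFrame({"a": s})
    return df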

Let's compare the time they take to make the indexing object (whether a Series or a list) and the resulting time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast, but before it can be used as an index it needs to be internally converted to a Series or an ndarray or something, so extra time is added there.
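
Outside IPython, the same two-stage comparison can be sketched with the standard timeit module. This is only an illustration: the helper name best_time, the frame size, and the repeat/number values are my own choices and not part of the original answer, and the numbers it prints will differ between machines and pandas versions.

```python
import random
import string
import timeit

import pandas as pd


def make_test(num, width):
    # Same setup as in the answer: `num` random lowercase strings of length `width`.
    s = [''.join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})


def best_time(func, repeat=3, number=10):
    # Hypothetical helper: best average seconds per call over `repeat` rounds.
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


df = make_test(10**4, 10)

# Stage 1: build the indexing object (a Series vs. a plain list).
build_map = best_time(lambda: df['a'].map(lambda x: x.startswith('t')))
build_list = best_time(lambda: [x.startswith('t') for x in df['a']])

# Stage 2: index with an object that has already been built.
mask_series = df['a'].map(lambda x: x.startswith('t'))
mask_list = [x.startswith('t') for x in df['a']]
index_series = best_time(lambda: df[mask_series])
index_list = best_time(lambda: df[mask_list])

print(f"build: map {build_map:.2e} s, listcomp {build_list:.2e} s")
print(f"index: Series {index_series:.2e} s, list {index_list:.2e} s")
```

Timing the two stages separately makes it visible whether a cheap-to-build list pays for itself again when pandas has to convert it during indexing.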

First, for a small frame:

>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 μs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 μs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 μs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 μs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 μs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 μs per loop

In this case the listcomp is fastest. That doesn't actually surprise me too much, to be honest, because going via a lambda is likely to be slower than using str.startswith directly, but it's really hard to guess. 10 is small enough that we're probably still measuring things like setup costs for Series; what happens in a larger frame?
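
To look at the lambda layer on its own, here is a rough pure-Python sketch with no pandas involved. It isolates only one factor (the extra function call per element) and says nothing about whatever else Series.map does internally; the names words, via_lambda and direct are made up for this illustration.

```python
import random
import string
import timeit

random.seed(0)
words = [''.join(random.sample(string.ascii_lowercase, 10)) for _ in range(10**4)]

# The extra layer: every element goes through one more Python-level call.
starts_with_t = lambda x: x.startswith('t')


def via_lambda():
    return [starts_with_t(x) for x in words]


def direct():
    return [x.startswith('t') for x in words]


t_lambda = min(timeit.repeat(via_lambda, repeat=3, number=20)) / 20
t_direct = min(timeit.repeat(direct, repeat=3, number=20)) / 20

print(f"via lambda: {t_lambda:.2e} s per pass")
print(f"direct:     {t_direct:.2e} s per pass")
```

Both functions return the same list, so any difference in the printed times comes from the call overhead alone. Which one wins, and by how much, is something to measure rather than assume.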

>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop

And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?

>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop

and now the listcomp wins again!
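
For reference, this is what the manual conversion looks like on a tiny hand-made frame (the four strings are invented for the illustration). One assumption to flag: a boolean Series is matched to the frame by index label, so pd.Series(mask_list) works here because the frame has the default integer index, as the frames from make_test do; with any other index it would be safer to pass index=df.index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["tab", "cat", "tea", "dog"]})

mask_list = [x.startswith('t') for x in df['a']]  # plain Python list of bools
mask_array = np.array(mask_list)                  # NumPy boolean array
mask_series = pd.Series(mask_list)                # boolean Series, default 0..n-1 index

print(type(mask_list).__name__, mask_array.dtype, mask_series.dtype)

# All three objects are accepted as a boolean indexer.
print(df[mask_list])
print(df[mask_array])
print(df[mask_series])
```

The three indexers carry the same information. What changes is how much conversion work is left for pandas to do inside the indexing call, which is the part the timings above are picking up.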

Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.
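
One cheap way to act on that last point is to check that the candidates actually return the same thing before timing any of them. A minimal sketch, with a made-up five-row frame and a hypothetical candidates dict:

```python
import pandas as pd

df = pd.DataFrame({"a": ["tab", "cat", "tea", "dog", "toy"]})

# The three approaches from the answer, wrapped so they can be called uniformly.
candidates = {
    "map": lambda: df[df['a'].map(lambda x: x.startswith('t'))],
    "listcomp": lambda: df[[x.startswith('t') for x in df['a']]],
    "str": lambda: df[df['a'].str.startswith('t')],
}

results = {name: func() for name, func in candidates.items()}
reference = results["map"]

# Refuse to go on to timing if any variant selects different rows.
for name, result in results.items():
    assert result.equals(reference), f"{name} selects different rows"

print(list(reference['a']))
```

If an assertion fails, the timings would be comparing different computations, and a speed difference between them would mean very little.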
