Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/30267338/

Setting values in pandas Series is slow, why?

Tags: python, pandas

Asked by jkokorian

Question

Does anyone know why setting an item directly on a pandas series is so incredibly slow? Am I doing something wrong, or is it just the way it is?

I ran a couple of tests to see what the fastest method is to set a value on a pandas Series object. Here are the results, ordered from fast to slow:

initialize array, set using integer index, create series

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

1000 loops, best of 3: 630 μs per loop

create empty list, add item using append, create series

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 1.05 ms per loop

initialize array, create series, set using set_value

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)

100 loops, best of 3: 18.5 ms per loop

initialize array, create series, set using integer index

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

10 loops, best of 3: 30.2 ms per loop

initialize array, create series, set using iat

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 36.2 ms per loop

initialize array, create series, set using iloc

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

1 loops, best of 3: 280 ms per loop
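
The cells above rely on IPython's `%%timeit` magic. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The function names and the repeat counts below are illustrative choices, not part of the original tests, and `set_value` is left out because it no longer exists in current pandas releases. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000  # same length as in the tests above


def fill_array_then_series():
    # Fill the NumPy array first, then wrap it in a Series.
    a = np.empty(N, dtype="float")
    for i in range(N):
        a[i] = 1.0
    return pd.Series(data=a)


def fill_series_getitem():
    # Set each element through the general-purpose [] indexer.
    s = pd.Series(np.empty(N, dtype="float"))
    for i in range(N):
        s[i] = 1.0
    return s


def fill_series_iat():
    # Set each element through the positional scalar accessor.
    s = pd.Series(np.empty(N, dtype="float"))
    for i in range(N):
        s.iat[i] = 1.0
    return s


def fill_series_iloc():
    # Set each element through the general positional indexer.
    s = pd.Series(np.empty(N, dtype="float"))
    for i in range(N):
        s.iloc[i] = 1.0
    return s


candidates = {
    "array, then Series": fill_array_then_series,
    "Series, s[i]": fill_series_getitem,
    "Series, s.iat[i]": fill_series_iat,
    "Series, s.iloc[i]": fill_series_iloc,
}

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {seconds * 1e3:9.3f} ms per loop")
```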

Accepted answer by EdChum

From the docs:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for.
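
To illustrate the cases the quoted passage lists, here is a small sketch in which the same `[]` operator is used for three different kinds of lookup, next to the scalar-only accessor `.at`. The series contents and variable names are made up for the example.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0], index=list("abcd"))

single = s["b"]        # single-label access, returns a scalar
sliced = s["b":"c"]    # label slice, both endpoints are included
masked = s[s > 25.0]   # boolean indexing, returns the matching rows

# .at only accepts a single label, so it has less to figure out.
scalar = s.at["b"]
```

Because `[]` has to work out which of these cases it was given on every call, a loop of scalar assignments pays that cost once per element.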

So I get the following, which should be comparable:

In [13]:

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 23.3 ms per loop

In [14]:

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

10 loops, best of 3: 159 ms per loop

For the other tests:

In [15]:

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 525 μs per loop

In [16]:

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)

100 loops, best of 3: 10.1 ms per loop

In [17]:

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

100 loops, best of 3: 17.5 ms per loop
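
For readers on newer pandas releases: `Series.set_value` was deprecated in pandas 0.21 and removed in 1.0, so the `set_value` test no longer runs there. A minimal sketch of the same loop using the scalar accessor `.at` instead, assuming the default integer index so that labels coincide with positions:

```python
import numpy as np
import pandas as pd

a = np.zeros(1000, dtype="float")
s = pd.Series(data=a)

# .at sets a single value by label; with the default RangeIndex
# the label i is also the position i.
for i in range(len(s)):
    s.at[i] = 1.0
```

No timing is claimed for this variant; it would need to be measured on the pandas version in use.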

Answer by Alexander

I think these methods are even faster for initializing a series to a constant value:

Baseline

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

10000 loops, best of 3: 121 μs per loop

Alternatives

%%timeit
s = pd.Series(np.empty(1000, dtype='float')) * 1.

10000 loops, best of 3: 99.5 μs per loop

%%timeit
constant = 5.
s = pd.Series(np.ones(1000)) * constant

10000 loops, best of 3: 85.3 μs per loop
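
One caveat on the first alternative: `np.empty` does not initialize its contents, so multiplying the result by `1.` leaves arbitrary values rather than a constant. As a hedged sketch of two other ways to build a constant series without a Python loop (the variable names are illustrative, and neither was timed in this answer):

```python
import numpy as np
import pandas as pd

constant = 5.0

# Let NumPy allocate and fill the array in one call.
s_full = pd.Series(np.full(1000, constant))

# Let pandas broadcast a scalar over an explicit index.
s_broadcast = pd.Series(constant, index=range(1000))
```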

Answer by jkokorian

I figured out how to get past the indexing overhead when setting values on a series object directly:

a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    a[i] = 1.0

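When a Series is initialized from a NumPy array, the data is not copied (at least in the pandas version used here; newer versions with copy-on-write may copy the array instead). If you keep a reference to the original array, you can simply set values on that!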
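
Whether the Series really is a view on the array can be checked rather than assumed. The sketch below uses `np.shares_memory` for that check and falls back to rebuilding the Series when the data turns out to have been copied; the fallback is an addition for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

a = np.zeros(1000, dtype="float")
s = pd.Series(data=a)

# True if the Series still points at the memory of `a`.
shares = np.shares_memory(a, s.to_numpy())

# Write through the array reference, here in one vectorized step.
a[:] = 1.0

if not shares:
    # The constructor copied the data, so writes to `a` did not reach `s`.
    s = pd.Series(data=a)

print("Series shared memory with the array:", shares)
```

Filling the array before constructing the Series avoids the question entirely and matches the fastest variant in the question.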