Why is setting values in a pandas Series so slow?

Question

Asked by jkokorian

Does anyone know why setting an item directly on a pandas series is so incredibly slow? Am I doing something wrong, or is it just the way it is?

I ran a couple of tests to find the fastest way to set a value on a pandas Series object (with numpy imported as np and pandas as pd). Here are the results, ordered from fast to slow:

initialize array, set using integer index, create series

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

1000 loops, best of 3: 630 μs per loop

create empty list, add item using append, create series

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 1.05 ms per loop

initialize array, create series, set using set_value

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)

100 loops, best of 3: 18.5 ms per loop

initialize array, create series, set using integer index

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

10 loops, best of 3: 30.2 ms per loop

initialize array, create series, set using iat

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 36.2 ms per loop

initialize array, create series, set using iloc

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

1 loops, best of 3: 280 ms per loop

Answer 1

Accepted answer by EdChum

From the docs:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for.
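
To make the quote concrete, here is a minimal sketch of the different requests that `[]` has to tell apart, next to the scalar-only accessors `.at` and `.iat`, which skip that dispatch. The labels and values are made up for illustration.

```python
import pandas as pd

# Illustrative data; the labels "a", "b", "c" are arbitrary.
s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

# One operator, several meanings that pandas must distinguish at runtime:
single = s["b"]        # single-label access, returns a scalar
sliced = s["a":"b"]    # label slice (end label included), returns a Series
masked = s[s > 15.0]   # boolean indexing, returns a Series

# Scalar-only accessors: they accept exactly one label or one position.
by_label = s.at["b"]   # label-based scalar access
by_pos = s.iat[1]      # position-based scalar access

# Scalar assignment works the same way.
s.iat[0] = 11.0
```

Because `.at` and `.iat` only ever handle a single element, they have less to figure out per call than `[]` or `.iloc`.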

So I get the following timings, which should be comparable:

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 23.3 ms per loop

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

10 loops, best of 3: 159 ms per loop

For the other tests:

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 525 μs per loop

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)

100 loops, best of 3: 10.1 ms per loop

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

100 loops, best of 3: 17.5 ms per loop
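
For anyone who wants to rerun this comparison outside IPython, the following is a rough sketch using the standard `timeit` module. Two things here are my own choices rather than part of the original tests: the function names are hypothetical, and `s.at[i]` stands in for `s.set_value(i, ...)`, since `set_value` was deprecated and later removed from pandas. Absolute numbers will vary with machine and pandas version, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000  # same series length as in the tests above


def fill_array_then_series():
    # Fill a plain NumPy array first, wrap it in a Series afterwards.
    a = np.empty(N, dtype=float)
    for i in range(N):
        a[i] = 1.0
    return pd.Series(a)


def fill_with_at():
    # Label-based scalar setter (stand-in for the removed set_value).
    s = pd.Series(np.empty(N, dtype=float))
    for i in range(N):
        s.at[i] = 1.0
    return s


def fill_with_iat():
    # Position-based scalar setter.
    s = pd.Series(np.empty(N, dtype=float))
    for i in range(N):
        s.iat[i] = 1.0
    return s


def fill_with_iloc():
    # General position-based indexer, used here for single elements.
    s = pd.Series(np.empty(N, dtype=float))
    for i in range(N):
        s.iloc[i] = 1.0
    return s


candidates = {
    "array, then Series": fill_array_then_series,
    "Series.at": fill_with_at,
    "Series.iat": fill_with_iat,
    "Series.iloc": fill_with_iloc,
}

# Best of 3 repeats, 2 calls per repeat, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=2, repeat=3)) / 2
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>20}: {seconds * 1e3:8.3f} ms per loop")
```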

Answer 2

Answer by Alexander

I think these methods are even faster for initializing a series to a constant value:

Baseline

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

10000 loops, best of 3: 121 μs per loop

Alternatives

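%%timeit
# Caution: np.empty returns uninitialized memory, so multiplying by 1. does not set the values to 1.
s = pd.Series(np.empty(1000, dtype='float')) * 1.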

10000 loops, best of 3: 99.5 μs per loop

%%timeit
constant = 5.
s = pd.Series(np.ones(1000)) * constant

10000 loops, best of 3: 85.3 μs per loop
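
Two more ways to build a constant series without a Python loop, as a small sketch: `np.full` creates the filled array in one step, and the `Series` constructor broadcasts a scalar over an explicit index. The names `n` and `constant` are just illustrative, and I have not timed these against the alternatives above.

```python
import numpy as np
import pandas as pd

n = 1000
constant = 5.0

# Let NumPy create the already-filled array, then wrap it.
s_full = pd.Series(np.full(n, constant))

# Let pandas broadcast a scalar over an explicit index.
s_scalar = pd.Series(constant, index=range(n))
```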

Answer 3

Answer by jkokorian

I figured out how to get past the indexing overhead when setting values on a series object directly:

a = np.empty(1000, dtype='float')
s = pd.Series(data=a)  # keep the reference to a
for i in range(len(a)):
    a[i] = 1.0  # plain NumPy write; s sees it when the data is shared

When initializing a Series from a NumPy array, the data is not copied, so the Series and the array share memory. If you keep a reference to the original array, you can just set values on that!
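
Whether the Series really shares memory with the array depends on the pandas version and settings (with Copy-on-Write enabled, the constructor copies NumPy arrays by default, as far as I know), so it is safer to check than to assume. Here is a minimal sketch of such a check using `np.shares_memory`; the fallback branch is my own addition for the case where a copy was made.

```python
import numpy as np
import pandas as pd

a = np.zeros(1000, dtype=float)
s = pd.Series(a)  # may or may not copy, depending on the pandas version

# True if the Series is backed by the same buffer as a.
shared = np.shares_memory(a, s.to_numpy())

# Fast path: write to the NumPy array directly.
a[:] = 1.0

if not shared:
    # The Series got its own copy, so the writes above did not reach it.
    # Rebuild the Series from the filled array instead.
    s = pd.Series(a)

print("shares memory with the array:", shared)
```

Either way, the values are written through NumPy and the per-item indexing overhead of the Series is avoided.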
