pandas: what is the fastest way to iterate through a numpy array

Question

Asked by piRSquared

I noticed a meaningful difference between iterating through a numpy array "directly" and iterating through it via the tolist method. See the timing below:

directly
[i for i in np.arange(10000000)]
via tolist
[i for i in np.arange(10000000).tolist()]

Since I've already discovered one way to go faster, I wanted to ask what else might speed it up.

What is the fastest way to iterate through a numpy array?

Answer 1

Accepted answer by hpaulj

These are my timings on a slower machine:

In [1034]: timeit [i for i in np.arange(10000000)]
1 loop, best of 3: 2.16 s per loop

If I generate the range directly (in Py3 this is a lazy range object, not a list), times are much better. Take this as a baseline for a list comprehension of this size.

In [1035]: timeit [i for i in range(10000000)]
1 loop, best of 3: 1.26 s per loop

tolist converts the arange to a list first. That takes a bit longer, but the iteration is still over a list.

In [1036]: timeit [i for i in np.arange(10000000).tolist()]
1 loop, best of 3: 1.6 s per loop

Using list() takes the same time as direct iteration on the array, which suggests that direct iteration does this first.

In [1037]: timeit [i for i in list(np.arange(10000000))]
1 loop, best of 3: 2.18 s per loop

In [1038]: timeit np.arange(10000000).tolist()
1 loop, best of 3: 927 ms per loop

list() on its own takes about the same time as iterating on the .tolist() result:

In [1039]: timeit list(np.arange(10000000))
1 loop, best of 3: 1.55 s per loop

In general, if you must loop, working on a list is faster. Access to the elements of a list is simpler.

Look at the elements returned by indexing.

a[0] is another numpy object; it is constructed from the values in a, but it is not simply a fetched value.

list(a)[0] is the same type; the list is just [a[0], a[1], a[2]].

In [1043]: a = np.arange(3)
In [1044]: type(a[0])
Out[1044]: numpy.int32
In [1045]: ll=list(a)
In [1046]: type(ll[0])
Out[1046]: numpy.int32

But tolist converts the array into a pure list, in this case a list of ints. It does more work than list(), but does it in compiled code.

In [1047]: ll=a.tolist()
In [1048]: type(ll[0])
Out[1048]: int

In general, don't use list(anarray). It rarely does anything useful, and it is not as powerful as tolist().
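One way to see why tolist() is the more powerful of the two is to try both on a 2-D array. The following is a minimal sketch; the array a and the variable names are made up for illustration and are not part of the original answer.

import numpy as np

a = np.arange(6).reshape(2, 3)

as_list = list(a)        # iterates over the first axis only
as_tolist = a.tolist()   # converts recursively, down to Python scalars

print(type(as_list[0]))       # <class 'numpy.ndarray'>
print(type(as_tolist[0]))     # <class 'list'>
print(type(as_tolist[0][0]))  # <class 'int'>

list(a) leaves each row as an array, so any later element access still goes through numpy, whereas tolist() returns nested plain lists.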

What's the fastest way to iterate through an array? None, at least not in Python; in C code there are fast ways.

a.tolist() is the fastest, vectorized way of creating a list of integers from an array. It iterates, but does so in compiled code.

But what is your real goal?
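To reproduce the comparison on your own machine, something like the following rough sketch could be used. It relies on the standard timeit module instead of the IPython timeit magic shown above, uses a smaller array so it finishes quickly, and builds the array once outside the timed code (the timings above include the np.arange call). The names n, arr, candidates and timings are illustrative only.

import timeit
import numpy as np

n = 100_000  # much smaller than the 10,000,000 used above
arr = np.arange(n)

candidates = {
    "direct": lambda: [i for i in arr],
    "list()": lambda: [i for i in list(arr)],
    "tolist()": lambda: [i for i in arr.tolist()],
}

timings = {}
for name, fn in candidates.items():
    # best of 3 repeats, 5 calls per repeat, reported per call
    timings[name] = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{name:>9}: {timings[name] * 1e3:.2f} ms per loop")

Absolute numbers will differ by machine and numpy version; only the relative ordering is of interest.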

Answer 2

Answer by James

This is actually not surprising. Let's examine the methods one at a time, starting with the slowest.

[i for i in np.arange(10000000)]

This method asks Python to reach into the numpy array (stored in memory managed by numpy's C code), one element at a time, allocate a Python object in memory, and create a pointer to that object in the list. Each time you cross from the numpy array stored in the C backend into pure Python, there is an overhead cost. This method pays that cost 10,000,000 times.

[i for i in np.arange(10000000).tolist()]

In this case, .tolist() makes a single call to the numpy C backend and allocates all of the elements into a list in one shot. You are then using Python to iterate over that list.

Finally:

list(np.arange(10000000))

This basically does the same thing as above, but it creates a list of numpy's native type objects (e.g. np.int64). list(np.arange(10000000)) and np.arange(10000000).tolist() should take about the same time.

So, in terms of iteration, the primary advantage of using numpy is that you don't need to iterate. Operations are applied in a vectorized fashion over the array. Iteration just slows it down. If you find yourself iterating over array elements, you should look for a way to restructure the algorithm so that it uses only numpy operations (it has so many built in!). If it is really necessary, you can use np.apply_along_axis, np.apply_over_axes, or np.vectorize.
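As a small illustration of restructuring a loop into numpy operations, here is a sketch under made-up assumptions: the array arr and the scalar function clip_or_square are hypothetical and only serve as an example. Note that np.vectorize is a convenience wrapper that still calls the Python function once per element, so it should not be expected to run faster than a loop.

import numpy as np

arr = np.arange(10)

# Explicit Python loop: one numpy scalar is created per element.
looped = np.array([x * x + 1 for x in arr])

# Vectorized: the same arithmetic applied to the whole array in compiled code.
vectorized = arr * arr + 1

# Hypothetical scalar function with a branch, wrapped with np.vectorize.
def clip_or_square(x):
    return x * x if x < 5 else 25

wrapped = np.vectorize(clip_or_square)(arr)

# The same branch expressed as a whole-array operation.
with_where = np.where(arr < 5, arr * arr, 25)

In both pairs the results are identical; the second form of each pair avoids the per-element round trip into Python.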

Answer 3

Answer by Santhosh

My test case uses this numpy array:

[[  34  107]
 [ 963  144]
 [ 921 1187]
 [   0 1149]]

I go through it only once, first using range and then using enumerate.

USING range

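from timeit import default_timer  # assumed import; the original snippet omits it
import numpy as np
box = np.array([[34, 107], [963, 144], [921, 1187], [0, 1149]])  # the array shown above
loopTimer1 = default_timer()
for l1 in range(0,4):
    print(box[l1])
print("Time taken by range: ",default_timer()-loopTimer1)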

Result

[ 34 107]
[963 144]
[ 921 1187]
[   0 1149]
Time taken by range:  0.0005405639985838206

USING enumerate

loopTimer2 = default_timer()
for l2,v2 in enumerate(box):
    print(box[l2])
print("Time taken by enumerate: ", default_timer() - loopTimer2)

Result

[ 34 107]
[963 144]
[ 921 1187]
[   0 1149]
Time taken by enumerate:  0.00025605700102460105

In this test case, enumerate ran faster.
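A single pass over four rows that also calls print() is likely dominated by I/O and timer noise, so a repeated measurement without printing may give a steadier comparison. The following is a rough sketch using the standard timeit module; the helper names by_range, by_enumerate and by_rows are made up for illustration.

import timeit
import numpy as np

box = np.array([[34, 107], [963, 144], [921, 1187], [0, 1149]])

def by_range():
    # index each row explicitly
    return [box[i] for i in range(len(box))]

def by_enumerate():
    # use the row that enumerate already yields
    return [row for _, row in enumerate(box)]

def by_rows():
    # iterate over the rows directly
    return [row for row in box]

results = {}
for fn in (by_range, by_enumerate, by_rows):
    # best of 5 repeats, 10,000 calls per repeat, reported per call
    best = min(timeit.repeat(fn, number=10_000, repeat=5)) / 10_000
    results[fn.__name__] = best
    print(f"{fn.__name__:>12}: {best * 1e6:.2f} us per loop")

All three helpers return the same rows, so any difference in the printed numbers comes from the looping style alone.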
