Difference in performance between NumPy and Matlab
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/18516605/
Asked by pabaldonedo
I am computing the backpropagation algorithm for a sparse autoencoder. I have implemented it in Python using NumPy and in Matlab. The code is almost the same, but the performance is very different: Matlab completes the task in 0.252454 seconds while NumPy takes 0.973672151566, almost four times as long. I will call this code several times later in a minimization problem, so this difference leads to several minutes of delay between the implementations. Is this normal behaviour? How could I improve the performance in NumPy?
NumPy implementation:
sparse.rho is a tuning parameter; sparse.nodes is the number of nodes in the hidden layer (25); sparse.input (64) is the number of nodes in the input layer; theta1 and theta2 are the weight matrices for the first and second layers, with dimensions 25x64 and 64x25 respectively; m equals 10000; rhoest has shape (25,); x has shape 10000x64; a3 is 10000x64; and a2 is 10000x25.
UPDATE: I have introduced changes in the code following some of the ideas in the responses. The performance is now NumPy: 0.65 s vs Matlab: 0.25 s.
import time
import numpy as np

partial_j1 = np.zeros(sparse.theta1.shape)
partial_j2 = np.zeros(sparse.theta2.shape)
partial_b1 = np.zeros(sparse.b1.shape)
partial_b2 = np.zeros(sparse.b2.shape)
t = time.time()

delta3t = (-(x - a3) * a3 * (1 - a3)).T

# sum2 is loop-invariant; it is computed once before this loop (not shown)
for i in range(m):
    delta3 = delta3t[:, i:(i + 1)]
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
    partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
    partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
    partial_b1 += delta2
    partial_b2 += delta3

print "Backprop time:", time.time() - t
Matlab implementation:
tic
for i = 1:m
    delta3 = -(data(i,:)-a3(i,:)).*a3(i,:).*(1 - a3(i,:));
    delta3 = delta3.';
    sum1 = W2.'*delta3;
    sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest) );
    delta2 = ( sum1 + sum2 ) .* a2(i,:).' .* (1 - a2(i,:).');
    W1grad = W1grad + delta2* a1(i,:);
    W2grad = W2grad + delta3* a2(i,:);
    b1grad = b1grad + delta2;
    b2grad = b2grad + delta3;
end
toc
Accepted answer by unutbu
It would be wrong to say "Matlab is always faster than NumPy" or vice versa. Often their performance is comparable. When using NumPy, to get good performance you have to keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on smaller arrays or scalars in a Python loop.
What's wrong with a Python loop, you ask? Every iteration through the Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as in np.dot) involves function calls. Those function calls add up to a significant hindrance to speed. These hooks give Python expressive power -- indexing a string means something different than indexing a dict, for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.
But that expressive power comes at a cost in speed. So when you don't need all that dynamic expressivity, to get better performance, try to limit yourself to NumPy function calls on whole arrays.
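To make the overhead concrete, here is a small timing sketch of my own (not from the original answer): summing an array element by element in a Python loop versus with a single whole-array call:

import timeit

setup = "import numpy as np; a = np.random.random(10 ** 5)"

# Per-element loop: every iteration pays for next, __getitem__ and __iadd__
loop_stmt = """
total = 0.0
for i in range(len(a)):
    total += a[i]
"""

# One whole-array call: the loop runs in compiled code
vec_stmt = "total = a.sum()"

print(min(timeit.repeat(loop_stmt, setup, number=10)))
print(min(timeit.repeat(vec_stmt, setup, number=10)))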
So, remove the for-loop; use "vectorized" equations when possible. For example, instead of
for i in range(m):
    delta3 = -(x[i,:]-a3[i,:])*a3[i,:]* (1 - a3[i,:])
you can compute delta3 for every i all at once:
delta3 = -(x-a3)*a3*(1-a3)
Whereas in the for-loop delta3 is a vector, with the vectorized equation delta3 is a matrix.
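A quick shape check of my own (using the dimensions stated in the question) makes that difference visible:

import numpy as np

m, n = 10000, 64
x = np.random.random((m, n))
a3 = np.random.random((m, n))

# One row at a time, as inside the loop: shape (64,)
delta3_row = -(x[0,:] - a3[0,:]) * a3[0,:] * (1 - a3[0,:])
print(delta3_row.shape)   # (64,)

# All rows at once: shape (10000, 64)
delta3 = -(x - a3) * a3 * (1 - a3)
print(delta3.shape)       # (10000, 64)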
Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:
sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) )
Here is a runnable example with an alternative implementation (alt) of your code (orig).
My timeit benchmark shows a 6.8x improvement in speed:
In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop
In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop
import numpy as np

class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

m, n, p = 10 ** 4, 64, 25
sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)
x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]

def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
        # delta3: (64, 1)
        # sum1: (25, 1)
        # delta2: (25, 1)
        # a1[i:(i+1),:]: (1, 64)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2

def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise
Tip: Notice that I left the shape of all the intermediate arrays in the comments. Knowing the shape of the arrays helped me understand what your code was doing. The shape of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know whether an operation is sensible. For example, when you compute np.dot(A, B) with A.shape = (n, m) and B.shape = (m, p), the result will be an array of shape (n, p).
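A minimal check of that shape rule (my own sketch):

import numpy as np

A = np.ones((3, 4))    # (n, m) with n=3, m=4
B = np.ones((4, 5))    # (m, p) with m=4, p=5

print(np.dot(A, B).shape)   # (3, 5), i.e. (n, p)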
It can help to build the arrays in C_CONTIGUOUS order (at least, if you are using np.dot). There might be as much as a 3x speed-up by doing so:
Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS -- and the same relationship holds for y and yf.
import numpy as np
m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')
y = np.random.random((m, n))
yf = np.asarray(y, order='F')
assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))
%timeit benchmarks show the difference in speed:
In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop
In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop
In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop
In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop
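If you need to inspect or change an array's memory layout, NumPy provides flags and conversion helpers; a brief sketch:

import numpy as np

x = np.random.random((64, 10000))
xf = np.asfortranarray(x)           # F_CONTIGUOUS copy

print(x.flags['C_CONTIGUOUS'])      # True
print(xf.flags['F_CONTIGUOUS'])     # True

xc = np.ascontiguousarray(xf)       # convert back to C order
print(xc.flags['C_CONTIGUOUS'])     # True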
Regarding benchmarking in Python:
It can be misleading to use the difference between pairs of time.time() calls to benchmark the speed of code in Python. You need to repeat the measurement many times. It's better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds' worth of repetitions) to avoid errors due to poor resolution in the clock timer and to reduce the significance of time.time call overhead. Instead of writing all that code yourself, Python provides you with the timeit module. I'm essentially using that to time the pieces of code, except that I'm calling it through an IPython terminal for convenience.
I'm not sure if this is affecting your benchmarks, but be aware it could make a difference. In the question I linked to, according to time.time two pieces of code differed by a factor of 1.7x, while benchmarks using timeit showed the pieces of code ran in essentially identical amounts of time.
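For reference, the basic pattern looks like this (a sketch of my own; the timed statement is a stand-in for the real code). The timeit module repeats the statement for you and disables the garbage collector while timing:

import timeit

times = timeit.repeat(
    stmt="np.dot(a, b)",
    setup="import numpy as np; "
          "a = np.random.random((64, 10000)); "
          "b = np.random.random((10000, 25))",
    repeat=3,
    number=100,
)
print(min(times) / 100)   # best per-call time, in seconds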
Answer by user2304916
I would start with in-place operations to avoid allocating new arrays every time:
partial_j1 += np.dot(delta2, a1[i,:].reshape(1,a1.shape[1]))
partial_j2 += np.dot(delta3, a2[i,:].reshape(1,a2.shape[1]))
partial_b1 += delta2
partial_b2 += delta3
You can replace this expression:
a1[i,:].reshape(1,a1.shape[1])
with a simpler and faster one (thanks to Bi Rico):
a1[i:i+1]
Also, this line:
sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest))
seems to be the same at every iteration of the loop, so you don't need to recompute it.
Also, as a probably minor optimization, you can replace all occurrences of x[i,:] with x[i].
Finally, if you can afford to allocate m times more memory, you can follow unutbu's suggestion and vectorize the loop, replacing:
for i in range(m):
    delta3 = -(x[i]-a3[i])*a3[i]* (1 - a3[i])
with:
delta3 = -(x-a3)*a3*(1-a3)
And you can always use Numba to gain speed significantly without vectorizing (and without using more memory).
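A minimal sketch of that idea, applied to the delta3 computation from above (my own example, assuming Numba is installed; the function name is made up). The explicit loop remains, but @njit compiles it to machine code, and no chain of temporary arrays is created along the way:

import numpy as np
from numba import njit

@njit
def delta3_loop(x, a3):
    m, n = x.shape
    delta3 = np.empty((m, n))
    # Plain element-wise loops: slow in pure Python, fast once compiled
    for i in range(m):
        for j in range(n):
            delta3[i, j] = -(x[i, j] - a3[i, j]) * a3[i, j] * (1.0 - a3[i, j])
    return delta3

x = np.random.random((10000, 64))
a3 = np.random.random((10000, 64))
d = delta3_loop(x, a3)   # the first call compiles; subsequent calls run at C speed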
Answer by Philliproso
Differences in performance between numpy and matlab have always frustrated me. In the end they often boil down to the underlying LAPACK libraries. As far as I know, matlab uses the full ATLAS LAPACK by default, while numpy uses a LAPACK light. Matlab reckons people don't care about space and bulk, while numpy reckons people do. Similar question with a good answer.
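To see which BLAS/LAPACK implementation your own NumPy build is linked against, you can run (a standard diagnostic, not part of the original answer):

import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was compiled against
np.show_config()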