Python numpy: calculate the derivative of the softmax function
Original URL: http://stackoverflow.com/questions/40575841/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
numpy : calculate the derivative of the softmax function
Asked by Sam Hammamy
I am trying to understand backpropagation in a simple 3-layered neural network with MNIST.
There is the input layer with weights and a bias. The labels are MNIST, so it's a 10-class vector.
The second layer is a linear transform. The third layer is the softmax activation to get the output as probabilities.
Backpropagation calculates the derivative at each step and calls this the gradient.
Previous layers append the global or previous gradient to the local gradient. I am having trouble calculating the local gradient of the softmax.
Several resources online go through the explanation of the softmax and its derivatives, and even give code samples of the softmax itself:
import numpy as np

def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)
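As a quick usage check (this snippet is my own illustration, not from the original post), the outputs behave like probabilities:

x = np.array([1.0, 2.0, 3.0])
p = softmax(x)
print(p)         # roughly [0.09003057 0.24472847 0.66524096]
print(p.sum())   # 1.0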
The derivative is explained with respect to when i = j and when i != j. This is a simple code snippet I've come up with and was hoping to verify my understanding:
def softmax(self, x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

def forward(self):
    # self.input is a vector of length 10
    # and is the output of
    # (w * x) + b
    self.value = self.softmax(self.input)

def backward(self):
    for i in range(len(self.value)):
        for j in range(len(self.input)):
            if i == j:
                self.gradient[i] = self.value[i] * (1-self.input[i])
            else:
                self.gradient[i] = -self.value[i]*self.input[j]
Then self.gradient is the local gradient, which is a vector. Is this correct? Is there a better way to write this?
Answered by Wasi Ahmad
I am assuming you have a 3-layer NN where W1, b1 are associated with the linear transformation from the input layer to the hidden layer, and W2, b2 are associated with the linear transformation from the hidden layer to the output layer. Z1 and Z2 are the input vectors to the hidden layer and output layer. a1 and a2 represent the outputs of the hidden layer and output layer. a2 is your predicted output. delta3 and delta2 are the (backpropagated) errors, and you can see the gradients of the loss function with respect to the model parameters.
This is a general scenario for a 3-layer NN (input layer, only one hidden layer and one output layer). You can follow the procedure described above to compute the gradients, which should be easy to compute! Since another answer to this post already pointed to the problem in your code, I am not repeating the same.
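The gradient equations this answer refers to are not reproduced on this page, so here is a minimal sketch for illustration (my own code, not part of the original answer). It assumes a cross-entropy loss on the softmax output and a tanh hidden activation, and follows the answer's naming (W1, b1, W2, b2, Z1, Z2, a1, a2, delta3, delta2). Note how, with this loss, the softmax Jacobian and the loss derivative combine into the simple delta3 = a2 - y term:

import numpy as np

def softmax(z):
    exps = np.exp(z - z.max())
    return exps / exps.sum()

def forward_backward(x, y, W1, b1, W2, b2):
    # forward pass
    Z1 = W1 @ x + b1           # input to the hidden layer
    a1 = np.tanh(Z1)           # hidden layer output (assumed activation)
    Z2 = W2 @ a1 + b2          # input to the output layer
    a2 = softmax(Z2)           # predicted probabilities

    # backward pass for the cross-entropy loss L = -sum(y * log(a2))
    delta3 = a2 - y                            # error at the output layer
    dW2 = np.outer(delta3, a1)                 # dL/dW2
    db2 = delta3                               # dL/db2
    delta2 = (W2.T @ delta3) * (1 - a1**2)     # error backpropagated to the hidden layer
    dW1 = np.outer(delta2, x)                  # dL/dW1
    db1 = delta2                               # dL/db1
    return dW1, db1, dW2, db2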
Answered by Julien
As I said, you have n^2 partial derivatives.
If you do the math, you find that dSM[i]/dx[k] is SM[i] * (dx[i]/dx[k] - SM[k]), so you should have:
if i == j:
    self.gradient[i,j] = self.value[i] * (1-self.value[i])
else:
    self.gradient[i,j] = -self.value[i] * self.value[j]
instead of
if i == j:
    self.gradient[i] = self.value[i] * (1-self.input[i])
else:
    self.gradient[i] = -self.value[i]*self.input[j]
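For reference, a short derivation of that formula (a sketch added here, not part of the original answer; the dx[i]/dx[k] term is the Kronecker delta \delta_{ik}): writing S_i = e^{x_i} / \sum_l e^{x_l}, the quotient rule gives

\frac{\partial S_i}{\partial x_k}
    = \frac{\delta_{ik} e^{x_i} \sum_l e^{x_l} - e^{x_i} e^{x_k}}{\left( \sum_l e^{x_l} \right)^2}
    = S_i (\delta_{ik} - S_k),

which reduces to S_i (1 - S_i) when i = k and to -S_i S_k otherwise, matching the two branches above.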
By the way, this may be computed more concisely like so (vectorized):
SM = self.value.reshape((-1,1))
jac = np.diagflat(self.value) - np.dot(SM, SM.T)
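As a quick sanity check (my own snippet, not from the answer), the vectorized Jacobian matches the explicit double loop, and its rows sum to zero because the softmax outputs always sum to 1:

import numpy as np

value = softmax(np.array([1.0, 2.0, 3.0]))   # assumes the softmax() from the first snippet in the question

# vectorized Jacobian, as above
SM = value.reshape((-1, 1))
jac = np.diagflat(value) - np.dot(SM, SM.T)

# explicit double loop for comparison
n = len(value)
jac_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            jac_loop[i, j] = value[i] * (1 - value[i])
        else:
            jac_loop[i, j] = -value[i] * value[j]

print(np.allclose(jac, jac_loop))   # True
print(jac.sum(axis=1))              # each row sums to ~0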
Answered by Haesun Park
np.exp is not stable because it can overflow to Inf, so you should subtract the maximum of x.
def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x - x.max())
    return exps / np.sum(exps)
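To see the effect (my own example, not from the answer), try a vector with large entries; the naive version overflows while the shifted one stays finite:

import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# naive softmax: np.exp(1000) overflows to inf, and inf/inf gives nan
naive = np.exp(x) / np.sum(np.exp(x))   # RuntimeWarning: overflow; result is all nan

# shifted softmax: the largest exponent becomes 0, so nothing overflows
exps = np.exp(x - x.max())
stable = exps / np.sum(exps)            # roughly [0.09003057 0.24472847 0.66524096]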
If x is a matrix, please check the softmax function in this notebook.