Python: How to add an attention mechanism in Keras?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42918446/
How to add an attention mechanism in keras?
Asked by Aryo Pradipta Gema
I'm currently using this code that I got from one discussion on GitHub. Here's the code of the attention mechanism:
# Imports assumed for this Keras 1.x-era snippet (the functional merge with
# mode='mul' was removed in Keras 2, where multiply is used instead):
from keras.layers import Input, Embedding, LSTM, Dense, Flatten, Activation
from keras.layers import RepeatVector, Permute, Lambda, merge
from keras import backend as K

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_size,
    input_length=max_length,
    trainable=False,
    mask_zero=False
)(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# weight the hidden states by the attention and collapse the time dimension
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)
Is this the correct way to do it? I was sort of expecting the existence of a TimeDistributed layer, since the attention mechanism is distributed over every time step of the RNN. I need someone to confirm that this implementation (the code) is a correct implementation of the attention mechanism. Thank you.
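On the TimeDistributed point: in Keras 2 and later, a Dense layer applied to a 3D tensor acts on the last axis independently at every time step, so Dense(1) above already produces one score per step; wrapping it in TimeDistributed would compute the same kind of per-timestep score. A minimal sketch (it reuses max_length, embedding_size and units from the code above, which are assumed to be defined):

from keras.layers import Input, LSTM, Dense, TimeDistributed

seq = Input(shape=(max_length, embedding_size))                   # a sequence of embeddings
states = LSTM(units, return_sequences=True)(seq)                  # (batch, max_length, units)

# These two layers perform the same per-timestep scoring in Keras 2
# (they are separate layers, so their learned weights differ):
scores_a = Dense(1, activation='tanh')(states)                    # (batch, max_length, 1)
scores_b = TimeDistributed(Dense(1, activation='tanh'))(states)   # (batch, max_length, 1)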
Accepted answer by Philippe Remy
If you want to have attention along the time dimension, then this part of your code seems correct to me:
activations = LSTM(units, return_sequences=True)(embedded)         # (batch, max_length, units)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)               # (batch, max_length, 1)
attention = Flatten()(attention)                                    # (batch, max_length)
attention = Activation('softmax')(attention)                        # weights sum to 1 over the time steps
attention = RepeatVector(units)(attention)                          # (batch, units, max_length)
attention = Permute([2, 1])(attention)                              # (batch, max_length, units)
sent_representation = merge([activations, attention], mode='mul')   # (batch, max_length, units)
You've worked out the attention vector of shape (batch_size, max_length):

attention = Activation('softmax')(attention)
I've never seen this code before, so I can't say if this one is actually correct or not:

K.sum(xin, axis=-2)
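For what it's worth, that line just collapses the time dimension: after the element-wise multiplication, summing over axis=-2 yields the attention-weighted sum of the LSTM hidden states. A small NumPy sketch of the same computation (the array names and sizes are made up for illustration):

import numpy as np

batch, max_length, units = 2, 5, 4
activations = np.random.rand(batch, max_length, units)     # LSTM hidden states
weights = np.random.rand(batch, max_length)                # attention weights
weights /= weights.sum(axis=1, keepdims=True)              # normalize, like the softmax

weighted = activations * weights[:, :, None]               # what merge(..., mode='mul') produces
sent_representation = weighted.sum(axis=-2)                # what K.sum(xin, axis=-2) computes
print(sent_representation.shape)                           # (2, 4), i.e. (batch, units)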
Further reading (you might have a look):
Answer by MJeremy
The attention mechanism pays attention to different parts of the sentence:

activations = LSTM(units, return_sequences=True)(embedded)
And it determines the contribution of each hidden state of that sentence by

- Computing the aggregation of each hidden state:
attention = Dense(1, activation='tanh')(activations)
- Assigning weights to the different states:
attention = Activation('softmax')(attention)
And finally it pays attention to the different states:

sent_representation = merge([activations, attention], mode='mul')
I don't quite understand this part: sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
To understand more, you can refer to this and this, and also this one gives a good implementation; see if you can understand more on your own.
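Putting those steps together, here is a minimal sketch of the same attention-over-time block written against the Keras 2 functional API, where multiply replaces the deprecated merge (vocab_size, embedding_size, max_length and units are assumed to be defined as in the question):

from keras.layers import Input, Embedding, LSTM, Dense, Flatten, Activation
from keras.layers import RepeatVector, Permute, multiply, Lambda
from keras.models import Model
from keras import backend as K

_input = Input(shape=(max_length,), dtype='int32')
embedded = Embedding(vocab_size, embedding_size, input_length=max_length)(_input)
activations = LSTM(units, return_sequences=True)(embedded)          # (batch, max_length, units)

# one score per time step, turned into weights that sum to 1 over time
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)                              # (batch, max_length, units)

# weight the hidden states and sum over the time axis -> (batch, units)
sent_representation = multiply([activations, attention])
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)
model = Model(inputs=_input, outputs=probabilities)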
Answer by Abhijay Ghildyal
Recently I was working on applying an attention mechanism on a dense layer, and here is one sample implementation:
# Imports assumed for this Keras 2-style snippet; train_data_X and train_data_Y_
# are the author's own data and are not defined here.
from keras.layers import Input, Dense, multiply
from keras.models import Model
from keras import regularizers

def build_model():
    input_dims = train_data_X.shape[1]
    inputs = Input(shape=(input_dims,))
    dense1800 = Dense(1800, activation='relu', kernel_regularizer=regularizers.l2(0.01))(inputs)
    # attention weights over the 1800 dense features
    attention_probs = Dense(1800, activation='sigmoid', name='attention_probs')(dense1800)
    attention_mul = multiply([dense1800, attention_probs], name='attention_mul')
    dense7 = Dense(7, kernel_regularizer=regularizers.l2(0.01), activation='softmax')(attention_mul)
    model = Model(inputs=inputs, outputs=dense7)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
model.summary()
model.fit(train_data_X, train_data_Y_, epochs=20, validation_split=0.2, batch_size=600, shuffle=True, verbose=1)