Python: Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow

Disclaimer: this page is adapted from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/35687678/


Using a pre-trained word embedding (word2vec or Glove) in TensorFlow

python, numpy, tensorflow, deep-learning

Asked by user3147590

I've recently reviewed an interesting implementation of convolutional text classification. However, all of the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors, like the following:


with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?


Answered by mrry

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().


  1. Simply create W as a tf.constant() that takes embedding as its value:

    W = tf.constant(embedding, name="W")
    

    This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

  2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    
    # ...
    sess = tf.Session()
    
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
    

    This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

  3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

    W = tf.Variable(...)
    
    embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
    
    # ...
    sess = tf.Session()
    embedding_saver.restore(sess, "checkpoint_filename.ckpt")
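
All three options above assume that you already have the pre-trained vectors in a NumPy array called embedding. As a minimal sketch, here is one way to build such an array from a GloVe text file; the file name glove.6B.100d.txt, the vocab list, and the load_glove_matrix helper are hypothetical, and words missing from the GloVe file simply keep a random row:

    import numpy as np

    def load_glove_matrix(glove_path, vocab, embedding_dim=100):
        """Build a [vocab_size, embedding_dim] matrix from a GloVe text file."""
        # Start from small random values so out-of-vocabulary words still get a vector.
        embedding = np.random.uniform(-0.25, 0.25,
                                      (len(vocab), embedding_dim)).astype(np.float32)
        word_to_index = {word: i for i, word in enumerate(vocab)}
        with open(glove_path, encoding="utf-8") as f:
            for line in f:
                # Each GloVe line is "word v1 v2 ... vN", space-separated.
                parts = line.rstrip().split(" ")
                word, vector = parts[0], parts[1:]
                if word in word_to_index:
                    embedding[word_to_index[word]] = np.asarray(vector, dtype=np.float32)
        return embedding

    # e.g. embedding = load_glove_matrix("glove.6B.100d.txt", vocab)
    # and then feed it through embedding_placeholder as in option 2.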
    

Answered by LiuJia

I use this method to load and share the embedding:


W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
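
A minimal usage sketch of this pattern (the scope name "embedding_scope", the input_ids placeholder, and the embedding.npy file are hypothetical); reusing the same variable scope is what lets several parts of the graph share one copy of W:

    import numpy as np
    import tensorflow as tf

    embedding = np.load("embedding.npy")  # hypothetical pre-trained matrix

    def embed(input_ids, reuse=False):
        # The same variable scope (with reuse=True) returns the same W,
        # so every caller shares a single copy of the pre-trained matrix.
        with tf.variable_scope("embedding_scope", reuse=reuse):
            W = tf.get_variable(name="W", shape=embedding.shape,
                                initializer=tf.constant_initializer(embedding),
                                trainable=False)
        return tf.nn.embedding_lookup(W, input_ids)

    input_ids = tf.placeholder(tf.int32, shape=[None, None])
    encoder_inputs = embed(input_ids)              # creates W
    decoder_inputs = embed(input_ids, reuse=True)  # reuses the same W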

Answered by Eugenio Martínez Cámara

@mrry's answer is not right, because it provokes the overwriting of the embedding weights each time the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, from my point of view, the right way to use pre-trained embeddings is:


embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix)))
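
With this pattern the constant_initializer is only consulted when the variable is initialized (for example by tf.global_variables_initializer()); later training steps read or update the variable itself, so the pre-trained matrix is not re-fed into the graph on every batch. Passing trainable=False in addition would keep the vectors fixed during training.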

Answered by Tensorflow Support

2.0 Compatible Answer: There are many pre-trained embeddings developed by Google that have been open-sourced.


Some of them are Universal Sentence Encoder (USE), ELMo, BERT, etc., and it is very easy to reuse them in your code.


Code to reuse the pre-trained embedding, Universal Sentence Encoder, is shown below:


  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  # (3, 512)
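
The hub.KerasLayer can also be dropped straight into a Keras model; here is a minimal sketch of a binary text classifier built on top of it (the dense layer sizes and the sigmoid head are illustrative assumptions, not part of the original answer):

    import tensorflow as tf
    import tensorflow_hub as hub

    # USE maps each input string to a 512-dimensional embedding, so it can
    # serve as the first layer of a Keras model that consumes raw strings.
    use_layer = hub.KerasLayer(
        "https://tfhub.dev/google/universal-sentence-encoder/4",
        input_shape=[],        # one plain string per example
        dtype=tf.string,
        trainable=False)       # keep the pre-trained weights fixed

    model = tf.keras.Sequential([
        use_layer,
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. binary sentiment
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])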

For more information on the pre-trained embeddings developed and open-sourced by Google, refer to the TF Hub Link.


Answered by Fei Yan

With TensorFlow version 2 it is quite easy if you use the Embedding layer:


X = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=300,
                              input_length=Length_of_input_sequences,
                              # wrap the weight matrix in a Constant initializer
                              embeddings_initializer=tf.keras.initializers.Constant(
                                  matrix_of_pretrained_weights)
                              )(ur_inp)
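
If you also want the pre-trained vectors to stay fixed during training, you can pass trainable=False to the Embedding layer; wrapping the weight matrix in tf.keras.initializers.Constant(...), as above, is what lets Keras accept a plain NumPy array as the initializer.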

Answered by Aaditya Ura

I was also facing an embedding issue, so I wrote a detailed tutorial with a dataset. Here I would like to add what I tried; you can also try this method:


import numpy as np
import tensorflow as tf

tf.reset_default_graph()

input_x = tf.placeholder(tf.int32, shape=[None, None])

# Edit the shape below according to your vocabulary size and embedding dimension;
# word_embedding is the pre-trained NumPy matrix and final_ is your dataset of
# index sequences (both defined elsewhere in the tutorial).
Word_embedding = tf.get_variable(name="W", shape=[400000, 100],
                                 initializer=tf.constant_initializer(np.array(word_embedding)),
                                 trainable=False)
embedding_lookup = tf.nn.embedding_lookup(Word_embedding, input_x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for ii in final_:
        print(sess.run(embedding_lookup, feed_dict={input_x: [ii]}))

Here is a working, detailed Tutorial Ipython example if you want to understand it from scratch; take a look.
