What does tf.nn.embedding_lookup function do?

Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34870614/

Tags: python, tensorflow, deep-learning, word-embedding, natural-language-processing

Asked by Poorya Pzm

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)

I cannot understand what this function does. Is it like a lookup table, i.e. does it return the parameters corresponding to each id (in ids)?

For instance, in the skip-gram model, if we use tf.nn.embedding_lookup(embeddings, train_inputs), does it find the corresponding embedding for each train_input?

Accepted answer by Rafał Józefowicz

The embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy, e.g.

import numpy as np

matrix = np.random.random([1024, 64])  # 1024 rows of 64-dimensional embeddings
ids = np.array([0, 5, 17, 33])
print(matrix[ids])  # prints a matrix of shape [4, 64]

The params argument can also be a list of tensors, in which case the ids will be distributed among the tensors. For example, given a list of 3 tensors of shape [2, 64], the default behavior is that they will represent ids [0, 3], [1, 4], and [2, 5] respectively.

partition_strategy controls the way the ids are distributed among the list. The partitioning is useful for larger-scale problems, when the matrix might be too large to keep in one piece.

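For illustration, here is a rough numpy sketch of the 'mod' mapping described above (plain numpy, not how TensorFlow implements it internally):

import numpy as np

# a minimal sketch of the 'mod' partition strategy, using plain numpy
shards = [np.random.random([2, 64]) for _ in range(3)]  # 3 tensors of shape [2, 64]

def mod_lookup(shards, ids):
    n = len(shards)
    # id i lives in shard (i % n), at row (i // n)
    return np.stack([shards[i % n][i // n] for i in ids])

print(mod_lookup(shards, [0, 3, 1, 4]).shape)  # (4, 64)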

Answered by Asher Stern

Yes, this function is hard to understand, until you get the point.

In its simplest form, it is similar to tf.gather. It returns the elements of params according to the indexes specified by ids.

For example (assuming you are inside tf.InteractiveSession())

params = tf.constant([10, 20, 30, 40])
ids = tf.constant([0, 1, 2, 3])
print(tf.nn.embedding_lookup(params, ids).eval())

would return [10 20 30 40], because the first element (index 0) of params is 10, the second element of params (index 1) is 20, etc.

Similarly,

params = tf.constant([10, 20, 30, 40])
ids = tf.constant([1, 1, 3])
print(tf.nn.embedding_lookup(params, ids).eval())

would return [20 20 40].

But embedding_lookup is more than that. The params argument can be a list of tensors, rather than a single tensor.

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

In such a case, the indexes specified in ids correspond to elements of the tensors according to a partition strategy, where the default partition strategy is 'mod'.

In the 'mod' strategy, index 0 corresponds to the first element of the first tensor in the list. Index 1 corresponds to the first element of the second tensor. Index 2 corresponds to the first element of the third tensor, and so on. Simply put, index i corresponds to the first element of the (i+1)th tensor, for all the indexes 0..(n-1), assuming params is a list of n tensors.

Now, index n cannot correspond to tensor n+1, because the list params contains only n tensors. So index n corresponds to the second element of the first tensor. Similarly, index n+1 corresponds to the second element of the second tensor, etc.

So, in the code

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

index 0 corresponds to the first element of the first tensor: 1

index 1 corresponds to the first element of the second tensor: 10

index 2 corresponds to the second element of the first tensor: 2

index 3 corresponds to the second element of the second tensor: 20

Thus, the result would be:

[ 2  1  2 10  2 20]
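
As a quick sanity check, the 'mod' rule "id i maps to shard i % n, row i // n" reproduces this result (a plain-Python sketch, not the TensorFlow internals):

import numpy as np

# sketch: reproduce the 'mod' lookup above outside TensorFlow
shards = [np.array([1, 2]), np.array([10, 20])]
ids = [2, 0, 2, 1, 2, 3]
n = len(shards)

result = np.array([shards[i % n][i // n] for i in ids])
print(result)  # [ 2  1  2 10  2 20]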

Answered by Aerin

Adding to Asher Stern's answer, params is interpreted as a partitioning of a large embedding tensor. It can be a single tensor representing the complete embedding tensor, or a list of X tensors, all of the same shape except for the first dimension, representing sharded embedding tensors.

The function tf.nn.embedding_lookup is written considering the fact that the embedding (params) will be large. Therefore we need partition_strategy.

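As a rough sketch of what such a sharded embedding might look like (two equally sized shards chosen arbitrarily here, TF 1.x API; real distributed setups would place the shards on different devices):

import tensorflow as tf

vocab_size, emb_dim, num_shards = 1000, 64, 2

# the embedding matrix split into `num_shards` variables,
# each holding vocab_size / num_shards rows
shards = [tf.Variable(tf.random_uniform([vocab_size // num_shards, emb_dim]))
          for _ in range(num_shards)]

ids = tf.constant([3, 42, 999])
# ids are routed to the shards according to partition_strategy ('mod' by default)
embedded = tf.nn.embedding_lookup(shards, ids, partition_strategy='mod')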

Answered by Shanmugam Ramasamy

Another way to look at it: assume that you flatten out the tensors into a one-dimensional array, and then you perform a lookup.

For example, Tensor0=[1,2,3], Tensor1=[4,5,6], Tensor2=[7,8,9]

With the default 'mod' strategy, the flattened-out tensor (interleaving the tensors) will be as follows: [1,4,7,2,5,8,3,6,9]

Now when you do a lookup of [0,3,4,1,7] it will yield [1,2,5,4,6]

(i.e.) if the lookup value is 7, for example, and we have 3 tensors (or a tensor with 3 rows), then:

7 / 3: the remainder is 1 and the quotient is 2, so the element at index 2 of Tensor1 is returned, which is 6.

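A small numpy sketch of this flatten-and-interleave view (illustrative only; TensorFlow does not literally build this array):

import numpy as np

tensors = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]

# interleave the tensors element-wise, mimicking the 'mod' strategy
flattened = np.stack(tensors, axis=1).reshape(-1)
print(flattened)                   # [1 4 7 2 5 8 3 6 9]
print(flattened[[0, 3, 4, 1, 7]])  # [1 2 5 4 6]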

Answered by Yan Zhao

When the params tensor has more than two dimensions, the ids only refer to the top (first) dimension. Maybe it's obvious to most people, but I had to run the following code to understand that:

embeddings = tf.constant([[[1,1],[2,2],[3,3],[4,4]],[[11,11],[12,12],[13,13],[14,14]],
                          [[21,21],[22,22],[23,23],[24,24]]])
ids=tf.constant([0,2,1])
embed = tf.nn.embedding_lookup(embeddings, ids, partition_strategy='div')

with tf.Session() as session:
    result = session.run(embed)
    print (result)

I was just trying the 'div' strategy, and for a single tensor it makes no difference.

Here is the output:

[[[ 1  1]
  [ 2  2]
  [ 3  3]
  [ 4  4]]

 [[21 21]
  [22 22]
  [23 23]
  [24 24]]

 [[11 11]
  [12 12]
  [13 13]
  [14 14]]]
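
For a single params tensor like this, the lookup is equivalent to gathering along the first axis; as a quick sketch (reusing the tensors defined above):

# for a single tensor, the lookup above is the same as tf.gather on axis 0
embed_gather = tf.gather(embeddings, ids)

with tf.Session() as session:
    print(session.run(tf.reduce_all(tf.equal(embed, embed_gather))))  # True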

Answered by kmario23

Yes, the purpose of the tf.nn.embedding_lookup() function is to perform a lookup in the embedding matrix and return the embeddings (or, in simple terms, the vector representations) of words.

A simple embedding matrix (of shape vocabulary_size x embedding_dimension) would look like the one below. (i.e. each word will be represented by a vector of numbers; hence the name word2vec)

Embedding Matrix

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
like 0.36808 0.20834 -0.22319 0.046283 0.20098 0.27515 -0.77127 -0.76804
between 0.7503 0.71623 -0.27033 0.20059 -0.17008 0.68568 -0.061672 -0.054638
did 0.042523 -0.21172 0.044739 -0.19248 0.26224 0.0043991 -0.88195 0.55184
just 0.17698 0.065221 0.28548 -0.4243 0.7499 -0.14892 -0.66786 0.11788
national -1.1105 0.94945 -0.17078 0.93037 -0.2477 -0.70633 -0.8649 -0.56118
day 0.11626 0.53897 -0.39514 -0.26027 0.57706 -0.79198 -0.88374 0.30119
country -0.13531 0.15485 -0.07309 0.034013 -0.054457 -0.20541 -0.60086 -0.22407
under 0.13721 -0.295 -0.05916 -0.59235 0.02301 0.21884 -0.34254 -0.70213
such 0.61012 0.33512 -0.53499 0.36139 -0.39866 0.70627 -0.18699 -0.77246
second -0.29809 0.28069 0.087102 0.54455 0.70003 0.44778 -0.72565 0.62309 


I split the above embedding matrix and loaded only the words into vocab, which will be our vocabulary, and the corresponding vectors into the emb array.

import numpy as np

vocab = ['the','like','between','did','just','national','day','country','under','such','second']

emb = np.array([[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862],
   [0.36808, 0.20834, -0.22319, 0.046283, 0.20098, 0.27515, -0.77127, -0.76804],
   [0.7503, 0.71623, -0.27033, 0.20059, -0.17008, 0.68568, -0.061672, -0.054638],
   [0.042523, -0.21172, 0.044739, -0.19248, 0.26224, 0.0043991, -0.88195, 0.55184],
   [0.17698, 0.065221, 0.28548, -0.4243, 0.7499, -0.14892, -0.66786, 0.11788],
   [-1.1105, 0.94945, -0.17078, 0.93037, -0.2477, -0.70633, -0.8649, -0.56118],
   [0.11626, 0.53897, -0.39514, -0.26027, 0.57706, -0.79198, -0.88374, 0.30119],
   [-0.13531, 0.15485, -0.07309, 0.034013, -0.054457, -0.20541, -0.60086, -0.22407],
   [ 0.13721, -0.295, -0.05916, -0.59235, 0.02301, 0.21884, -0.34254, -0.70213],
   [ 0.61012, 0.33512, -0.53499, 0.36139, -0.39866, 0.70627, -0.18699, -0.77246 ],
   [ -0.29809, 0.28069, 0.087102, 0.54455, 0.70003, 0.44778, -0.72565, 0.62309 ]])


emb.shape
# (11, 8)


Embedding Lookup in TensorFlow

Now we will see how we can perform an embedding lookup for some arbitrary input sentence.

In [54]: from collections import OrderedDict

# embedding as TF tensor (for now constant; could be tf.Variable() during training)
In [55]: tf_embedding = tf.constant(emb, dtype=tf.float32)

# input for which we need the embedding
In [56]: input_str = "like the country"

# build index based on our `vocabulary`
In [57]: word_to_idx = OrderedDict({w:vocab.index(w) for w in input_str.split() if w in vocab})

# lookup in the embedding matrix & return the vectors for the input words
# (assumes a default session, e.g. tf.InteractiveSession(), is active)
In [58]: tf.nn.embedding_lookup(tf_embedding, list(word_to_idx.values())).eval()
Out[58]: 
array([[ 0.36807999,  0.20834   , -0.22318999,  0.046283  ,  0.20097999,
         0.27515   , -0.77126998, -0.76804   ],
       [ 0.41800001,  0.24968   , -0.41242   ,  0.1217    ,  0.34527001,
        -0.044457  , -0.49687999, -0.17862   ],
       [-0.13530999,  0.15485001, -0.07309   ,  0.034013  , -0.054457  ,
        -0.20541   , -0.60086   , -0.22407   ]], dtype=float32)

Observe how we got the embeddings from our original embedding matrix (with words) using the indices of words in our vocabulary.

Usually, such an embedding lookup is performed by the first layer (called the embedding layer), which then passes these embeddings to RNN/LSTM/GRU layers for further processing.

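A rough sketch of that pattern (TF 1.x graph API, arbitrary sizes, just to show where the lookup sits in a model):

import tensorflow as tf

vocab_size, emb_dim, hidden_units = 1000, 64, 128

token_ids = tf.placeholder(tf.int32, [None, None])            # [batch, time]
embedding_matrix = tf.Variable(tf.random_uniform([vocab_size, emb_dim]))

# the "embedding layer" is just a lookup into the trainable matrix
inputs = tf.nn.embedding_lookup(embedding_matrix, token_ids)  # [batch, time, emb_dim]

# the embeddings are then fed into a recurrent layer
cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_units)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)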

Side note: usually the vocabulary will also have a special unk token. So, if a token from our input sentence is not present in our vocabulary, then the index corresponding to unk will be looked up in the embedding matrix.

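A small sketch of that unk handling (the vocab and helper mapping here are made up for illustration; a real pipeline would build this mapping once up front):

# map out-of-vocabulary tokens to a special 'unk' index before the lookup
vocab = ['unk', 'the', 'like', 'country']          # index 0 reserved for unk
word_to_idx = {w: i for i, w in enumerate(vocab)}

tokens = "like the moon".split()                   # 'moon' is not in the vocab
ids = [word_to_idx.get(t, word_to_idx['unk']) for t in tokens]
print(ids)  # [2, 1, 0] -> these ids are then fed to tf.nn.embedding_lookup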

P.S. Note that embedding_dimension is a hyperparameter that one has to tune for their application, but popular models like Word2Vec and GloVe use 300-dimensional vectors to represent each word.

Bonus reading: word2vec skip-gram model

Answered by joaoaccarvalho

Since I was also intrigued by this function, I'll give my two cents.

The way I see it in the 2D case is just as a matrix multiplication (it's easy to generalize to other dimensions).

Consider a vocabulary with N symbols. Then, you can represent a symbol x as a one-hot-encoded vector of dimensions Nx1.

But you want a representation of this symbol not as a vector of Nx1, but as one with dimensions Mx1, called y.

So, to transform x into y, you can use an embedding matrix E, with dimensions MxN:

y = Ex

This is essentially what tf.nn.embedding_lookup(params, ids, ...) is doing, with the nuance that each id is just a number that represents the position of the 1 in the one-hot-encoded vector x.

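A quick numpy sketch of this equivalence (illustrative only; the lookup never materializes the one-hot vector or performs the matrix multiplication):

import numpy as np

N, M = 5, 3                    # vocabulary size, embedding dimension
E = np.random.random([M, N])   # embedding matrix, M x N

idx = 2                        # the id of the symbol
x = np.zeros(N)                # one-hot encoding of the symbol
x[idx] = 1.0

y_matmul = E @ x               # M-dimensional embedding via matrix multiplication
y_lookup = E[:, idx]           # the same vector, picked out directly
                               # (tf.nn.embedding_lookup stores embeddings as rows
                               # of params, i.e. it effectively works on E transposed)

print(np.allclose(y_matmul, y_lookup))  # True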

Answered by thushv89

Here's an image depicting the process of embedding lookup.

Image: Embedding lookup process

Concisely, it gets the corresponding rows of an embedding layer, specified by a list of IDs, and provides them as a tensor. It is achieved through the following process.

  1. Define a placeholder: lookup_ids = tf.placeholder(tf.int32, shape=[None])
  2. Define an embedding layer: embeddings = tf.Variable(tf.random_uniform([100, 10]))
  3. Define the TensorFlow operation: embed_lookup = tf.nn.embedding_lookup(embeddings, lookup_ids)
  4. Get the results by running: lookup = session.run(embed_lookup, feed_dict={lookup_ids: [95, 4, 14]})
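
Putting those steps together, a minimal runnable sketch (TF 1.x graph API; the 100 x 10 embedding shape and the example ids are arbitrary) might look like this:

import tensorflow as tf

# 1. placeholder for the ids we want to look up
lookup_ids = tf.placeholder(tf.int32, shape=[None])

# 2. embedding layer: 100 rows (vocabulary size) of 10-dimensional vectors
embeddings = tf.Variable(tf.random_uniform([100, 10], -1.0, 1.0))

# 3. the lookup operation
embed_lookup = tf.nn.embedding_lookup(embeddings, lookup_ids)

# 4. run it
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    lookup = session.run(embed_lookup, feed_dict={lookup_ids: [95, 4, 14]})
    print(lookup.shape)  # (3, 10)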