Python: How to find the closest word to a vector using word2vec

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/32759712/

Time: 2020-08-19 12:11:40  Source: igfitidea

How to find the closest word to a vector using word2vec

python text-mining data-analysis word2vec

Asked by sel

I have just started using word2vec, and I am wondering how to find the closest word to a given vector. Suppose I have this vector, which is the average of a set of vectors:

array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)

Is there a straightforward way to find the word in my training data that is most similar to this vector?

Or is the only solution to calculate the cosine similarity between this vector and the vector of each word in my training data, and then select the closest one?

Thanks.

Accepted answer by Nicolas Ivanov

For the gensim implementation of word2vec, there is a most_similar() function that lets you find words semantically close to a given word:

>>> # In gensim 4.0+ the method lives on model.wv (model.most_similar was removed)
>>> model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

or to its vector representation:

>>> from numpy import array, float32
>>> your_word_vector = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
>>> model.wv.most_similar(positive=[your_word_vector], topn=1)

where topn defines the desired number of returned results.

However, my gut feeling is that this function does exactly what you proposed, i.e. it calculates the cosine similarity between the given vector and every other vector in the vocabulary (which is quite inefficient).
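To make the brute-force idea concrete, here is a minimal NumPy sketch of it. It is not gensim's code: the function name closest_word and the toy vocabulary are made up for illustration, and it assumes no vector has zero length.

```python
import numpy as np

def closest_word(query, words, vectors):
    """Return (word, cosine similarity) for the row of `vectors` closest to `query`."""
    query = np.asarray(query, dtype=np.float32)
    vectors = np.asarray(vectors, dtype=np.float32)
    # Normalize to unit length so that a dot product equals cosine similarity.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    # One similarity score per vocabulary word (this is the full scan).
    similarities = unit_vectors @ unit_query
    best = int(np.argmax(similarities))
    return words[best], float(similarities[best])

# Hypothetical toy vocabulary; with a trained model these would be its word vectors.
words = ["cat", "dog", "pet", "car"]
vectors = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.7, 0.7, 0.1],
                    [0.0, 0.0, 1.0]])

# Average of the "cat" and "dog" vectors, as in the question.
average = vectors[:2].mean(axis=0)
word, score = closest_word(average, words, vectors)
```

The scan costs one dot product per vocabulary word, which is the inefficiency mentioned above; as a single matrix-vector product it is usually still fast enough for moderately sized vocabularies.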

Answered by Andrew Krizhanovsky

Don't forget to pass an empty list of negative words to the most_similar function:

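import numpy as np

# my_vector is your own (e.g. averaged) vector; cast it to a float32 NumPy array
model_word_vector = np.array(my_vector, dtype='f')
topn = 20
# In gensim 4.0+ the method lives on model.wv; the negative list is explicitly empty
most_similar_words = model.wv.most_similar(positive=[model_word_vector], negative=[], topn=topn)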

Answered by Moobie

Alternatively, model.wv.similar_by_vector(vector, topn=10, restrict_vocab=None) is also available in the gensim package.

Find the top-N most similar words by vector.

Parameters:

  • vector (numpy.array) – Vector from which similarities are to be computed.

  • topn ({int, False}, optional) – Number of top-N similar words to return. If topn is False, similar_by_vector returns the vector of similarity scores.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you've sorted the vocabulary by descending frequency.)

Returns: Sequence of (word, similarity).

Return type: list of (str, float)
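To illustrate what the topn and restrict_vocab parameters described above mean, here is a rough plain-NumPy sketch. The function similar_by_vector_sketch and the toy data are hypothetical and only mimic the documented behaviour; this is not gensim's implementation.

```python
import numpy as np

def similar_by_vector_sketch(vector, words, vectors, topn=10, restrict_vocab=None):
    """Rank words by cosine similarity to `vector` (illustration only)."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if restrict_vocab is not None:
        # Only search the first `restrict_vocab` rows, in vocabulary order.
        words, vectors = words[:restrict_vocab], vectors[:restrict_vocab]
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    vector = np.asarray(vector, dtype=np.float32)
    similarities = unit_vectors @ (vector / np.linalg.norm(vector))
    if topn is False:
        # Mirror the documented behaviour: return the raw similarity scores.
        return similarities
    # Indices of the `topn` highest similarities, best first.
    order = np.argsort(-similarities)[:topn]
    return [(words[i], float(similarities[i])) for i in order]

# Hypothetical toy vocabulary, assumed to be sorted by descending frequency.
words = ["the", "cat", "dog", "car"]
vectors = [[0.5, 0.5, 0.5], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.0, 0.9]

top2 = similar_by_vector_sketch(query, words, vectors, topn=2)
restricted = similar_by_vector_sketch(query, words, vectors, topn=1, restrict_vocab=3)
all_scores = similar_by_vector_sketch(query, words, vectors, topn=False)
```

With restrict_vocab=3 the word "car" is never examined, so the best match changes; that is the trade-off of restricting the search to the most frequent words.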
