How to calculate the sentence similarity using word2vec model of gensim with python

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original address, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/22129943/
Asked by zhfkt
According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.
e.g.
trained_model.similarity('woman', 'man')
0.73723527
However, the word2vec model fails to predict the sentence similarity. I found the LSI model with sentence similarity in gensim, but it doesn't seem that it can be combined with the word2vec model. The corpus of each sentence I have is not very long (shorter than 10 words). So, are there any simple ways to achieve the goal?
Accepted answer by Michael Aaron Safyan
This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurrences / relationships in lots of real textual examples, etc.
The simplest thing you could try -- though I don't know how well this would perform and it would certainly not give you the optimal results -- would be to first remove all "stop" words (words like "the", "an", etc. that don't add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
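For concreteness, here is a minimal sketch of the sum-of-vectors idea just described, assuming a trained gensim Word2Vec model bound to the name model; the stop-word list is purely illustrative:

import numpy as np

stop_words = {'the', 'a', 'an', 'to', 'of', 'and'}  # illustrative, not exhaustive

def sum_vector(sentence, model):
    # sum the word2vec vectors of the non-stop words that are in the vocabulary
    vec = np.zeros(model.vector_size, dtype='float32')
    for word in sentence.lower().split():
        if word not in stop_words and word in model.wv:
            vec += model.wv[word]
    return vec

v1 = sum_vector('he walked to the store yesterday', model)
v2 = sum_vector('yesterday he walked to the store', model)
diff = np.linalg.norm(v1 - v2)  # smaller difference suggests more similar sentences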
So, short answer is, no, there's no easy way to do this (at least not to do it well).
Answer by Rani Nelken
Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors normalized. Thus, the word count is not a factor.
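As a minimal illustration (reusing the v1 and v2 sums from the sketch above), the cosine is just the normalized dot product:

import numpy as np

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))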
Answer by lechatpito
I am using the following method and it works well. You first need to run a POS tagger and then filter your sentence to get rid of the stop words (determiners, conjunctions, ...). I recommend TextBlob APTagger. Then you build a sentence vector by taking the mean of each word vector in the sentence. The n_similarity method in Gensim word2vec does exactly that by allowing you to pass two sets of words to compare.
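A rough sketch of this recipe, assuming a trained gensim Word2Vec model named model; for brevity it uses TextBlob's default tagger rather than the APTagger, and the set of POS prefixes to keep is an assumption:

from textblob import TextBlob

def content_words(sentence):
    # keep nouns, verbs, adjectives and adverbs; drop determiners, conjunctions, ...
    keep = ('NN', 'VB', 'JJ', 'RB')
    return [word.lower() for word, tag in TextBlob(sentence).tags if tag.startswith(keep)]

# also drop out-of-vocabulary words, since n_similarity requires known words
s1 = [w for w in content_words('This room is dirty') if w in model.wv]
s2 = [w for w in content_words('dirty and disgusting room') if w in model.wv]
score = model.wv.n_similarity(s1, s2)  # cosine between the two mean vectors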
Answer by Willie
Since you're using gensim, you should probably use its doc2vec implementation. doc2vec is an extension of word2vec to the phrase-, sentence-, and document-level. It's a pretty simple extension, described here
http://cs.stanford.edu/~quocle/paragraph_vector.pdf
Gensim is nice because it's intuitive, fast, and flexible. What's great is that you can grab the pretrained word embeddings from the official word2vec page and the syn0 layer of gensim's Doc2Vec model is exposed so that you can seed the word embeddings with these high quality vectors!
GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)
I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.
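As a hedged sketch of what a doc2vec workflow might look like (the corpus and hyper-parameters below are illustrative only, and the parameter names follow recent gensim versions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from numpy import dot
from numpy.linalg import norm

corpus = [
    'he walked to the store yesterday',
    'yesterday he walked to the store',
    'the weather is nice today',
]
documents = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(corpus)]

# train a small Doc2Vec model; a real corpus and settings would be much larger
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=40)

# infer vectors for (possibly unseen) sentences and compare them with cosine
v1 = model.infer_vector('he walked to the store yesterday'.split())
v2 = model.infer_vector('yesterday he walked to the store'.split())
print(dot(v1, v2) / (norm(v1) * norm(v2)))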
There exist other sentence-to-vector techniques than the one proposed in Le & Mikolov's paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionality - the semantics of the sentence come from:
1. semantics of the words
2. rules for how these words interact and combine into phrases
They've proposed a few such models (getting increasingly more complex) for how to use compositionality to build sentence-level representations.
2011 - unfolding recursive autoencoder (comparatively simple; start here if interested)
2012 - matrix-vector neural network
2013 - neural tensor network
2015 - Tree LSTM
His papers are all available at socher.org. Some of these models are available, but I'd still recommend gensim's doc2vec. For one, the 2011 URAE isn't particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can't swap in different word vectors, so you're stuck with 2011's pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec's or GloVe's.
Haven't worked with the Tree LSTM yet, but it seems very promising!
tl;dr Yeah, use gensim's doc2vec. But other methods do exist!
Answer by Max
There are extensions of Word2Vec intended to solve the problem of comparing longer pieces of text like phrases or sentences. One of them is paragraph2vec or doc2vec.
"Distributed Representations of Sentences and Documents" http://cs.stanford.edu/~quocle/paragraph_vector.pdf
Answer by tbmihailov
If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:
import numpy as np
from scipy import spatial

# vocabulary of the trained word2vec model
index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    # average the word2vec vectors of the in-vocabulary words of the sentence
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec
Calculate similarity:
s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
> 0.915479828613
Answer by Lerner Zhang
I have tried the methods provided by the previous answers. It works, but the main drawback is that the longer the sentences, the larger the similarity will be (to calculate the similarity I use the cosine score of the two mean embeddings of any two sentences), since the more words there are, the more positive semantic effects are added to the sentence.
I thought I should change my mind and use sentence embeddings instead, as studied in this paper and this.
Answer by Poorna Prudhvi
I would like to update the existing solution to help the people who are going to calculate the semantic similarity of sentences.
Step 1:
Load the suitable model using gensim, calculate the word vectors for the words in the sentence, and store them as a word list.
Step 2 : Computing the sentence vector
The calculation of semantic similarity between sentences was difficult before, but recently a paper named "A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS" was proposed, which suggests a simple approach: compute the weighted average of the word vectors in the sentence and then remove the projections of the average vectors on their first principal component. Here the weight of a word w is a/(a + p(w)), with a being a parameter and p(w) the (estimated) word frequency, called smooth inverse frequency. This method performs significantly better.
Simple code to calculate the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here
Step 3: Using sklearn's cosine_similarity, load the two sentence vectors and compute the similarity.
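A hedged sketch of the three steps above, assuming a trained gensim Word2Vec model named model; the helper name, the word_freq dictionary of estimated unigram probabilities, and the parameter a are assumptions for illustration, and in practice the principal component should be estimated over a large set of sentences (the released code accompanying the paper differs in details):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# assumed to be pre-computed from a large corpus; tiny illustrative values here
word_freq = {'this': 0.02, 'is': 0.03, 'a': 0.05, 'also': 0.005, 'sentence': 0.0005}

def sif_embeddings(sentences, model, word_freq, a=1e-3):
    # Step 2a: weighted average of word vectors, weight = a / (a + p(w))
    vecs = []
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if w in model.wv]
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        vecs.append(np.average([model.wv[w] for w in words], axis=0, weights=weights))
    vecs = np.array(vecs)
    # Step 2b: remove the projection on the first principal component
    pc = TruncatedSVD(n_components=1).fit(vecs).components_
    return vecs - vecs.dot(pc.T) * pc

emb = sif_embeddings(['this is a sentence', 'this is also a sentence'], model, word_freq)
print(cosine_similarity(emb[0:1], emb[1:2]))  # Step 3: cosine similarity of the two sentence vectors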
This is the most simple and efficient method to compute the sentence similarity.
Answer by Ehsan
You can use the Word Mover's Distance algorithm. Here is an easy description of WMD.
#load word2vec model, here GoogleNews is used
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
#two sample sentences
s1 = 'the first sentence'
s2 = 'the second text'
#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)
print ('distance = %.3f' % distance)
P.S.: if you face an error about importing the pyemd library, you can install it using the following command:
pip install pyemd
Answer by Astariul
There is a function from the documentation that takes two lists of words and compares their similarity.
s1 = 'This room is dirty'
s2 = 'dirty and disgusting room' #corrected variable name
distance = model.wv.n_similarity(s1.lower().split(), s2.lower().split())

