Python: get the most similar words, given the vector of the word (not the word itself)

Question

Asked by amin

Using the gensim.models.Word2Vec library, you can provide a model and a "word" for which you want to find the list of most similar words:

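import gensim
# Note: newer gensim releases expose this loader as gensim.models.KeyedVectors.load_word2vec_format.
model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)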

I wonder whether it is possible to give the system the model and a "vector" as input, and ask it to return the most similar words (those whose vectors are very close to the given vector). Something like:

model.most_similar(positive=[VECTOR], topn=N)

I need this functionality for a bilingual setting, in which I have two models (English and German), as well as some English words for which I need to find the most similar German candidates. What I want to do is get the vector of each English word from the English model:

model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en=model_EN[WORD_EN]

and then query the German model with these vectors:

model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)

I have implemented this in C using the original distance function in the word2vec package. But now I need it in Python so that I can integrate it with my other scripts.

Do you know whether the gensim.models.Word2Vec library, or another similar library, already has a method that does this? Or do I need to implement it myself?

Answer 1

Answered by user48135

The method similar_by_vector returns the top-N most similar words by vector:

similar_by_vector(vector, topn=10, restrict_vocab=None)
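
To make the ranking concrete, here is a minimal pure-Python sketch of the kind of cosine-similarity search this method performs, as far as I understand it. The vocabulary toy_vectors and the helpers cosine and most_similar_to_vector are made up for illustration and are not part of gensim; with a real model you would call similar_by_vector on the loaded vectors instead.

```python
import math

# Hypothetical toy vocabulary; a real model holds many more, higher-dimensional vectors.
toy_vectors = {
    "house": [0.5, 0.2, 0.9],
    "home": [0.45, 0.25, 0.85],
    "car": [0.9, 0.1, 0.1],
    "banana": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_to_vector(vectors, query, topn=10):
    """Return the topn (word, similarity) pairs, most similar first."""
    scored = [(word, cosine(vec, query)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Query with a raw vector rather than a word.
result = most_similar_to_vector(toy_vectors, [0.5, 0.2, 0.9], topn=2)
print(result)
```

Here the query vector equals the vector stored for "house", so "house" ranks first, followed by its closest neighbour "home".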

Answer 2

Answered by Dachrimar

I don't think what you're trying to achieve can give an accurate answer, simply because the two models are trained separately. Although the English and the German model will have similar distances between their respective word vectors, there's no guarantee that the word vector for 'House' will have the same direction as the word vector for 'Haus'.

In simple terms, if you trained both models with vector size=3 and 'House' has vector [0.5,0.2,0.9], there's no guarantee that 'Haus' will have vector [0.5,0.2,0.9] or even something close to that.
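
The point can be illustrated with a small pure-Python sketch. It assumes, purely for illustration, that the "German" space is the "English" space rotated by 90 degrees around the z-axis; separately trained models differ in far less tidy ways. All vectors and names below are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rotate_z_90(v):
    """Rotate a 3-d vector by 90 degrees around the z-axis: (x, y, z) -> (-y, x, z)."""
    x, y, z = v
    return [-y, x, z]

# Made-up "English" vectors.
en = {"house": [0.9, 0.1, 0.2], "garden": [0.8, 0.3, 0.1]}
# Made-up "German" vectors: same internal geometry, different orientation.
de = {"Haus": rotate_z_90(en["house"]), "Garten": rotate_z_90(en["garden"])}

within_en = cosine(en["house"], en["garden"])  # similarity inside the English space
within_de = cosine(de["Haus"], de["Garten"])   # similarity inside the German space
across = cosine(en["house"], de["Haus"])       # comparing across the two spaces

print(f"house vs garden (EN): {within_en:.3f}")
print(f"Haus vs Garten (DE):  {within_de:.3f}")
print(f"house vs Haus (EN vs DE): {across:.3f}")
```

The two within-language similarities come out the same, while the cross-language similarity between 'house' and 'Haus' is close to zero, even though the two spaces encode exactly the same relationships.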

To solve this, you could first translate the English word to German and then use the vector for that word to look for similar words in the German model.
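
A rough sketch of that workaround, in pure Python. The bilingual dictionary en_to_de and the toy German vectors are hypothetical stand-ins; in practice the translation would come from a real dictionary or translation service, and the lookup would go through the loaded German model.

```python
import math

# Hypothetical bilingual dictionary and toy German vectors (illustration only).
en_to_de = {"house": "Haus"}
de_vectors = {
    "Haus": [0.2, 0.9, 0.1],
    "Wohnung": [0.25, 0.8, 0.2],
    "Apfel": [0.9, 0.1, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def german_neighbours(word_en, topn=2):
    """Translate the English word first, then rank the other German words."""
    word_de = en_to_de[word_en]    # step 1: translate EN -> DE
    query = de_vectors[word_de]    # step 2: take the vector from the German space
    scored = [(w, cosine(vec, query))
              for w, vec in de_vectors.items() if w != word_de]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

neighbours = german_neighbours("house")
print(neighbours)
```

Because the query vector now comes from the German space itself, the similarities are meaningful; the quality of the result then depends on the translation step.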

TL;DR: You can't just plug vectors from one language model into another and expect accurate results.
