Python Gensim: KeyError: "word not in vocabulary"

Question

Asked by Krishnang K Dalal

I have trained a Word2Vec model using Python's Gensim library. I have a tokenized list as below. The vocab size is 34, but I am only showing a few of the 34 here:

b = ['let',
 'know',
 'buy',
 'someth',
 'featur',
 'mashabl',
 'might',
 'earn',
 'affili',
 'commiss',
 'fifti',
 'year',
 'ago',
 'graduat',
 '21yearold',
 'dustin',
 'hoffman',
 'pull',
 'asid',
 'given',
 'one',
 'piec',
 'unsolicit',
 'advic',
 'percent',
 'buy']

Model

import gensim

model = gensim.models.Word2Vec(b, min_count=1, size=32)
print(model)
# prints: Word2Vec(vocab=34, size=32, alpha=0.025)

If I try to get the similarity score for one of the words in the list by doing model['buy'], I get:

KeyError: "word 'buy' not in vocabulary"

Can you tell me what I am doing wrong, and how I can check the model so that it can then be used with PCA or t-SNE to visualize similar words forming a topic? Thank you.

Answer 1

Answered by bunji

The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Each sentence is itself a list of words. From the docs:

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

Right now, it treats each word in your list b as a sentence, so it is running Word2Vec on each character in each word, rather than on each word in b. As it stands, you can do:

model = gensim.models.Word2Vec(b,min_count=1,size=32)

print(model['a'])
array([  7.42487283e-03,  -5.65282721e-03,   1.28707094e-02, ... ]

To get it to work for words, simply wrap b in another list so that it is interpreted correctly:

model = gensim.models.Word2Vec([b],min_count=1,size=32)

print(model['buy'])
array([-0.01331611,  0.00496594, -0.00165093, -0.01444992,  0.01393849, ... ]
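
The difference between the two calls can be reproduced without Gensim. The sketch below is only a simplified stand-in for how an iterable of sentences gets consumed, not Gensim's actual vocabulary code; count_tokens is a hypothetical helper and the three-word list is a shortened version of b.

```python
from collections import Counter


def count_tokens(sentences):
    """Count tokens, treating each item as a sentence and each element of it as a token."""
    counts = Counter()
    for sentence in sentences:   # each item is taken to be one sentence
        for token in sentence:   # iterating a str yields its characters
            counts[token] += 1
    return counts


b = ['let', 'know', 'buy']  # shortened version of the list in the question

flat_vocab = count_tokens(b)       # every word is split into characters
wrapped_vocab = count_tokens([b])  # one sentence made of three words

print(sorted(flat_vocab))     # single characters only
print(sorted(wrapped_vocab))  # whole words
print('buy' in flat_vocab, 'buy' in wrapped_vocab)
```

With the flat list, 'buy' never becomes a vocabulary entry because only its characters are counted, which matches the KeyError in the question.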

Answer 2

Answered by Ravi G

According to the docs, you need to pass an iterable of sentences. Whatever you pass to the function is treated as an iterable of sentences, so when you pass a flat list of words, each word is iterated as if it were a sentence, and a word2vec vector is computed for each character in the corpus.

To avoid that problem, pass the list of words inside another list.

word2vec_model = gensim.models.Word2Vec([b],min_count=1,size=32)
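
One way to guard against this mistake is to normalize the input before training. The helper below is a minimal sketch: as_sentences is a hypothetical name, not part of Gensim, and it assumes that a flat list of strings is meant to be a single tokenized sentence.

```python
def as_sentences(tokens_or_sentences):
    """Return a list of sentences, each being a list of string tokens.

    Assumption: a flat list of strings is one tokenized sentence, so it is
    wrapped in an outer list. Anything else is returned as a plain list.
    """
    items = list(tokens_or_sentences)
    if items and all(isinstance(item, str) for item in items):
        return [items]  # flat list of words -> one sentence
    return items        # already a list of sentences (or empty)


print(as_sentences(['let', 'know', 'buy']))
print(as_sentences([['let', 'know'], ['buy']]))
```

The result could then be passed as the first argument to gensim.models.Word2Vec in place of b or [b].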
