Python Gensim: KeyError: "word not in vocabulary"

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/45420466/

Date: 2020-08-19 17:00:41  Source: igfitidea

Gensim: KeyError: "word not in vocabulary"

Tags: python, nlp, gensim, word2vec, topic-modeling

Asked by Krishnang K Dalal

I have trained a Word2Vec model using Python's Gensim library. I have a tokenized list, shown below. The vocabulary size is 34, but I am only showing a few of the 34 words:

b = ['let',
 'know',
 'buy',
 'someth',
 'featur',
 'mashabl',
 'might',
 'earn',
 'affili',
 'commiss',
 'fifti',
 'year',
 'ago',
 'graduat',
 '21yearold',
 'dustin',
 'hoffman',
 'pull',
 'asid',
 'given',
 'one',
 'piec',
 'unsolicit',
 'advic',
 'percent',
 'buy']

Model

import gensim

model = gensim.models.Word2Vec(b, min_count=1, size=32)
print(model)
# prints: Word2Vec(vocab=34, size=32, alpha=0.025)

If I try to get the similarity score for one of the words in the list by calling model['buy'], I get the following error:

KeyError: "word 'buy' not in vocabulary"

Can you tell me what I am doing wrong, and how I can check the model so that it can then be used to fit PCA or t-SNE and visualize similar words forming a topic? Thank you.

Answer by bunji

The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Each sentence is itself a list of words. From the docs:

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

Right now, it treats each word in your list b as a sentence, so it trains Word2Vec on each character of each word rather than on each word in b. As things stand, you can do:

model = gensim.models.Word2Vec(b, min_count=1, size=32)

# single characters are the vocabulary entries here, so 'a' can be looked up
print(model['a'])
array([  7.42487283e-03,  -5.65282721e-03,   1.28707094e-02, ... ])
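The effect described above can be reproduced without Gensim. The following is a minimal sketch, not Gensim's actual implementation: build_vocab is a hypothetical helper that simply counts tokens by iterating over "sentences" and then over the items of each sentence, which is enough to show why a flat list of strings produces a vocabulary of characters.

```python
from collections import Counter


def build_vocab(sentences):
    """Hypothetical helper: count every token found in an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Iterating over a str yields its characters;
        # iterating over a list of str yields whole words.
        for token in sentence:
            counts[token] += 1
    return counts


b = ['let', 'know', 'buy']

flat_vocab = build_vocab(b)      # each string is treated as a sentence
nested_vocab = build_vocab([b])  # one sentence containing three words

print('buy' in flat_vocab)    # False
print('buy' in nested_vocab)  # True
print(sorted(flat_vocab))     # ['b', 'e', 'k', 'l', 'n', 'o', 't', 'u', 'w', 'y']
```

With the flat list, the only keys that exist are single characters, which matches the KeyError for 'buy' in the question.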

To get it to work for words, simply wrap b in another list so that it is interpreted correctly:

model = gensim.models.Word2Vec([b], min_count=1, size=32)

# b is now a single sentence, so whole words are the vocabulary entries
print(model['buy'])
array([-0.01331611,  0.00496594, -0.00165093, -0.01444992,  0.01393849, ... ])

Answer by Ravi G

According to the docs, you need to pass an iterable of sentences. Whatever you pass to the function is treated as an iterable of sentences, so when you pass a flat list of words, each word is treated as a sentence and a word2vec vector is computed for each character in the whole corpus.

To avoid that problem, pass the list of words inside another list.

word2vec_model = gensim.models.Word2Vec([b], min_count=1, size=32)