Python Gensim: KeyError: "word not in vocabulary"
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/45420466/
Gensim: KeyError: "word not in vocabulary"
Asked by Krishnang K Dalal
I have a Word2vec model trained with Python's Gensim library. I have a tokenized list, shown below. The vocabulary size is 34, but I am showing only a few of the 34 words:
b = ['let',
'know',
'buy',
'someth',
'featur',
'mashabl',
'might',
'earn',
'affili',
'commiss',
'fifti',
'year',
'ago',
'graduat',
'21yearold',
'dustin',
'hoffman',
'pull',
'asid',
'given',
'one',
'piec',
'unsolicit',
'advic',
'percent',
'buy']
Model
import gensim

model = gensim.models.Word2Vec(b, min_count=1, size=32)
print(model)
# prints: Word2Vec(vocab=34, size=32, alpha=0.025)
If I try to get the similarity score for one of the words in the list by calling model['buy'], I get:
KeyError: "word 'buy' not in vocabulary"
Can you suggest what I am doing wrong, and how I can check the model so that it can then be used to fit PCA or t-SNE in order to visualize similar words forming a topic? Thank you.
Answered by bunji
The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Each sentence is itself a list of words. From the docs:
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
Right now, it treats each word in your list b as a sentence, so it is running Word2Vec on each character in each word, as opposed to each word in b. As it stands, you can do:
model = gensim.models.Word2Vec(b,min_count=1,size=32)
print(model['a'])
array([ 7.42487283e-03, -5.65282721e-03, 1.28707094e-02, ... ]
To get it to work for words, simply wrap b in another list so that it is interpreted correctly:
model = gensim.models.Word2Vec([b],min_count=1,size=32)
print(model['buy'])
array([-0.01331611, 0.00496594, -0.00165093, -0.01444992, 0.01393849, ... ]
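To see why single characters end up in the vocabulary, here is a minimal sketch in plain Python. It is not Gensim's actual code: build_vocab is a hypothetical stand-in that only mimics the "iterable of sentences, each an iterable of tokens" contract described above, and tokens is a shortened version of the list b from the question.

```python
from collections import Counter


def build_vocab(sentences):
    """Count tokens the way a sentence-based vocabulary scan would.

    Hypothetical helper for illustration only, not part of Gensim.
    """
    vocab = Counter()
    for sentence in sentences:   # each item is assumed to be one sentence
        for token in sentence:   # iterating a str yields single characters
            vocab[token] += 1
    return vocab


tokens = ['let', 'know', 'buy', 'someth', 'featur', 'buy']

# Flat list: every word is mistaken for a sentence, so the tokens are characters.
char_vocab = build_vocab(tokens)
print('buy' in char_vocab)   # False
print('b' in char_vocab)     # True

# Wrapped list: one sentence made of word tokens.
word_vocab = build_vocab([tokens])
print(word_vocab['buy'])     # 2
```

A Python string is itself an iterable of characters, which is why the flat list is accepted without complaint and the mistake only surfaces later as a KeyError.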
Answered by Ravi G
From the docs, you need to pass an iterable of sentences. Whatever you pass to the function is treated as an iterable of sentences, so here, where you are passing only words, it computes a word2vec vector for each character in the whole corpus.
So, to avoid that problem, pass the list of words inside a list.
word2vec_model = gensim.models.Word2Vec([b],min_count=1,size=32)
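If the same code sometimes receives one tokenized document and sometimes several, a small guard can do the wrapping automatically. This is only a sketch: as_sentences is a hypothetical helper, not a Gensim function, and it assumes that a corpus whose items are all strings is meant to be a single sentence.

```python
def as_sentences(corpus):
    """Return corpus as a list of token lists.

    Hypothetical helper: a flat list of strings is assumed to be ONE
    sentence and gets wrapped; anything else is returned as a plain list.
    """
    corpus = list(corpus)
    if corpus and all(isinstance(item, str) for item in corpus):
        return [corpus]
    return corpus


flat = ['let', 'know', 'buy']
nested = [['let', 'know'], ['buy']]

print(as_sentences(flat))     # [['let', 'know', 'buy']]
print(as_sentences(nested))   # [['let', 'know'], ['buy']]
```

The result can then be passed in place of [b], for example gensim.models.Word2Vec(as_sentences(b), min_count=1, size=32).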