gensim word2vec: Find number of words in vocabulary
This content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35596031/
Asked by hlin117
Accepted answer by gojomo
The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary, with the keys being each token (word). So it's just the usual Python for getting a dictionary's length:
len(w2v_model.wv.vocab)
(In older gensim versions before 0.13, vocab appeared directly on the model. So you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
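Since the attribute holding the vocabulary depends on the gensim version, a small helper can hide the difference. The following is only a sketch: the function name vocab_size is hypothetical, and it assumes that gensim 4.0 and later expose the vocabulary as the key_to_index dictionary on wv (the vocab attribute was removed in that release), while earlier versions behave as described above.

```python
def vocab_size(model):
    """Return the number of words known to a gensim Word2Vec model.

    Hypothetical helper: tries the attribute layouts of several
    gensim generations, newest first.
    """
    # Since gensim 0.13 the vocabulary lives on `model.wv`;
    # before that it sat directly on the model object.
    kv = getattr(model, "wv", model)

    # Assumption: gensim 4.x stores the vocabulary in `key_to_index`,
    # a dict mapping each token to its row index.
    key_to_index = getattr(kv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)

    # gensim < 4.0: `vocab` is a dict keyed by token.
    return len(kv.vocab)
```

With a loaded model, vocab_size(w2v_model) should give the same number as the len(...) expression that matches the installed gensim version.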
Answer by kmario23
Another way to get the vocabulary size is to read it from the shape of the embedding matrix itself:
In [33]: from gensim.models import Word2Vec
# load the pretrained model (pretrained_model holds the path to the saved model file)
In [34]: model = Word2Vec.load(pretrained_model)
# get the shape of embedding matrix
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)
# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109
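The same shape tuple also carries the vector dimensionality, so both numbers can be read in one step. This is a minimal sketch: embedding_shape is a hypothetical name, and it assumes the model exposes the matrix as wv.vectors, as in the session above (older gensim releases may use a different attribute name).

```python
def embedding_shape(model):
    """Return (vocabulary_size, vector_size) from the embedding matrix.

    Hypothetical helper; assumes `model.wv.vectors` is a 2-D array
    with one row per vocabulary word.
    """
    # Axis 0 counts the words, axis 1 is the length of each vector.
    n_words, n_dims = model.wv.vectors.shape
    return n_words, n_dims
```

For the model in the session above, this would return the pair shown in Out[35].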