Python gensim word2vec: Find number of words in vocabulary

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/35596031/

Date: 2020-08-19 16:39:03  Source: igfitidea

gensim word2vec: Find number of words in vocabulary

Tags: python, neural-network, nlp, gensim, word2vec

Asked by hlin117

After training a word2vec model using python gensim, how do you find the number of words in the model's vocabulary?

Accepted answer by gojomo

The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary whose keys are the individual tokens (words). So getting its size is just the usual Python for a dictionary's length:

len(w2v_model.wv.vocab)

(In older gensim versions before 0.13, vocab appeared directly on the model, so you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
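
Because the attribute holding the vocabulary has moved between gensim releases, a small version-tolerant helper can be handy. The sketch below is only an illustration and not part of gensim: vocab_size is a hypothetical name, and it assumes that gensim 4.x exposes the word-to-index dictionary as wv.key_to_index (where wv.vocab is no longer available), falling back to the vocab dictionary described above on older releases.

```python
def vocab_size(model):
    """Return the number of words known to a trained gensim model.

    Hypothetical helper (not a gensim API). Accepts either a full
    Word2Vec model or a bare KeyedVectors-style object.
    """
    # Full models keep their word vectors on `.wv`; very old models and
    # bare KeyedVectors objects have no `.wv`, so use the object itself.
    kv = getattr(model, "wv", model)

    # Assumption: gensim 4.x stores a {word: row_index} dict here.
    key_to_index = getattr(kv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)

    # Older releases: `vocab` is a dict keyed by each token.
    return len(kv.vocab)
```

With a trained model this would be called as vocab_size(w2v_model); on gensim 4.x, len(w2v_model.wv) should report the same number directly.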

Answer by kmario23

Another way to get the vocabulary size is from the shape of the embedding matrix itself, which has one row per word:

In [33]: from gensim.models import Word2Vec

# load the pretrained model (`pretrained_model` is the path to a saved model file)
In [34]: model = Word2Vec.load(pretrained_model)

# get the shape of embedding matrix    
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)

# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109