Python: how to check if a key exists in a word2vec trained model

Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/30301922/

Date: 2020-08-19 08:12:15  Source: igfitidea

How to check if a key exists in a word2vec trained model or not

python gensim word2vec

Asked by London guy

I have trained a word2vec model with Gensim on a corpus of documents. Once the model is trained, I use the following piece of code to get the raw feature vector of a word, say "view".

myModel["view"]

However, I get a KeyError for the word, probably because it doesn't exist as a key in the list of keys indexed by word2vec. How can I check whether a key exists in the index before trying to get the raw feature vector?

Accepted answer by rakaT

Access the model's word vectors with:

word_vectors = model.wv

Then we can use:

if 'word' in word_vectors.vocab:
    print('word is in the vocabulary')

Answer by London guy

Answering my own question here.

Word2Vec provides a method named __contains__('view'), which returns True or False depending on whether the corresponding word has been indexed.
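
Python's "in" operator is what calls __contains__, so the check is usually written as 'view' in myModel rather than by calling the method by name. A minimal sketch of that dispatch follows; TinyVectors is a hypothetical stand-in class (not part of gensim), used only so the example runs without a trained model:

```python
class TinyVectors:
    """Hypothetical stand-in for a trained model's word index (not gensim)."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)

    def __contains__(self, word):
        # Called by the `in` operator.
        return word in self._vectors

    def __getitem__(self, word):
        # Raises KeyError for unknown words, like the lookup in the question.
        return self._vectors[word]


my_model = TinyVectors({"view": [0.1, 0.2, 0.3]})

has_view = "view" in my_model        # dispatches to __contains__
has_missing = "missing" in my_model  # no KeyError, just False

print(has_view, has_missing)
```

Checking with "in" first avoids the KeyError entirely, at the cost of a second lookup when the word is present.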

Answer by Matt Fortier

Word2Vec also provides a 'vocab' member, which you can access directly.

Using a Pythonic approach:

if word in w2v_model.vocab:
    # Do something

EDIT: Since gensim release 2.0, the API for Word2Vec has changed. To access the vocabulary you should now use this:

if word in w2v_model.wv.vocab:
    # Do something

EDIT 2: The attribute 'wv' is being deprecated and will be completely removed in gensim 4.0.0. Now it's back to the original answer by OP:

if word in w2v_model.vocab:
    # Do something
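
Because the attribute holding the vocabulary has moved between releases, one option is to probe for whichever layout is present instead of hard-coding one. The sketch below is an illustration, not gensim's own API: has_word is a hypothetical helper, and the SimpleNamespace objects are stand-ins for trained models. It assumes that released gensim 4.x keeps 'wv' and exposes the vocabulary as 'wv.key_to_index', while earlier releases expose 'vocab'; check the migration notes for the version you have installed.

```python
from types import SimpleNamespace


def has_word(model, word):
    """Return True if `word` is in the model's vocabulary, across API layouts."""
    # Newer layouts keep the vectors under `model.wv`; older ones use the model itself.
    vectors = getattr(model, "wv", model)
    # Assumed attribute names: `key_to_index` (gensim 4.x), `vocab` (earlier releases).
    for attr in ("key_to_index", "vocab"):
        index = getattr(vectors, attr, None)
        if index is not None:
            return word in index
    # Last resort: rely on the object's own support for `in`.
    return word in vectors


# Stand-ins for the three layouts discussed above (not real gensim models).
old_style = SimpleNamespace(vocab={"view": 0})
mid_style = SimpleNamespace(wv=SimpleNamespace(vocab={"view": 0}))
new_style = SimpleNamespace(wv=SimpleNamespace(key_to_index={"view": 0}))

for name, model in [("old", old_style), ("mid", mid_style), ("new", new_style)]:
    print(name, has_word(model, "view"), has_word(model, "missing"))
```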

Answer by Nomiluks

Hey, I know I'm late to this post, but here is a piece of code that handles this issue well. I use it in my own code and it works like a charm :)

size = 300     # word vector size
word = 'food'  # word token

try:
    # newer gensim releases access vectors via model.wv[word]
    wordVector = model[word].reshape((1, size))
except KeyError:
    print("not found!", word)

NOTE: I am using the Python Gensim library for word2vec models.
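
The same try/except idea can be wrapped in a small function so callers get a fallback value instead of an exception. This is a minimal sketch under stated assumptions: vector_or_default is a hypothetical helper, and a plain dict stands in for the trained model; any object whose lookup raises KeyError for unknown words should behave the same way.

```python
def vector_or_default(vectors, word, default=None):
    """Look up `word`; return `default` instead of raising KeyError."""
    try:
        return vectors[word]
    except KeyError:
        return default


# A plain dict standing in for a trained model (illustrative values only).
toy_vectors = {"food": [0.5, 0.25], "view": [0.1, 0.9]}

found = vector_or_default(toy_vectors, "food")
missing = vector_or_default(toy_vectors, "pizza")

print(found, missing)
```

This performs a single lookup per word, which can be preferable to a separate membership check when most words are expected to be present.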

Answer by Prakhar Agarwal

I generally use a filter:

for doc in labeled_corpus:
    words = filter(lambda x: x in model.vocab, doc.words)

This is one simple method for getting past the KeyError on unseen words.
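
Note that in Python 3, filter returns a lazy iterator rather than a list, so a list comprehension is a common alternative when the filtered words are needed more than once. A small sketch of the same filtering idea; in_vocab_only is a hypothetical helper, and the dict and token lists are made-up stand-ins for the model vocabulary and the labeled corpus:

```python
def in_vocab_only(words, vocab):
    """Keep only the tokens that appear in `vocab`, preserving order."""
    return [w for w in words if w in vocab]


# Stand-ins for the model vocabulary and the tokenized documents.
vocab = {"food": 0, "view": 1}
docs = [["good", "food"], ["nice", "view", "food"]]

filtered = [in_vocab_only(doc, vocab) for doc in docs]
print(filtered)
```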
