
Note: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/49710537/


PyTorch / Gensim - How to load pre-trained word embeddings

python, neural-network, pytorch, gensim, embedding

Asked by MBT

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.


So my question is, how do I get the embedding weights loaded by gensim into the PyTorch embedding layer.


Thanks in Advance!


Answered by MBT

I just wanted to report my findings about loading a gensim embedding with PyTorch.




  • Solution for PyTorch 0.4.0 and newer:


From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.


import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)
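from_pretrained() freezes the embedding weights by default. If you want them to be fine-tuned during training, you can pass freeze=False (a small addition to the documentation example, not part of the original answer):

# freeze=False makes the pretrained embedding weights trainable again
trainable_embedding = nn.Embedding.from_pretrained(weight, freeze=False)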

The weights from gensim can easily be obtained by:


import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated

As noted by @Guglie: in newer gensim versions the weights can be accessed through model.wv:


weights = torch.FloatTensor(model.wv.vectors)


  • Solution for PyTorch version 0.3.1 and older:


I'm using version 0.3.1 and from_pretrained() isn't available in this version.


Therefore I created my own from_pretrained so I can also use it with 0.3.1.


Code for from_pretrained for PyTorch versions 0.3.1 or lower:


def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
         'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

The embedding can then be loaded just like this:


embedding = from_pretrained(weights)

I hope this is helpful for someone.


Answered by jdhao

I think it is easy. Just copy the embedding weights from gensim into the corresponding weights of the PyTorch embedding layer.


You need to make sure of two things: first, the weight shape has to be correct; second, the weights have to be converted to the PyTorch FloatTensor type.

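A minimal sketch of that manual copy, assuming a word2vec file loaded with gensim (the path and variable names are placeholders, not from the original answer):

import torch
import torch.nn as nn
import gensim

# load the pretrained vectors with gensim (path is a placeholder)
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')

# the weight shape must match (vocab_size, embedding_dim)
vocab_size, embedding_dim = model.vectors.shape
embedding = nn.Embedding(vocab_size, embedding_dim)

# convert to FloatTensor and copy into the layer's weight
embedding.weight.data.copy_(torch.FloatTensor(model.vectors))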

Answered by caterButter

Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"


I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.


import os
from os.path import basename

import torch
from gensim.models import KeyedVectors
from torchtext import vocab

def convert_bin_emb_txt(out_path, emb_file):
    # convert a binary word2vec file to text format so torchtext can read it
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file

emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

# TEXT, train_data and MAX_VOCAB_SIZE come from the linked torchtext tutorial
TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)

Tested with PyTorch 1.2.0 and TorchText 0.4.0.


I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not in the embeddings using the normal distribution, and make the <unk> and <pad> vectors equal to zero (a sketch of that last step follows below).

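For completeness, a hedged sketch of that step in the style of the linked tutorial (TEXT is assumed to be the torchtext field built in the snippet above; the exact indexes come from the torchtext vocabulary):

import torch
import torch.nn as nn

# indexes of the unknown and padding tokens in the torchtext vocabulary
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
embedding_dim = TEXT.vocab.vectors.shape[1]

# build the embedding layer from the vocab vectors, then zero <unk> and <pad>
embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)
embedding.weight.data[UNK_IDX] = torch.zeros(embedding_dim)
embedding.weight.data[PAD_IDX] = torch.zeros(embedding_dim)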

Answered by robodasha

I had the same question, except that I use the torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):


import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab

# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)

# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])

# build vocabulary
text_field.build_vocab(dataset)

# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)

# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)

# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))

# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)

Answered by Jibin Mathew

from gensim.models import Word2Vec

model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
# gensim model created

import torch
import torch.nn as nn

weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
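To look a word up in this embedding you still need gensim's word-to-index mapping; a small illustrative sketch (the example word is hypothetical, and the attribute names are for gensim 3.x, which this size= argument implies; gensim 4.x uses model.wv.key_to_index instead):

# map a word to its row index in the weight matrix (gensim 3.x API)
word = "good"  # hypothetical example word, assumed to be in the vocabulary
idx = model.wv.vocab[word].index

# look up its vector through the PyTorch embedding layer
vector = embedding(torch.LongTensor([idx]))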

Answered by Victor Zuanazzi

I had quite some problems understanding the documentation myself, and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier that takes the pretrained embeddings in matrix_embeddings. By setting requires_grad to False we make sure that we are not changing them.


class InferClassifier(nn.Module):
  def __init__(self, input_dim, n_classes, matrix_embeddings):
    """initializes a 2 layer MLP for classification.
    There are no non-linearities in the original code, Katia instructed us 
    to use tanh instead"""

    super(InferClassifier, self).__init__()

    #dimensionalities
    self.input_dim = input_dim
    self.n_classes = n_classes
    self.hidden_dim = 512

    #embedding
    self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
    # from_pretrained already freezes the weights; make the intent explicit
    self.embeddings.weight.requires_grad = False

    #creates a MLP
    self.classifier = nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Tanh(), #not present in the original code.
            nn.Linear(self.hidden_dim, self.n_classes))

  def forward(self, sentence):
    """forward pass of the classifier
    I am not sure it is necessary to make this explicit."""

    #get the embeddings for the inputs
    u = self.embeddings(sentence)

    #forward to the classifier
    return self.classifier(u)

sentence is a vector with indexes into matrix_embeddings instead of words.

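A hypothetical usage sketch (vocabulary, dimensions and class count are illustrative, not from the original answer):

# pretrained weights from a gensim model, shape (vocab_size, embedding_dim)
matrix_embeddings = torch.FloatTensor(model.wv.vectors)

classifier = InferClassifier(input_dim=matrix_embeddings.size(1),
                             n_classes=3,
                             matrix_embeddings=matrix_embeddings)

# a "sentence" is a LongTensor of word indexes into matrix_embeddings
sentence = torch.LongTensor([4, 8, 15])
output = classifier(sentence)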