Python: How to use Gensim doc2vec with pre-trained word vectors?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, credit the original authors (not me), and link back to the original: StackOverflow
Original question: http://stackoverflow.com/questions/27470670/
How to use Gensim doc2vec with pre-trained word vectors?
Asked by Stergios
I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec?
Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector training?
Thanks.
Answer by AaronD
Radim just posted a tutorial on the doc2vec features of gensim (yesterday, I believe - your question is timely!).
Gensim supports loading pre-trained vectors from the C implementation, as described in the gensim models.word2vec API documentation.
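For reference, a minimal loading sketch is below. The exact location of the call is an assumption that depends on the gensim version: in the 0.x releases it was a classmethod on Word2Vec, while recent releases expose it on KeyedVectors instead, and the Google News file name is simply the usual download from the word2vec site.

from gensim.models import Word2Vec

# Load vectors produced by the original C word2vec tool (binary format).
pretrained = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(pretrained["computer"])                                  # 300-dimensional vector for "computer"
print(pretrained.most_similar(positive=["computer"], topn=3))  # quick sanity check on the loaded vectors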
Answer by STEVE Guo
Well, I have recently been using Doc2Vec too. I was thinking of using the LDA result as the word vectors, fixing those word vectors, and training only the document vectors. The result isn't very interesting though; maybe it's just that my data set isn't that good. The code is below. Doc2Vec stores word vectors and document vectors together in doc2vecmodel.syn0, and you can directly change the vector values. The only problem is that you need to find out which position in syn0 represents which word or document, since the vectors are stored in syn0 in no particular order.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
import gensim
from sklearn import svm, metrics
import numpy
#Read in texts into div_texts (for LDA and Doc2Vec)
div_texts = []
f = open("clean_ad_nonad.txt")
lines = f.readlines()
f.close()
for line in lines:
    div_texts.append(line.strip().split(" "))
#Set up dictionary and MMcorpus
dictionary = corpora.Dictionary(div_texts)
dictionary.save("ad_nonad_lda_deeplearning.dict")
#dictionary = corpora.Dictionary.load("ad_nonad_lda_deeplearning.dict")
print dictionary.token2id["junk"]
corpus = [dictionary.doc2bow(text) for text in div_texts]
corpora.MmCorpus.serialize("ad_nonad_lda_deeplearning.mm", corpus)
#LDA training
id2token = {}
token2id = dictionary.token2id
for onemap in dictionary.token2id:
    id2token[token2id[onemap]] = onemap
#ldamodel = models.LdaModel(corpus, num_topics = 100, passes = 1000, id2word = id2token)
#ldamodel.save("ldamodel1000pass.lda")
#ldamodel = models.LdaModel(corpus, num_topics = 100, id2word = id2token)
ldamodel = models.LdaModel.load("ldamodel1000pass.lda")
ldatopics = ldamodel.show_topics(num_topics = 100, num_words = len(dictionary), formatted = False)
print ldatopics[10][1]
print ldatopics[10][1][1]
ldawordindex = {}
for i in range(len(dictionary)):
    ldawordindex[ldatopics[0][i][1]] = i
#Doc2Vec initialize
sentences = []
for i in range(len(div_texts)):
    string = "SENT_" + str(i)
    sentence = models.doc2vec.LabeledSentence(div_texts[i], labels = [string])
    sentences.append(sentence)
doc2vecmodel = models.Doc2Vec(sentences, size = 100, window = 5, min_count = 0, dm = 1)
print "Initial word vector for word junk:"
print doc2vecmodel["junk"]
#Replace the word vector with word vectors from LDA
print len(doc2vecmodel.syn0)
index2wordcollection = doc2vecmodel.index2word
print index2wordcollection
for i in range(len(doc2vecmodel.syn0)):
    if index2wordcollection[i].startswith("SENT_"):
        continue
    wordindex = ldawordindex[index2wordcollection[i]]
    wordvectorfromlda = [ldatopics[j][wordindex][0] for j in range(100)]
    doc2vecmodel.syn0[i] = wordvectorfromlda
#print doc2vecmodel.index2word[26841]
#doc2vecmodel.syn0[0] = [0 for i in range(100)]
print "Changed word vector for word junk:"
print doc2vecmodel["junk"]
#Train Doc2Vec
doc2vecmodel.train_words = False
print "Initial doc vector for 1st document"
print doc2vecmodel["SENT_0"]
for i in range(50):
    print "Round: " + str(i)
    doc2vecmodel.train(sentences)
print "Trained doc vector for 1st document"
print doc2vecmodel["SENT_0"]
#Using SVM to do classification
resultlist = []
for i in range(4143):
    string = "SENT_" + str(i)
    resultlist.append(doc2vecmodel[string])
svm_x_train = []
for i in range(1000):
    svm_x_train.append(resultlist[i])
for i in range(2210,3210):
    svm_x_train.append(resultlist[i])
print len(svm_x_train)
svm_x_test = []
for i in range(1000,2210):
    svm_x_test.append(resultlist[i])
for i in range(3210,4143):
    svm_x_test.append(resultlist[i])
print len(svm_x_test)
svm_y_train = numpy.array([0 for i in range(2000)])
for i in range(1000,2000):
    svm_y_train[i] = 1
print svm_y_train
svm_y_test = numpy.array([0 for i in range(2143)])
for i in range(1210,2143):
    svm_y_test[i] = 1
print svm_y_test
svc = svm.SVC(kernel='linear')
svc.fit(svm_x_train, svm_y_train)
expected = svm_y_test
predicted = svc.predict(svm_x_test)
print("Classification report for classifier %s:\n%s\n"
% (svc, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))
print doc2vecmodel["junk"]
Answer by gojomo
Note that the "DBOW" (dm=0
) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode).
(Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested would co-train words. However, I don't believe this ever actually worked. Starting in gensim 0.12.0, there is the parameter dbow_words, which skip-gram trains words simultaneously with the DBOW doc-vectors. Note that this makes training take longer – by a factor related to window. So if you don't need word-vectors, you may still leave this off.)
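A rough sketch of the two DBOW variants described above, assuming the 0.12-era parameter names and reusing the sentences list of LabeledSentence objects from the earlier answer's code:

from gensim.models import Doc2Vec

# Pure DBOW: only doc-vectors are trained; word-vectors are left untrained.
model_dbow = Doc2Vec(sentences, dm=0, size=100, window=5, min_count=2)

# DBOW plus interleaved skip-gram word training: slower, but word-vectors are learned too.
model_dbow_words = Doc2Vec(sentences, dm=0, dbow_words=1, size=100, window=5, min_count=2)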
In the "DM" training method (dm=1
), word-vectors are inherently trained during the process along with doc-vectors, and are likely to also affect the quality of the doc-vectors. It's theoretically possible to pre-initialize the word-vectors from prior data. But I don't know any strong theoretical or experimental reason to be confident this would improve the doc-vectors.
One fragmentary experiment I ran along these lines suggested the doc-vector training got off to a faster start – better predictive qualities after the first few passes – but this advantage faded with more passes. Whether you hold the word vectors constant or let them continue to adjust with the new training is also likely an important consideration... but which choice is better may depend on your goals, data set, and the quality/relevance of the pre-existing word-vectors.
(You could repeat my experiment with the intersect_word2vec_format() method available in gensim 0.12.0, and try different levels of making pre-loaded vectors resistant-to-new-training via the syn0_lockf values. But remember this is experimental territory: the basic doc2vec results don't rely on, or even necessarily improve with, reused word vectors.)
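A sketch of what such an experiment could look like with the 0.12-era API; the method and attribute names here are assumptions tied to that version (they have since been renamed or moved), and sentences is again the list of tagged documents from the earlier answer:

from gensim.models import Doc2Vec

model = Doc2Vec(size=300, window=5, min_count=2, dm=1)   # DM mode, so word-vectors matter
model.build_vocab(sentences)

# Overwrite the vectors of in-vocabulary words with pre-trained C-format vectors.
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# syn0_lockf controls how much each word-vector may move during training:
# 0.0 freezes a vector, 1.0 lets it train normally.
model.syn0_lockf[:] = 1.0    # let the pre-loaded word-vectors keep adjusting
# model.syn0_lockf[:] = 0.0  # ...or hold them constant instead

model.train(sentences)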
Answer by Álvaro Marco
This forked version of gensim allows loading pre-trained word vectors for training doc2vec. Here you have an example of how to use it. The word vectors must be in the C word2vec tool's text format: one line per word vector, where first comes a string representing the word and then space-separated float values, one for each dimension of the embedding.
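For illustration, a toy vectors file in that text format with a two-word vocabulary and three dimensions might look like the lines below; the header line with the vocabulary size and dimensionality is what the original C tool writes, though whether this fork expects it is an assumption worth checking:

2 3
dog 0.418 -0.249 0.412
cat 0.013 0.236 -0.168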
This work belongs to a paper in which they claim that using pre-trained word embeddings actually helps build the document vectors. However, I am getting almost the same results whether I load the pre-trained embeddings or not.
Edit: actually, there is one remarkable difference in my experiments. When I loaded the pretrained embeddings, I only needed to train doc2vec for half as many iterations to get almost the same results (training longer than that produced worse results in my task).