Python Doc2vec: How to get document vectors

Notice: This page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/31321209/

Time: 2020-08-19 09:47:58  Source: igfitidea

Doc2vec: How to get document vectors

Tags: python, gensim, word2vec

Asked by bee2502

How do I get the document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.

I am using gensim.

doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)

I get

AttributeError: 'list' object has no attribute 'words'

whenever I run this.

Answered by bee2502

doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2Vec() were not in the correct LabeledSentence format. I hope the example below helps you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1']) 
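
To make the cause of the error concrete, here is a minimal stdlib-only sketch. It does not use gensim: TaggedDoc and first_words are hypothetical names, and first_words only mimics a consumer that reads a .words attribute from every document, which is what the error message suggests Doc2Vec does internally.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a list of words plus a list of tags.
TaggedDoc = namedtuple("TaggedDoc", "words tags")


def first_words(corpus):
    """Mimic a consumer that reads `.words` from each document in the corpus."""
    return [doc.words[0] for doc in corpus]


# A plain list of tokens has no `.words` attribute.
plain_lists = [["this", "is", "a", "sentence"]]
# The same tokens wrapped in an object that carries `.words` and `.tags`.
tagged = [TaggedDoc(words=["this", "is", "a", "sentence"], tags=[0])]

try:
    first_words(plain_lists)
except AttributeError as err:
    error_message = str(err)
    print(error_message)

print(first_words(tagged))
```

The plain list fails in the same way as the code in the question, while the wrapped document passes.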

More details are here: http://rare-technologies.com/doc2vec-tutorial/

However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.

sentences=doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences,size = 100, window = 300, min_count = 10, workers=4)
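
As a rough illustration of the one-document-per-line file format (not gensim's actual implementation), the stdlib-only sketch below writes a tiny corpus to a temporary file and reads it back as tagged documents. TaggedDoc and read_tagged_lines are hypothetical names, and zero-based line numbers are an assumption about how the tags are numbered.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a list of words plus a list of tags.
TaggedDoc = namedtuple("TaggedDoc", "words tags")


def read_tagged_lines(path):
    """Yield one TaggedDoc per line; the tag is the line number (assumed zero-based)."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            # Words are assumed to be preprocessed and separated by whitespace.
            yield TaggedDoc(words=line.split(), tags=[line_no])


# Write a two-document corpus to a temporary file, one document per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("this is a sentence\nthis is another sentence\n")
    corpus_path = tmp.name

tagged_docs = list(read_tagged_lines(corpus_path))
os.remove(corpus_path)

for doc in tagged_docs:
    print(doc.tags, doc.words)
```

Because the tags come from line numbers, they are integers, which matters when looking vectors up later.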

To get a document vector, you can use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99] 

where 99 is the id of the document whose vector we want. If the labels are integers (the default if you load the data with TaggedLineDocument()), use the integer id directly, as I did. If the labels are strings, use "SENT_99". This is similar to Word2vec.
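
A small stdlib-only sketch of the integer-versus-string distinction follows. It is only an analogy: a plain dict keyed by tag stands in for the trained model's vector store, and TaggedDoc, int_index, and str_index are hypothetical names. The point is that the lookup key has to have the same type as the tag that was assigned to the document.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a list of words plus a list of tags.
TaggedDoc = namedtuple("TaggedDoc", "words tags")

texts = ["this is a sentence", "this is another sentence"]

# Integer tags (what line-number tagging produces) versus explicit string tags.
int_tagged = [TaggedDoc(t.split(), [i]) for i, t in enumerate(texts)]
str_tagged = [TaggedDoc(t.split(), ["SENT_%d" % i]) for i, t in enumerate(texts)]

# A plain dict keyed by the first tag stands in for the model's vector store.
int_index = {doc.tags[0]: doc.words for doc in int_tagged}
str_index = {doc.tags[0]: doc.words for doc in str_tagged}

print(int_index[1])          # looked up with an integer id
print(str_index["SENT_1"])   # looked up with a string label
print(1 in str_index)        # an integer key does not match a string tag
```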

Answered by l.augustyniak

Gensim was updated. LabeledSentence no longer takes labels; it now takes tags. See the documentation for LabeledSentence: https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with

docvec = model.docvecs[99] 

It will show the 100th vector's value for the trained model. It works with integers and strings.

Answered by Lenka Vraná

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (ids of the documents). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)

model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors

model.docvecs[0]
model.docvecs[1]

UPDATE (how to train in epochs): This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.

Answered by MovingKyu

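from gensim.models.doc2vec import Doc2Vec, TaggedDocument
doc1 = ["This is a sentence", "This is another sentence"]
# Each document is a list of words plus a list of tags (here, its index in doc1)
documents = [TaggedDocument(doc.lower().split(), [i]) for i, doc in enumerate(doc1)]
model = Doc2Vec(documents, min_count=1)  # add other parameters as needed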

This should work fine. You need to tag your documents to train a doc2vec model.
