Python Doc2Vec: get the most similar documents

Question

Asked by Clock Slave

I am trying to build a document retrieval model that returns documents ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is a pandas DataFrame in which each row stores one document as a string. This is the code I have so far:

import multiprocessing
import re

import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence  # gensim < 4.0; later releases use TaggedDocument

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences = []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line, [item_no]))  # tag each document with its row number

# MODEL PARAMETERS
dm = 1  # 1 for distributed memory (default); 0 for dbow
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(
    documents=sentences,
    dm=dm,
    alpha=alpha,  # initial learning rate
    seed=seed,
    min_count=min_count,  # ignore words with freq less than min_count
    max_vocab_size=None,  # no limit on vocabulary size
    window=context_window,  # the number of words before and after to be used as context
    size=size,  # dimensionality of the feature vector
    sample=1e-4,  # threshold for downsampling high-frequency words
    negative=5,  # number of noise words drawn for negative sampling
    workers=cores,  # number of cores
    iter=max_iter,  # number of iterations (epochs) over the corpus
)

# QUERY BASED DOC RANKING ??

The part where I am struggling is finding the documents that are most similar/relevant to the query. I used infer_vector but then realised that it treats the query as a document, updates the model and returns the results. I tried the most_similar and most_similar_cosmul methods, but they return words along with a similarity score (I guess). What I want is this: when I enter a search string (a query), I should get the ids of the most relevant documents along with a similarity score (cosine etc.). How do I get this part done?

Answer 1

Answered by Erock

You need to use infer_vector to get a document vector for the new text; this does not alter the underlying model.

Here is how you do it:

tokens = "a new sentence to match".split()

new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector]) #gives you top 10 document tags and their cosine similarity
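To make the ranking step concrete, here is a minimal pure-Python sketch of what "top documents by cosine similarity" means. It is not gensim's implementation: the helper names `cosine` and `rank_by_cosine` are hypothetical, and the toy vectors are made up to stand in for trained document vectors and for the output of `model.infer_vector(tokens)`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs, topn=10):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy data (assumption): ids play the role of document tags.
doc_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
query_vec = [2.0, 0.1]  # stands in for an inferred query vector

ranking = rank_by_cosine(query_vec, doc_vecs, topn=2)
print(ranking)  # document 0 ranks first, then document 2
```

Since the question tags each document with its row number, the ids in the returned pairs can be used to look the original text up again, for example with `data.loc[doc_id, 'merged']`.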

Edit:

Here is an example showing that the underlying model does not change after infer_vector is called.

import numpy as np

words = "king queen man".split()

len_before = len(model.docvecs)  # number of docs before inference

# word vectors for king, queen, man
w_vec0 = model[words[0]]
w_vec1 = model[words[1]]
w_vec2 = model[words[2]]

new_vec = model.infer_vector(words)

len_after = len(model.docvecs)  # number of docs after inference

# the stored word vectors are unchanged
print(np.array_equal(model[words[0]], w_vec0))  # True
print(np.array_equal(model[words[1]], w_vec1))  # True
print(np.array_equal(model[words[2]], w_vec2))  # True

# no document vector was added
print(len_before == len_after)  # True
