Python Gensim: how to calculate document similarity using the LDA model?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/22433884/
Asked by still_st
I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!
Accepted answer by Palisand
Don't know if this'll help, but I managed to attain successful results on document matching and similarity when using the actual document as a query.
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda")  # result from running online LDA (training)

# Build a similarity index over the LDA representation of the whole corpus
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()

# Convert the query document into LDA topic space
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

# Similarity of the query against every document in the corpus
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
The similarity scores between the query document and every document residing in the corpus are the second element of each (document id, score) tuple in sims.
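For example, a small sketch (reusing sims from the snippet above; the top_n cutoff is only for illustration) that prints the best matches with their scores:

# Print the top-n most similar corpus documents as (doc_id, score) pairs
top_n = 3
for doc_id, score in sims[:top_n]:
    print("corpus document %d: similarity %.4f" % (doc_id, score))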
Answered by Radim
It depends on what similarity metric you want to use.
Cosine similarity is universally useful and built in:
sim = gensim.matutils.cossim(vec_lda1, vec_lda2)
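A minimal end-to-end sketch, assuming the lda model and corpus from the accepted answer above are already loaded:

import gensim

# Sparse topic distributions, i.e. lists of (topic_id, probability) pairs
vec_lda1 = lda[corpus[0]]
vec_lda2 = lda[corpus[1]]

# Cosine similarity between the two sparse LDA vectors
sim = gensim.matutils.cossim(vec_lda1, vec_lda2)
print(sim)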
Hellinger distance is useful for similarity between probability distributions (such as LDA topics):
import numpy as np
import gensim

# Convert sparse LDA vectors to dense distributions over topics
dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())  # Hellinger distance
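Recent gensim releases also ship a helper that computes the same quantity; a short sketch, assuming your gensim version includes matutils.hellinger:

from gensim.matutils import hellinger

# Accepts either sparse (topic_id, probability) lists or dense vectors
sim = hellinger(dense1, dense2)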
Answered by eng.mrgh
The provided answers are good, but they aren't very beginner-friendly, so I will start from training the LDA model and then calculate cosine similarity.

Training the model:
docs = ["latent Dirichlet allocation (LDA) is a generative statistical model",
"each document is a mixture of a small number of topics",
"each document may be viewed as a mixture of various topics"]
# Convert document to tokens
docs = [doc.split() for doc in docs]
# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]
# Training the model
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
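To sanity-check what was learned, you can inspect the topics of the trained model; a quick sketch using print_topics (the num_words value is arbitrary):

# Show the top words of every learned topic
for topic_id, words in model.print_topics(num_topics=10, num_words=5):
    print(topic_id, words)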
There are generally two ways to extract the probability assigned to each topic for a document. I provide both here:
# Preprocess the test documents the same way as the training documents
test_doc = ["LDA is an example of a topic model",
            "topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]

# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))

# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))
Output:
#Method 1
0.8279631530869963
#Method 2
0.828066885140262
As you can see, both methods give essentially the same result; the difference is that the probabilities returned by the second method sometimes don't add up to one, as discussed here. For a large corpus, the probability vectors can be obtained by passing the whole corpus:
# Method 1
probability_vector = model.get_document_topics(test_corpus, minimum_probability=0)
# Method 2
probability_vector = model[test_corpus]
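Either call yields one topic vector per document; a small sketch of consuming the result, assuming the probability_vector from above:

# Each entry is a list of (topic_id, probability) pairs for one test document
for doc_topics in probability_vector:
    print(doc_topics)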
NOTE: The sum of the probabilities assigned to each topic in a document may come out slightly higher than 1, or in some cases slightly less than 1, because of floating-point rounding errors.
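If an exact unit sum matters downstream, the vector can simply be renormalized; a tiny sketch (not part of the original answer), assuming a (topic_id, probability) list such as doc1 above:

import numpy as np

# Force the topic distribution to sum to exactly 1.0
probs = np.array([p for _, p in doc1])
probs /= probs.sum()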

