Python: Understanding LDA implementation using gensim

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20349958/

Understanding LDA implementation using gensim

Tags: python, topic-modeling, gensim, dirichlet

Asked by visakh

I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation. I am doing the following:

Define the dataset

documents = ["Apple is releasing a new product", 
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]             

After removing stopwords, I create the dictionary and the corpus:

from gensim import corpora

stoplist = set('a'.split())  # not shown in the question; "a" is consistent with the topic output below
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

Then I define the LDA model.

import gensim

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, update_every=1, chunksize=10000, passes=1)

Then I print the topics:

>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product', '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new', '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is', '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new', '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>> 

I'm not able to make much sense of this result. Is it giving the probability of occurrence of each word? Also, what's the meaning of topic #1, topic #2, etc.? I was expecting something more or less like the most important keywords.

I already checked the gensim tutorial, but it didn't really help much.

Thanks.

Accepted answer by Steve P.

The answer you're looking for is in the gensim tutorial. lda.printTopics(k) prints the most contributing words for k randomly selected topics. One can assume that this is (partially) the distribution of words over each of the given topics, meaning the probability of those words appearing in the topic to the left.
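
If you want the raw (word, probability) pairs rather than the formatted string, a minimal sketch (method names assume a reasonably recent gensim, where the camel-case printTopics has become print_topics/show_topic):

# Inspect topic 0 as explicit (word, probability) pairs;
# show_topic() returns the topn most probable words for one topic.
for word, prob in lda.show_topic(0, topn=10):
    print(word, round(prob, 3))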

Usually, one would run LDA on a large corpus. Running LDA on a ridiculously small sample won't give the best results.

Answered by Utsav T

I think this tutorial will help you understand everything very clearly - https://www.youtube.com/watch?v=DDq3OVp9dNA

I too faced a lot of problems understanding it at first. I'll try to outline a few points in a nutshell.

In Latent Dirichlet Allocation,

  • The order of words is not important in a document - the Bag of Words model.
  • A document is a distribution over topics.
  • Each topic, in turn, is a distribution over words belonging to the vocabulary (a short sketch of both distributions follows this list).
  • LDA is a probabilistic generative model. It is used to infer hidden variables using a posterior distribution.
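
A minimal sketch of those two distributions in gensim (method names assume a reasonably recent release):

# Each topic is a distribution over the whole vocabulary:
topic_word = lda.get_topics()    # ndarray of shape (num_topics, vocabulary_size)
print(topic_word.sum(axis=1))    # each row sums to (approximately) 1.0

# And each document is a distribution over topics:
print(lda.get_document_topics(corpus[0]))   # e.g. [(0, 0.03), (4, 0.87), ...]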

Imagine the process of creating a document to be something like this:

  1. Choose a distribution over topics.
  2. Draw a topic, and choose a word from that topic. Repeat this for each word in the document (a toy simulation follows this list).
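
A toy simulation of that generative story (purely illustrative: the vocabulary, topic count, and Dirichlet parameters below are made up):

import numpy as np

vocab = ["apple", "amazon", "microsoft", "nokia", "product"]
num_topics, doc_length = 2, 6
rng = np.random.default_rng(0)

# Step 1: choose this document's distribution over topics.
theta = rng.dirichlet([0.5] * num_topics)
# Each topic is itself a distribution over the vocabulary.
phi = rng.dirichlet([0.5] * len(vocab), size=num_topics)

# Step 2: for every word slot, draw a topic, then draw a word from that topic.
doc = [rng.choice(vocab, p=phi[rng.choice(num_topics, p=theta)]) for _ in range(doc_length)]
print(doc)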

LDA is sort of backtracking along this line: given that you have a bag of words representing a document, what could be the topics it represents?
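
That "backtracking" is exactly what you get by feeding a bag of words into the trained model. A small sketch (the example sentence is made up):

# Infer the topic mixture of a new, unseen document.
new_doc = "apple announces a new acquisition"
bow = dictionary.doc2bow([w for w in new_doc.lower().split() if w not in stoplist])
print(lda[bow])   # e.g. [(2, 0.62), (4, 0.25), ...] as (topic id, probability) pairs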

So, in your case, the first topic (0)

INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product

is more about things, amazon, and many, as they have a higher proportion, and not so much about microsoft or apple, which have significantly lower values.

I would suggest reading this blog for a much better understanding (Edwin Chen is a genius!) - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

Answered by plfrick

Since the above answers were posted, there are now some very nice visualization tools for gaining an intuition of LDA using gensim.

Take a look at the pyLDAvis package. Here is a great notebook overview, and here is a very helpful video description geared toward the end user (9 min tutorial).
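
A minimal usage sketch (assuming a recent pyLDAvis, where the gensim helper module is named pyLDAvis.gensim_models; in older releases it was pyLDAvis.gensim):

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Build the interactive visualization from the trained model, corpus, and dictionary.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis)   # in a notebook; otherwise pyLDAvis.save_html(vis, 'lda.html')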

Hope this helps!

Answered by Abhijeet Singh

For understanding the usage of the gensim LDA implementation, I recently penned blog posts implementing topic modeling from scratch in Python on 70,000 articles dumped from Simple Wikipedia.

There, you will find a detailed explanation of how gensim's LDA can be used for topic modeling, covering the usage of:

  • the ElementTree library for extracting article text from the XML dump file
  • regex filters to clean the articles
  • NLTK stop word removal & lemmatization (a sketch of this step follows the list)
  • LDA from the gensim library
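
A small sketch of the NLTK part of that pipeline (assuming the stopwords and wordnet corpora have already been fetched with nltk.download):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, drop stop words, and lemmatize the remaining tokens."""
    return [lemmatizer.lemmatize(w) for w in text.lower().split() if w not in stop_words]

print(preprocess("Microsoft announces Nokia acquisition"))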

Hope it helps in understanding the LDA implementation of the gensim package.

Part 1

Topic Modelling (Part 1): Creating Article Corpus from Simple Wikipedia dump

Part 2

Topic Modelling (Part 2): Discovering Topics from Articles with Latent Dirichlet Allocation

Word cloud (10 words) of a few of the topics I got as an outcome. [Image: topic word clouds]

Answered by Sara

It returns the probability that each word is associated with the given topic. By default, LDA shows you the top ten words :)
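
For reference, a sketch of asking for more (or fewer) words per topic (parameter names assume a reasonably recent gensim):

# Show all 5 topics with their top 10 words (the defaults, spelled out).
for topic_id, words in lda.print_topics(num_topics=5, num_words=10):
    print(topic_id, words)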