Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/15016025/

Time: 2020-08-18 13:06:52  Source: igfitidea

How to print the LDA topics models from gensim? Python

python nlp lda topic-modeling gensim

Asked by alvas

Using gensim, I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA model?

Calling lda.print_topics(10) raised the following error, because print_topics() returns None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

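# I can print out the topics for LSA
# build the TF-IDF corpus that the LSI model is trained on
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus]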

for l,t in izip(corpus_lsi,corpus):
  print l,"#",t
print
for top in lsi.print_topics(2):
  print top

# I can print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]

for l,t in izip(corpus_lda,corpus):
  print l,"#",t
print

# But I am unable to print out the topics, how should i do it?
for top in lda.print_topics(10):
  print top

Answered by alvas

After some messing around, it seems that print_topics(numoftopics) for the ldamodel has a bug. So my workaround is to use print_topic(topicid):

>>> print lda.print_topics()
None
>>> for i in range(lda.num_topics):
...     print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...

Answered by zanbri

Are you using any logging? print_topics prints to the logfile, as stated in the docs.

As @mac389 says, lda.show_topics() is the way to go to print to screen.
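
If the topics seem to vanish, one likely reason is that print_topics writes through Python's logging module, and nothing is displayed until logging is configured. Here is a minimal sketch of that setup. It assumes that gensim names its loggers after its modules, so that "gensim" is their common parent, and it needs Python 3.8+ for the force argument:

```python
import logging

# Send INFO-level records to the console; force=True (Python 3.8+) replaces
# any handlers that were configured earlier in the session.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)

# Assumption: gensim names its loggers after its modules, so "gensim" is
# their common parent and inherits the INFO level set on the root logger.
gensim_logger = logging.getLogger("gensim")
topics_will_be_logged = gensim_logger.isEnabledFor(logging.INFO)
print(topics_will_be_logged)
```

With this in place, a later call such as lda.print_topics(10) should show up as INFO lines on the console instead of disappearing silently.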

Answered by Shirish Kumar

Here is sample code to print topics:

import pickle

from gensim import corpora, models

# Note: Python 2 syntax (print statement, iteritems).
def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file where I have lists of lists containing bag of words
    with open(filename, "rb") as f:
        texts = pickle.load(f)

    # generate dictionary (named so that the built-in dict is not shadowed)
    dictionary = corpora.Dictionary(texts)

    # remove words with low freq.  3 is an arbitrary number I have picked here
    low_occurrence_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq <= 3]
    dictionary.filter_tokens(low_occurrence_ids)
    dictionary.compactify()
    corpus = [dictionary.doc2bow(t) for t in texts]
    # Generate LDA Model
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)
    i = 0
    # We print the topics
    for topic in lda.show_topics(num_topics=numTopics, formatted=False, topn=20):
        i = i + 1
        print "Topic #" + str(i) + ":",
        # each entry pairs a probability with a word id stored as a string
        for p, word_id in topic:
            print dictionary[int(word_id)],

        print ""

Answered by user2597000

I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

For num_topics number of topics, return num_words most significant words (10 words per topic, by default).

The topics are returned as a list – a list of strings if formatted is True, or a list of (probability, word) 2-tuples if False.

If log is True, also output this result to log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.
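
Because the return shape differs between releases, code that reads the result may want to tolerate several layouts. The sketch below is one way to do that. The helper name words_of is hypothetical, and the shapes it handles are only the ones that appear in the answers on this page (plain formatted strings, lists of 2-tuples in either order, and entries wrapped as (topic_id, payload)):

```python
def words_of(topic):
    """Return the words of one show_topics() entry, whatever its shape."""
    # Some releases wrap each entry as (topic_id, payload); unwrap it first.
    if isinstance(topic, tuple) and len(topic) == 2 and isinstance(topic[1], (str, list)):
        topic = topic[1]
    if isinstance(topic, str):
        # formatted=True, e.g. '0.083*response + 0.083*"interface"'
        return [term.split("*", 1)[1].strip().strip('"') for term in topic.split(" + ")]
    # formatted=False: 2-tuples holding one number and one word, in either order
    return [a if isinstance(a, str) else b for a, b in topic]


# Hand-written sample entries that mimic the shapes seen on this page
print(words_of('0.083*response + 0.083*interface'))
print(words_of((0, '0.020*"business" + 0.018*"data"')))
print(words_of([(0.083, "response"), (0.083, "time")]))
```

In practice one would call it as [words_of(t) for t in lda.show_topics()], assuming lda is a trained model.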

Answered by xu2mao

You can use:

for i in lda_model.show_topics():
    print i[0], i[1]

Answered by Maneet

I recently came across a similar issue while working with Python 3 and Gensim 2.3.0. print_topics() and show_topics() weren't giving any error, but they also weren't printing anything. It turns out that show_topics() returns a list, so one can simply do:

# lda is the trained LdaModel instance
topic_list = lda.show_topics()
print(topic_list)

Answered by Feng Mai

You can also export the top words from each topic to a CSV file. topn controls how many words under each topic to export.

import pandas as pd

top_words_per_topic = []
for t in range(lda_model.num_topics):
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=5)])

# index=False keeps the file to the three columns shown below
pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv", index=False)

The CSV file has the following format:

Topic Word  P  
0     w1    0.004437  
0     w2    0.003553  
0     w3    0.002953  
0     w4    0.002866  
0     w5    0.008813  
1     w6    0.003393  
1     w7    0.003289  
1     w8    0.003197 
... 

Answered by Samuel Nde

I think it is always more helpful to see the topics as a list of words. The following code snippet helps achieve that goal. I assume you already have an LDA model called lda_model.

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

In the above code, I have decided to show the first 30 words belonging to each topic. For simplicity, I have shown only the first 2 topics I get.

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

I don't really like the way the above topics look, so I usually modify my code as shown:

for idx, topic in lda_model.show_topics(formatted=False, num_words= 30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))

... and the output (first 2 topics shown) will look like this:

Topic: 0 
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1 
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head

Answered by Shivom Sharma

This code works fine, but I want to know the topic name instead of Topic: 0 and Topic: 1. How do I know which topic a word belongs to?

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
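
LDA itself does not attach names to topics; each topic is only a weighted list of words. One possible workaround, sketched below, is to build a label from the top few words and to look up which topics list a given word. The helper names label_topics and topics_containing are hypothetical, and the weights in the sample are made-up placeholders standing in for lda_model.show_topics(formatted=False):

```python
def label_topics(topics, k=3):
    """Map each topic id to a label made of its k highest-ranked words."""
    return {topic_id: "_".join(word for word, _ in words[:k])
            for topic_id, words in topics}


def topics_containing(topics, word):
    """Return the ids of all topics whose word list includes the given word."""
    return [topic_id for topic_id, words in topics
            if any(w == word for w, _ in words)]


# Hand-written stand-in for lda_model.show_topics(formatted=False);
# the words come from the output above, the weights are placeholders.
sample_topics = [
    (0, [("associate", 0.03), ("incident", 0.02), ("time", 0.02), ("task", 0.01)]),
    (1, [("work", 0.04), ("associate", 0.03), ("cage", 0.02), ("aid", 0.01)]),
]

labels = label_topics(sample_topics)
print(labels)
print(topics_containing(sample_topics, "associate"))
```

A word can appear in several topics, which is why the lookup returns a list rather than a single id.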

Answered by Nikita sharma

Using Gensim to clean up its own topic format:

from gensim.parsing.preprocessing import (preprocess_string, strip_punctuation,
                                          strip_numeric)

lda_topics = lda.show_topics(num_words=5)

topics = []
filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

for topic in lda_topics:
    print(topic)
    topics.append(preprocess_string(topic[1], filters))

print(topics)

Output:

(0, '0.020*"business" + 0.018*"data" + 0.012*"experience" + 0.010*"learning" + 0.008*"analytics"')
(1, '0.027*"data" + 0.020*"experience" + 0.013*"business" + 0.010*"role" + 0.009*"science"')
(2, '0.026*"data" + 0.016*"experience" + 0.012*"learning" + 0.011*"machine" + 0.009*"business"')
(3, '0.028*"data" + 0.015*"analytics" + 0.015*"experience" + 0.008*"business" + 0.008*"skills"')
(4, '0.014*"data" + 0.009*"learning" + 0.009*"machine" + 0.009*"business" + 0.008*"experience"')


[
  ['business', 'data', 'experience', 'learning', 'analytics'], 
  ['data', 'experience', 'business', 'role', 'science'], 
  ['data', 'experience', 'learning', 'machine', 'business'], 
  ['data', 'analytics', 'experience', 'business', 'skills'], 
  ['data', 'learning', 'machine', 'business', 'experience']
]
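
The filters above discard the weights. If the weights are needed as well, a regular expression over the formatted string is one alternative. This is only a sketch: the helper name parse_topic is hypothetical, and it assumes the quoted 0.020*"word" layout shown in the output above:

```python
import re

# Matches one term such as 0.020*"business"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(formatted):
    """Turn a formatted topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(formatted)]


# Sample string copied from the first topic in the output above
sample = '0.020*"business" + 0.018*"data" + 0.012*"experience"'
pairs = parse_topic(sample)
print(pairs)
```

It could be applied to each entry as parse_topic(topic[1]) inside the loop over lda_topics.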