Python: print the 10 most frequently occurring words in a text, including and excluding stopwords

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/28392860/

Date: 2020-08-19 03:12:42  Source: igfitidea

Print the 10 most frequently occurring words of a text, including and excluding stopwords

python nltk word-frequency find-occurrences

Asked by user2064809

I took this question from here, with my changes. I have the following code:

import nltk

def content_text(text):
    # keeps only the words of text that are English stopwords
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequently occurring words of a text 1) including and 2) excluding stopwords?

Accepted answer by Padraic Cunningham

I'm not sure about the is stopwords in the function; I imagine it needs to be in. In any case, you can use a Counter dict with most_common(10) to get the 10 most frequent:

import nltk
from collections import Counter
from string import punctuation


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            # normalise each word once: lowercase and strip surrounding punctuation
            spl = [w.lower().strip(punctuation) for w in line.split()]
            # count the words in the line that are stopwords
            with_stp.update(w for w in spl if w in stopwords)
            # count the words in the line that are not stopwords (skipping empty strings)
            without_stp.update(w for w in spl if w and w not in stopwords)
    # return the ten most common (word, count) pairs from each counter
    return with_stp.most_common(10), without_stp.most_common(10)

wth_stop, wthout_stop = content_text(...)
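
The function above counts stopwords and non-stopwords separately. If "including stopwords" is instead read as "over all words", a minimal sketch could look like the following. The name top_words, the sample sentence, and the tiny STOPWORDS set are assumptions made so the example runs without downloading NLTK data; with NLTK installed, set(nltk.corpus.stopwords.words('english')) would take the place of STOPWORDS.

```python
from collections import Counter
from string import punctuation

# Hypothetical, deliberately tiny stopword set so the sketch runs without NLTK data.
# In practice: set(nltk.corpus.stopwords.words('english'))
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "on"}


def top_words(text, n=10, stopwords=STOPWORDS):
    """Return (top n over all words, top n excluding stopwords)."""
    # lowercase and strip surrounding punctuation; drop tokens that become empty
    words = [w.lower().strip(punctuation) for w in text.split()]
    words = [w for w in words if w]
    including = Counter(words)
    excluding = Counter(w for w in words if w not in stopwords)
    return including.most_common(n), excluding.most_common(n)


sample = "The cat sat on the mat. The dog and the cat slept."
with_stop, without_stop = top_words(sample, n=3)
print(with_stop)
print(without_stop)
```

Both results are lists of (word, count) pairs, so they can be printed directly or unpacked in a loop.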

If you are passing in an nltk file object, just iterate over it:

import nltk
from collections import Counter


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # count words that are stopwords
            with_stp[word] += 1
        else:
            # count words that are not stopwords
            without_stp[word] += 1
    # return the ten most common words (without counts) from each counter
    return [k for k, _ in with_stp.most_common(10)], [k for k, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk word list includes punctuation tokens, so that may not be what you want.
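
One possible way to keep punctuation tokens out of the counts is to filter the tokens before counting. This is only a sketch: drop_punctuation is a hypothetical helper, the token list is made up to resemble a tokenised corpus, and str.isalpha() is a blunt filter that also discards tokens such as "don't" or "2009".

```python
from collections import Counter


def drop_punctuation(tokens):
    """Keep only tokens made entirely of letters (an assumption, see the note above)."""
    return [t for t in tokens if t.isalpha()]


# Hypothetical token list shaped like what a tokenised corpus yields
tokens = ["My", "fellow", "citizens", ":", "I", "stand", "here", ",", "citizens", "."]
counts = Counter(t.lower() for t in drop_punctuation(tokens))
print(counts.most_common(1))
```

If contractions or numbers matter, a looser test (for example, checking that the token is not made up solely of characters from string.punctuation) may be a better fit.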

Answer by igorushi

There is a FreqDist class in nltk:

import nltk

# text is the string to analyse
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

# compare lowercased words so that capitalised stopwords are excluded too
stopwords = set(nltk.corpus.stopwords.words('english'))
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common:

# most_common returns a list of (word, count) pairs, so keep only the words
mostCommon = [word for word, _ in allWordDist.most_common(10)]

Answer by prahlad

You can try this:

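# allWordDist is the FreqDist built in the previous answer
for word, frequency in allWordDist.most_common(10):
    # Python 3: print() handles Unicode strings directly, so no .encode() is needed
    print('%s;%d' % (word, frequency))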
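
In Python 3, print() is a function that returns None, so chaining .encode() onto its result raises an AttributeError. A small sketch of the same word;count output follows, using collections.Counter as a stand-in for the frequency distribution so that it runs without NLTK. The helper name format_counts and the sample sentence are hypothetical.

```python
from collections import Counter


def format_counts(pairs, sep=";"):
    """Turn (word, count) pairs into 'word;count' lines."""
    return [f"{word}{sep}{count}" for word, count in pairs]


counts = Counter("to be or not to be".split())
for line in format_counts(counts.most_common(2)):
    print(line)
```

Building the lines first and printing them afterwards keeps the formatting separate from the output, which makes it easy to write the same lines to a file instead.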