Python: print the 10 most frequently occurring words of a text, including and excluding stopwords

Question

Asked by user2064809

I got the question from here, with my changes. I have the following code:

import nltk

def content_text(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequently occurring words of a text, 1) including and 2) excluding stopwords?

Answer 1

Accepted answer by Padraic Cunningham

I'm not sure about the "is stopwords" test in the function; I imagine it needs to be "in". Either way, you can use a Counter dict with most_common(10) to get the 10 most frequent:

import nltk
from collections import Counter
from string import punctuation


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            # lowercase and strip trailing punctuation before testing membership
            words = [w.lower().rstrip(punctuation) for w in line.split()]
            # count the words in the line that are stopwords
            with_stp.update(w for w in words if w in stopwords)
            # count the non-empty words in the line that are not stopwords
            without_stp.update(w for w in words if w and w not in stopwords)
    # return the ten most common (word, count) pairs from each counter
    return with_stp.most_common(10), without_stp.most_common(10)


wth_stop, wthout_stop = content_text(...)

If you are passing in an nltk corpus object, such as the result of words(), just iterate over it:

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # count the word as a stopword
            with_stp.update([word])
        else:
            # count the word as a non-stopword
            without_stp.update([word])
    # return the ten most common words (without counts) from each counter
    return [k for k, _ in with_stp.most_common(10)], [y for y, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk words() method returns punctuation tokens as well, so the result may not be what you want.
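
The two functions above split the words into stopwords and non-stopwords. If "including stopwords" is meant as "over all words", one counter can cover every word and a second can skip the stopwords. The following is only a minimal sketch of that reading: top_words and DEMO_STOPWORDS are hypothetical names, and the tiny stopword set is an assumption that lets the example run without the NLTK data; in practice you would pass set(nltk.corpus.stopwords.words('english')).

```python
from collections import Counter

# Hypothetical, tiny stopword list so the sketch runs without NLTK data.
DEMO_STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def top_words(tokens, stopwords, n=10):
    """Return (top n over all words, top n over non-stopwords)."""
    # lowercase every token and drop tokens that contain no letter or digit
    # (that is, tokens made up purely of punctuation)
    words = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]
    including = Counter(words)
    excluding = Counter(w for w in words if w not in stopwords)
    return including.most_common(n), excluding.most_common(n)


tokens = "The cat and the dog saw the bird , and the bird saw the cat .".split()
incl, excl = top_words(tokens, DEMO_STOPWORDS, n=3)
print(incl)
print(excl)
```

Any iterable of tokens works here, so the result of nltk.corpus.inaugural.words('2009-Obama.txt') could be passed in place of the sample list.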

Answer 2

Answered by igorushi

nltk provides a FreqDist class for this:

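import nltk

# text is assumed to be a string holding the document
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

stopwords = set(nltk.corpus.stopwords.words('english'))
# lowercase before the lookup so capitalized stopwords are excluded too
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)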

To extract the 10 most common:

# most_common() returns a list of (word, count) pairs, so unpack the words
mostCommon = [word for word, _ in allWordDist.most_common(10)]
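
Note that most_common(n) returns a list of (word, count) tuples rather than a dict, so the words are pulled out by unpacking each pair. The sketch below uses collections.Counter, which FreqDist subclasses in NLTK 3, so that it runs without NLTK installed; the sample words are made up for illustration.

```python
from collections import Counter

# made-up sample data standing in for the tokenized text
dist = Counter(["data", "model", "data", "test", "data", "model"])

pairs = dist.most_common(2)          # list of (word, count) tuples
words = [word for word, _ in pairs]  # keep only the words
print(pairs)
print(words)
```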

Answer 3

Answered by prahlad

You can try this:

# allWordDist is assumed to be the FreqDist built in the previous answer
for word, frequency in allWordDist.most_common(10):
    print('%s;%d' % (word, frequency))
