Stopword Removal with NLTK in Python
Original question: http://stackoverflow.com/questions/19130512/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Stopword removal with NLTK
Asked by Grahesh Parkar
I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but with stopword removal, words like 'and', 'or', and 'not' get removed. I want these words to remain after the stopword removal process, as they are operators required for later processing the text as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.
Accepted answer by otus
I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:
operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators
Then you can simply test whether a word is in (or not in) the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.
if word.lower() not in stop:
    # use word
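For example, a minimal end-to-end sketch of this idea, assuming the English stopword list and a simple whitespace-split query (the query string is just an illustration):

from nltk.corpus import stopwords

# operators we want to keep even though they appear in the stopword list
operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

query = "show me results with cats and not dogs"
kept = [word for word in query.split() if word.lower() not in stop]
print(kept)  # the operators 'and' and 'not' survive the filtering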
Answered by alvas
There is an in-built stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al.); see http://nltk.org/book/ch02.html
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop]
['foo', 'bar', 'sentence']
I recommend looking at using tf-idf to remove stopwords, see Effects of Stemming on the term frequency?
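A minimal sketch of that idea, assuming scikit-learn is available: terms with the lowest idf appear in the most documents, so they behave like corpus-specific stopwords.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is a foo bar sentence",
    "this is another foo sentence",
    "yet another bar sentence here",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# pair each term with its idf and keep the lowest-idf (most common) ones
idf_by_term = sorted(zip(vectorizer.idf_, vectorizer.get_feature_names_out()))
corpus_stopwords = [term for idf, term in idf_by_term[:3]]
print(corpus_stopwords)  # e.g. includes 'sentence', which occurs in every document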
Answered by Aamir Adnan
@alvas has a good answer. But again it depends on the nature of the task: if, for example, in your application you want to consider all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. the, a, some, most, every, no) as stop words, while treating all other parts of speech as legitimate, then you might want to look into this solution, which uses the part-of-speech tagset to discard words. Check table 5.1:
import nltk
STOP_TYPES = ['DET', 'CNJ']
text = "some data here "
tokens = nltk.pos_tag(nltk.word_tokenize(text))
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
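Note that 'DET' and 'CNJ' come from the simplified tagset used in the NLTK book's table 5.1; recent versions of nltk.pos_tag return Penn Treebank tags (e.g. 'DT', 'CC') by default, so the filter above may not match anything. A minimal sketch under the assumption that you use the universal tagset instead:

import nltk

# the universal tagset labels determiners 'DET' and conjunctions 'CONJ'
STOP_TYPES = {'DET', 'CONJ'}

text = "the cat and the dog sat on a mat"
tokens = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
print(good_words)  # determiners and the conjunction are dropped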
Answered by Salvador Dali
@alvas's answer does the job but it can be done way faster. Assuming that you have documents: a list of strings.
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) # remove it if you need punctuation
for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
Notice that because you are searching in a set (not in a list), the lookup would theoretically be len(stop_words)/2 times faster, which is significant if you need to operate on many documents.
For 5000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.
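If you want to see the set-versus-list difference yourself, a quick timing sketch using the standard-library timeit module (exact numbers will vary by machine):

import timeit
from nltk.corpus import stopwords

stop_list = stopwords.words('english')  # plain list: membership test scans the list
stop_set = set(stop_list)               # set: membership test is a single hash lookup

print(timeit.timeit("'zebra' in stop_list", globals=globals(), number=100000))  # slower
print(timeit.timeit("'zebra' in stop_set", globals=globals(), number=100000))   # faster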
P.S. In most cases you divide the text into words in order to perform other classification tasks for which tf-idf is used, so it would most probably be better to use a stemmer as well:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
and to use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside of the loop.
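Putting it together, a minimal sketch with a hypothetical documents list:

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

documents = ["The cats are running quickly", "A dog was barking loudly"]
processed = []
for doc in documents:
    # lowercase, drop stopwords, then stem what remains
    processed.append([porter.stem(i.lower())
                      for i in wordpunct_tokenize(doc)
                      if i.lower() not in stop_words])
print(processed)  # stemmed, stopword-free tokens per document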
Answered by UsmanZ
You can use string.punctuation together with the built-in NLTK stopwords list:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

def tokenize(text):
    # split into sentences, then words, flattening into a single token list
    sents = sent_tokenize(text)
    return [word for sent in sents for word in word_tokenize(sent)]

def removeStopWords(words):
    customStopWords = set(stopwords.words('english') + list(punctuation))
    return [word for word in words if word not in customStopWords]

words = tokenize(text)  # text is your input string
wordsWOStopwords = removeStopWords(words)
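For example, with a hypothetical input string (word_tokenize keeps punctuation as separate tokens, which the punctuation set then filters out; note that the stopword list is lowercase, so capitalised stopwords would slip through):

text = "this is a simple test, with punctuation!"
print(removeStopWords(tokenize(text)))  # something like ['simple', 'test', 'punctuation']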
NLTK stopwords complete list