Stopword Removal with NLTK in Python
Original question: http://stackoverflow.com/questions/19130512/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Stopword removal with NLTK
Asked by Grahesh Parkar
I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but with stopword removal, words like 'and', 'or', and 'not' get removed. I want these words to remain after the stopword removal process, as they are operators required for later processing the text as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.
Accepted answer by otus
I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:
operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators
Then you can simply test whether a word is in (or not in) the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.
if word.lower() not in stop:
    # use word
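For example, a minimal end-to-end sketch of this idea, assuming the English stopword list and a simple whitespace-split query (the query string is just an illustration):

from nltk.corpus import stopwords

# operators we want to keep even though they appear in the stopword list
operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

query = "show me results with cats and not dogs"
kept = [word for word in query.split() if word.lower() not in stop]
print(kept)  # the operators 'and' and 'not' survive the filtering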
Answered by alvas
There is an in-built stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al.); see http://nltk.org/book/ch02.html
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop]
['foo', 'bar', 'sentence']
I recommend looking at using tf-idf to remove stopwords, see Effects of Stemming on the term frequency?
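A minimal sketch of that idea, assuming scikit-learn is available: terms with the lowest idf appear in the most documents, so they behave like corpus-specific stopwords.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is a foo bar sentence",
    "this is another foo sentence",
    "yet another bar sentence here",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# pair each term with its idf and keep the lowest-idf (most common) ones
idf_by_term = sorted(zip(vectorizer.idf_, vectorizer.get_feature_names_out()))
corpus_stopwords = [term for idf, term in idf_by_term[:3]]
print(corpus_stopwords)  # e.g. includes 'sentence', which occurs in every document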
Answered by Aamir Adnan
@alvas has a good answer. But again it depends on the nature of the task: if, for example, in your application you want to consider all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. the, a, some, most, every, no) as stop words, while treating all other parts of speech as legitimate, then you might want to look into this solution, which uses the part-of-speech tagset to discard words. Check table 5.1:
import nltk
STOP_TYPES = ['DET', 'CNJ']
text = "some data here "
tokens = nltk.pos_tag(nltk.word_tokenize(text))
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
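Note that 'DET' and 'CNJ' come from the simplified tagset used in the NLTK book's table 5.1; recent versions of nltk.pos_tag return Penn Treebank tags (e.g. 'DT', 'CC') by default, so the filter above may not match anything. A minimal sketch under the assumption that you use the universal tagset instead:

import nltk

# the universal tagset labels determiners 'DET' and conjunctions 'CONJ'
STOP_TYPES = {'DET', 'CONJ'}

text = "the cat and the dog sat on a mat"
tokens = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
print(good_words)  # determiners and the conjunction are dropped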
Answered by Salvador Dali
@alvas's answer does the job but it can be done way faster. Assuming that you have documents: a list of strings.
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) # remove it if you need punctuation
for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
Notice that because you are searching in a set (not in a list), the lookup would theoretically be len(stop_words)/2 times faster, which is significant if you need to operate on many documents.
For 5000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.
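If you want to see the set-versus-list difference yourself, a quick timing sketch using the standard-library timeit module (exact numbers will vary by machine):

import timeit
from nltk.corpus import stopwords

stop_list = stopwords.words('english')  # plain list: membership test scans the list
stop_set = set(stop_list)               # set: membership test is a single hash lookup

print(timeit.timeit("'zebra' in stop_list", globals=globals(), number=100000))  # slower
print(timeit.timeit("'zebra' in stop_set", globals=globals(), number=100000))   # faster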
P.S. In most cases you divide the text into words in order to perform other classification tasks for which tf-idf is used, so it would most probably be better to use a stemmer as well:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
and to use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside of the loop.
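Putting it together, a minimal sketch with a hypothetical documents list:

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

documents = ["The cats are running quickly", "A dog was barking loudly"]
processed = []
for doc in documents:
    # lowercase, drop stopwords, then stem what remains
    processed.append([porter.stem(i.lower())
                      for i in wordpunct_tokenize(doc)
                      if i.lower() not in stop_words])
print(processed)  # stemmed, stopword-free tokens per document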
Answered by UsmanZ
You can use string.punctuation together with the built-in NLTK stopwords list:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

def tokenize(text):
    # split into sentences, then words, flattening into a single token list
    sents = sent_tokenize(text)
    return [word for sent in sents for word in word_tokenize(sent)]

def removeStopWords(words):
    customStopWords = set(stopwords.words('english') + list(punctuation))
    return [word for word in words if word not in customStopWords]

words = tokenize(text)  # text is your input string
wordsWOStopwords = removeStopWords(words)
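For example, with a hypothetical input string (word_tokenize keeps punctuation as separate tokens, which the punctuation set then filters out; note that the stopword list is lowercase, so capitalised stopwords would slip through):

text = "this is a simple test, with punctuation!"
print(removeStopWords(tokenize(text)))  # something like ['simple', 'test', 'punctuation']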
NLTK stopwords complete list