Python: removing stop words and document tokenization using NLTK

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17390326/

Date: 2020-08-19 08:05:08 | Source: igfitidea

Getting rid of stop words and document tokenization using NLTK

Tags: python, nltk, tokenize, stop-words

Asked by Tiger1

I'm having difficulty removing stop words from and tokenizing a .txt file using nltk. I keep getting the following error: AttributeError: 'list' object has no attribute 'lower'.

I just can't figure out what I'm doing wrong, although it's my first time doing something like this. Below are my lines of code. I'd appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords
    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)
    def cleanupDoc(s):
            stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup
    cleanupDoc(s)

Accepted answer by alvas

You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

And most probably you would also like to strip off punctuation; for that you can use string.punctuation, see http://docs.python.org/2/library/string.html:

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = stopwords.words('english') + list(string.punctuation)
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
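
Folding the same idea into the asker's cleanupDoc function, a minimal sketch (the file path is the one from the question; it assumes the NLTK stopwords and punkt data have already been downloaded via nltk.download):

    import string
    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        # treat stop words and punctuation marks alike as noise
        stopset = set(stopwords.words('english')) | set(string.punctuation)
        tokens = nltk.word_tokenize(s.lower())
        return [token for token in tokens if token not in stopset and len(token) > 2]

    # raw string so the backslashes in the Windows path are not read as escape sequences
    s = open(r"C:\zircon\sinbo1.txt").read()
    print(cleanupDoc(s))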

Answered by arturomp

From the error message, it seems like you're trying to convert a list, not a string, to lowercase. Your tokens = nltk.word_tokenize(s) is probably not returning what you expect (which seems to be a string).

It would be helpful to know what format your sinbo.txt file is in.
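
As a quick check, word_tokenize does return a list of strings rather than a single string (same interactive style as the accepted answer):

    >>> from nltk import word_tokenize
    >>> word_tokenize("this is a test sentence")
    ['this', 'is', 'a', 'test', 'sentence']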

A few syntax issues:

  1. Import should be in lowercase: import nltk

  2. The line s = open("C:\zircon\sinbo1.txt").read() reads the whole file in, not a single line at a time. This may be problematic because word_tokenize works on a single sentence, not an arbitrary sequence of tokens. As written, it assumes that your sinbo.txt file contains a single sentence. If it doesn't, you may want to either (a) use a for loop over the file instead of read(), or (b) run a sentence tokenizer (NLTK's Punkt tokenizer, e.g. nltk.sent_tokenize) over the text to split it into sentences first; see the sketch after this list.

  3. The first line of your cleanupDoc function is not properly indented. Your function should look like this (even if the functions within it change).

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        # keep lowercased tokens that are not stop words and are longer than 2 characters
        cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
        return cleanup
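
For option (b) in point 2, a minimal sketch that splits the file into sentences before tokenizing (assuming the punkt and stopwords data have been downloaded with nltk.download; the path is the one from the question):

    import nltk
    from nltk.corpus import stopwords

    stopset = set(stopwords.words('english'))
    text = open(r"C:\zircon\sinbo1.txt").read()

    cleaned = []
    for sentence in nltk.sent_tokenize(text):        # split the raw text into sentences
        for token in nltk.word_tokenize(sentence):   # then tokenize one sentence at a time
            if token.lower() not in stopset and len(token) > 2:
                cleaned.append(token.lower())
    print(cleaned)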
    

Answered by Shivam

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)  # tokenized form (not used below)
        # drop the stop words from the whitespace-split text and join it back together
        cleanup = " ".join(filter(lambda word: word not in stopset, s.split()))
        return cleanup

    s = "I am going to disco and bar tonight"
    tokens = nltk.word_tokenize(s)
    x = cleanupDoc(s)
    print(x)

This code can help in solving the above problem; with the sample sentence above it prints I going disco bar tonight. (Note that the capitalized I slips through because the stop-word list is all lowercase, so comparing word.lower() against stopset would be safer.)

Answered by Saahil

In your particular case the error is in cleanup = [token.lower() for token in tokens.lower() not in stopset and len(token)>2]

tokens is a list, so you cannot call tokens.lower() on it. Another way of writing the above code would be:

cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token)>2]
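
For reference, a small self-contained run of the corrected line (the sample sentence is made up for illustration; it assumes the stopwords and punkt data have been downloaded):

    import nltk
    from nltk.corpus import stopwords

    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize("This is a simple sample sentence for testing.")
    cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    print(cleanup)  # ['simple', 'sample', 'sentence', 'testing']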

I hope this helps.
