Python: removing stop words and document tokenization using NLTK

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17390326/

Date: 2020-08-19 08:05:08 | Source: igfitidea

Getting rid of stop words and document tokenization using NLTK

Tags: python, nltk, tokenize, stop-words

Asked by Tiger1

I'm having difficulty removing stop words from and tokenizing a .txt file using nltk. I keep getting the following error: AttributeError: 'list' object has no attribute 'lower'.

I just can't figure out what I'm doing wrong, although it's my first time doing something like this. Below are my lines of code. I'd appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords
    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)
    def cleanupDoc(s):
            stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup
    cleanupDoc(s)

Accepted answer by alvas

You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

And most probably you would also like to strip off punctuation; for that you can use string.punctuation, see http://docs.python.org/2/library/string.html:

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = stopwords.words('english') + list(string.punctuation)
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
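
Folding the same idea into the asker's cleanupDoc function, a minimal sketch (the file path is the one from the question; it assumes the NLTK stopwords and punkt data have already been downloaded via nltk.download):

    import string
    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        # treat stop words and punctuation marks alike as noise
        stopset = set(stopwords.words('english')) | set(string.punctuation)
        tokens = nltk.word_tokenize(s.lower())
        return [token for token in tokens if token not in stopset and len(token) > 2]

    # raw string so the backslashes in the Windows path are not read as escape sequences
    s = open(r"C:\zircon\sinbo1.txt").read()
    print(cleanupDoc(s))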

Answered by arturomp

From the error message, it seems like you're trying to convert a list, not a string, to lowercase. Your tokens = nltk.word_tokenize(s) is probably not returning what you expect (which seems to be a string).

It would be helpful to know what format your sinbo.txt file is in.
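
As a quick check, word_tokenize does return a list of strings rather than a single string (same interactive style as the accepted answer):

    >>> from nltk import word_tokenize
    >>> word_tokenize("this is a test sentence")
    ['this', 'is', 'a', 'test', 'sentence']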

A few syntax issues:

  1. Import should be in lowercase: import nltk

  2. The line s = open("C:\zircon\sinbo1.txt").read() reads the whole file in, not a single line at a time. This may be problematic because word_tokenize works on a single sentence, not an arbitrary sequence of tokens. As written, it assumes that your sinbo.txt file contains a single sentence. If it doesn't, you may want to either (a) use a for loop over the file instead of read(), or (b) run a sentence tokenizer (NLTK's Punkt tokenizer, e.g. nltk.sent_tokenize) over the text to split it into sentences first; see the sketch after this list.

  3. The first line of your cleanupDoc function is not properly indented. Your function should look like this (even if the functions within it change).

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        # keep lowercased tokens that are not stop words and are longer than 2 characters
        cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
        return cleanup
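
For option (b) in point 2, a minimal sketch that splits the file into sentences before tokenizing (assuming the punkt and stopwords data have been downloaded with nltk.download; the path is the one from the question):

    import nltk
    from nltk.corpus import stopwords

    stopset = set(stopwords.words('english'))
    text = open(r"C:\zircon\sinbo1.txt").read()

    cleaned = []
    for sentence in nltk.sent_tokenize(text):        # split the raw text into sentences
        for token in nltk.word_tokenize(sentence):   # then tokenize one sentence at a time
            if token.lower() not in stopset and len(token) > 2:
                cleaned.append(token.lower())
    print(cleaned)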
    

Answered by Shivam

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)  # tokenized form (not used below)
        # drop the stop words from the whitespace-split text and join it back together
        cleanup = " ".join(filter(lambda word: word not in stopset, s.split()))
        return cleanup

    s = "I am going to disco and bar tonight"
    tokens = nltk.word_tokenize(s)
    x = cleanupDoc(s)
    print(x)

This code can help in solving the above problem; with the sample sentence above it prints I going disco bar tonight. (Note that the capitalized I slips through because the stop-word list is all lowercase, so comparing word.lower() against stopset would be safer.)

Answered by Saahil

In your particular case the error is in cleanup = [token.lower() for token in tokens.lower() not in stopset and len(token)>2]

tokens is a list, so you cannot call tokens.lower() on it. Another way of writing the above code would be:

cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token)>2]
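
For reference, a small self-contained run of the corrected line (the sample sentence is made up for illustration; it assumes the stopwords and punkt data have been downloaded):

    import nltk
    from nltk.corpus import stopwords

    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize("This is a simple sample sentence for testing.")
    cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    print(cleanup)  # ['simple', 'sample', 'sentence', 'testing']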

I hope this helps.
