Python: How to get rid of punctuation using the NLTK tokenizer?
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you need to use it, you must do so under the same CC BY-SA license, credit the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/15547409/
How to get rid of punctuation using NLTK tokenizer?
Asked by lizarisk
I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also, word_tokenize doesn't work with multiple sentences: dots are added to the last word.
Answered by palooh
As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have Unicode strings, make sure each is a unicode object (not a 'str' encoded with some encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
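A more general variant (a minimal sketch, assuming string.punctuation covers the characters you want to drop) filters out every token that appears in string.punctuation instead of just ',' and '-':

import string
from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
# tokenize sentence by sentence, then drop tokens that are single punctuation marks
tokens = [word for sent in sent_tokenize(text)
          for word in word_tokenize(sent)
          if word not in string.punctuation]
print(tokens)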
Answered by rmalouf
Take a look at the other tokenizing options that nltk provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
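If you want hyphenated words or contractions to stay in one piece, a pattern such as r"\w+(?:[-']\w+)*" is one option (a sketch; adjust the regular expression to your own definition of a word):

from nltk.tokenize import RegexpTokenizer

# words may contain internal hyphens or apostrophes
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*")
print(tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!'))
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']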
Answered by vish
I just used the following code, which removed all the punctuation:
import nltk

# `raw` is your input text
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text if w.isalpha()]
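Keep in mind that wordpunct_tokenize splits on every word/punctuation boundary, so contractions come apart; a quick check (just a sketch):

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("I can't do this now."))
# ['I', 'can', "'", 't', 'do', 'this', 'now', '.']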
Answered by zhenv5
I use this code to remove punctuation:
import nltk

def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
And if you want to check whether a token is a valid English word or not, you may need PyEnchant.
Tutorial:
import enchant
d = enchant.Dict("en_US")
d.check("Hello")
d.check("Helo")
d.suggest("Helo")
Answered by Salvador Dali
You do not really need NLTK to remove punctuation. You can remove it with plain Python. For Python 2 strings:
import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)
Or for unicode:
import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s.translate(translate_table)
and then use this string in your tokenizer.
P.S. The string module has some other sets of elements that can be removed (like digits).
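On Python 3, where the two-argument form of str.translate no longer exists, the equivalent is str.maketrans (a short sketch):

import string

s = '... some string with punctuation ...'
# map every punctuation character to None and apply the table
s = s.translate(str.maketrans('', '', string.punctuation))
print(s)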
Answered by Quan Gan
I think you need some sort of regular expression matching (the following code is in Python 3):
import string
import re
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)
Output:
['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which can't be obtained from regex tokenizers such as wordpunct_tokenize.
Answered by Madura Pradeep
The code below will remove all punctuation marks as well as non-alphabetic characters. It is copied from their book:
http://www.nltk.org/book/ch01.html
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"
words = nltk.word_tokenize(s)
words=[word.lower() for word in words if word.isalpha()]
print(words)
Output:
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
Answered by ascii_walker
Remove punctuation (this removes '.' as well, as part of the punctuation handling in the code below):
import sys, unicodedata
from nltk.tokenize import word_tokenize

tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string doesn't have punctuation anymore
w = word_tokenize(text_string)  # now tokenize the string
Sample Input/Output:
direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
Answered by Bora M. Alper
Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be destroyed into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.
Hence the solution is to tokenise and then remove punctuation tokens.
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']
...and then, if you wish, you can replace certain tokens such as 'm with am.
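A minimal sketch of that replacement step (the mapping below is just an illustrative assumption; extend it to whatever contractions you care about):

contractions = {"'m": "am", "n't": "not", "'re": "are"}
tokens = [contractions.get(token, token) for token in tokens]
# ['I', 'am', 'a', 'southern', 'salesman']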
Answered by Himanshu Aggarwal
Just adding to the solution by @rmalouf: this will not include any numbers, because \w+ is equivalent to [a-zA-Z0-9_], whereas [a-zA-Z]+ matches letters only.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

