How to remove punctuation in Python?

Disclaimer: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must keep the same license and attribute it to the original authors (not me): StackOverFlow. Original question: http://stackoverflow.com/questions/23317458/

Date: 2020-08-19 02:44:22  Source: igfitidea

How to remove punctuation?

python nlp nltk

Asked by user3534472

I am using the tokenizer from NLTK in Python.


There are already a whole bunch of answers on the forum for removing punctuation. However, none of them addresses all of the following issues together:


  1. More than one symbol in a row. For example, in the sentence He said,"that's it." there is a comma followed by a quotation mark, so the tokenizer won't remove ." from the sentence. The tokenizer will give ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Some other examples include '...', '--', '!?', ',"', and so on.
  2. Remove the symbol at the end of the sentence. E.g. for the sentence Hello World. the tokenizer will give ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Other examples include '--' or ',' at the beginning, middle, or end of a word.
  3. Remove characters with symbols in front and after, e.g. '*u*', '''', '""'
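For illustration (this demo is mine, not from the question), even a plain whitespace split of the example sentence shows the core difficulty — punctuation stays glued to the words, which is what all three issues above have in common:

```python
s = 'He said,"that\'s it."'

# A naive whitespace split leaves punctuation attached to the tokens.
print(s.split())  # ['He', 'said,"that\'s', 'it."']
```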

Is there an elegant way of solving all three problems?


Accepted answer by πόδας ὠκύς

If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.


One way to go about this, then, is to tokenize on gaps like so:


>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement

This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A". Furthermore, I only tokenize on single letters that begin and end with punctuation. Otherwise, "Go." would not return a token. You may need to nuance the regex in other ways, depending on what your data looks like and what your expectations are.

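If NLTK is not at hand, the same gap-based idea can be sketched with only the standard-library re module — this is a rough equivalent I wrote for illustration, not the answer's exact code, and the intermediate names are mine:

```python
import re

s = '''He said,"that's it." *u* Hello, World.'''

# First drop any single word character wedged between punctuation
# (this is what removes the u in *u*), ...
cleaned = re.sub(r'(?<=[^\w\s])\w(?=[^\w\s])', '', s)
# ... then keep only the remaining runs of word characters.
tokens = re.findall(r'\w+', cleaned)
print(tokens)  # ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
```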

Answered by alvas

Solution 1: Tokenize and strip punctuation off the tokens


>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
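The filter-then-strip step can be reproduced without calling word_tokenize, starting from the token list shown above; using a set for the membership test is my own tweak (a list works too, but a set makes the `in` check constant time):

```python
import string

# The output of word_tokenize(sent) as shown above.
tokens = ['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]

# Drop tokens that are pure punctuation, then strip any punctuation
# still clinging to the start or end of the surviving tokens.
punctuations = set(string.punctuation) | {"''"}
stripped = [t.strip(string.punctuation) for t in tokens if t not in punctuations]
print(stripped)  # ['He', 'said', 'that', 's', 'it']
```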

Solution 2: remove punctuation then tokenize


>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
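A compact variant of Solution 2 (my own sketch, not from the answer) does the replace-and-split in one step with a regular expression. Note that `[^\w\s]` is not identical to `string.punctuation`: it also catches Unicode punctuation such as curly quotes, which string.punctuation does not cover.

```python
import re

sent = '''He said,"that's it."'''

# Replace every character that is neither a word character nor
# whitespace with a space, then split on runs of whitespace.
tokens = re.sub(r'[^\w\s]', ' ', sent).split()
print(tokens)  # ['He', 'said', 'that', 's', 'it']
```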