Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/21209024/

Time: 2020-08-18 22:18:25  Source: igfitidea

Python regex, remove all punctuation except hyphen for unicode string

Tags: python, regex, string

Asked by John

I have this code for removing all punctuation from a Unicode string, using the regex module:

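import regex as re  # third-party regex module; the standard-library re has no \p{P}
re.sub(r"\p{P}+", "", txt)  # in Python 2 the pattern was written ur"\p{P}+"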

How would I change it to allow hyphens? If you could explain how you did it, that would be great. As I understand it (correct me if I'm wrong), P followed by anything is punctuation.

Accepted answer by Kobi

[^\P{P}-]+

\P is the complement of \p: it matches anything that is not punctuation. So this class matches anything that is not (not punctuation or a dash), which leaves all punctuation except dashes.

Example: http://www.rubular.com/r/JsdNM3nFJ3
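
The double negation can be easier to see when spelled out as plain set logic. The following is only a rough standard-library sketch, not the regex module itself: it assumes that unicodedata's P* general categories line up with what \p{P} matches, and the helper name strip_punct_keep_hyphen is made up for illustration.

```python
import unicodedata

def strip_punct_keep_hyphen(text):
    """Drop every character in a Unicode P* category except the ASCII hyphen."""
    return "".join(
        ch for ch in text
        # Keep the hyphen explicitly; keep everything that is not punctuation.
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )

sample = "«Hello», world - it's a well-known test…"
print(strip_punct_keep_hyphen(sample))  # Hello world - its a well-known test
```

Note that characters such as $ and + belong to the symbol categories (S*) rather than punctuation, so neither this sketch nor \p{P} removes them.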

If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then check that the character just matched was not a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
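
The same lookbehind trick can be tried with the standard-library re module. This is a sketch under one loud assumption: since re does not support \p{P}, an ASCII-only class built from string.punctuation stands in for it, so non-ASCII punctuation is left untouched here.

```python
import re
import string

# Assumption: ASCII punctuation stands in for \p{P}, which stdlib re lacks.
punct_class = "[" + re.escape(string.punctuation) + "]"
# Match one punctuation character, then reject it if that character was a hyphen.
pattern = re.compile(punct_class + r"(?<!-)")

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
result = pattern.sub("", txt)
print(result)  # this - is - a - test
```

The lookbehind runs after the class has consumed a character, so it inspects that same character and vetoes the match when it is a hyphen.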

Answer by Cu3PO42

You could either specify the punctuation you want to remove manually, as in [._,], or supply a function instead of the replacement string:

re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)

Answer by Galen Long

Here's how to do it with the re module, in case you have to stick with the standard library. Note that string.punctuation contains only ASCII punctuation:

# works in Python 2 and 3
import re
import string

remove = string.punctuation
remove = remove.replace("-", "")  # don't remove hyphens
# re.escape keeps characters such as ], \ and ^ literal inside the character class
pattern = r"[{}]".format(re.escape(remove))  # create the pattern

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'

If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
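
A minimal Python 3 sketch of the str.translate route, reusing the sample string from the answer above. It uses str.maketrans with a third argument, which builds the same kind of table as the dict comprehension; the variable names table and cleaned are only illustrative.

```python
import string

# Characters to delete: ASCII punctuation minus the hyphen.
remove = string.punctuation.replace("-", "")
# The third argument of str.maketrans lists characters to map to None (delete).
table = str.maketrans("", "", remove)

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
cleaned = txt.translate(table)
print(cleaned)  # this - is - a - test
```

Building the table once and reusing it across many strings is where this approach tends to pay off.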
