Python regex: remove all punctuation except hyphens from a unicode string

Question

Asked by John

I have this code for removing all punctuation from a unicode string:

import regex as re
re.sub(ur"\p{P}+", "", txt)  # ur"" is Python 2 only; in Python 3 write r"\p{P}+"

How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I'm wrong, P with anything after it is punctuation.

Answer 1

Accepted answer by Kobi

[^\P{P}-]+

\P is the complement of \p: it matches anything that is not punctuation. So this class matches anything that is not (not punctuation or a dash), which leaves all punctuation except dashes.

Example: http://www.rubular.com/r/JsdNM3nFJ3
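
The double negation in [^\P{P}-] can be easier to see when spelled out as plain set logic. The following is a minimal sketch that uses only the standard library's unicodedata module instead of the regex module; the helper names is_punct and strip_punct_keep_hyphen are hypothetical, and it assumes that general categories starting with "P" correspond to what \p{P} matches.

```python
import unicodedata

def is_punct(ch):
    # Unicode general categories starting with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po)
    # are the punctuation categories.
    return unicodedata.category(ch).startswith("P")

def strip_punct_keep_hyphen(text):
    # Keep a character if it is "not punctuation, or a hyphen".
    # Everything else (punctuation that is not a hyphen) is dropped,
    # which is the set that [^\P{P}-] matches.
    return "".join(ch for ch in text if not is_punct(ch) or ch == "-")

print(strip_punct_keep_hyphen("«well-known», she said… “yes”!"))
# well-known she said yes
```

Note that characters such as $, + and ^ are classified as symbols rather than punctuation, so neither this sketch nor \p{P} removes them.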

If you want a non-convoluted way, an alternative is \p{P}(?<!-): match all punctuation, and then check it wasn't a dash (using negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
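
The lookbehind idea can also be tried with the standard re module. This is only a sketch under a stated assumption: re does not support \p{P}, so [^\w\s] is swapped in as a rough stand-in. Unlike \p{P}, that class also matches symbols such as $ and +, and it does not match the underscore. The name strip_with_lookbehind is hypothetical.

```python
import re

# Assumption: [^\w\s] stands in for \p{P}, which the standard re module lacks.
PATTERN = re.compile(r"[^\w\s](?<!-)")

def strip_with_lookbehind(text):
    # The class consumes one candidate character; the negative lookbehind
    # then looks back at that same character and rejects the match if it
    # was a hyphen.
    return PATTERN.sub("", text)

print(strip_with_lookbehind("a-b, c! (d-e)."))
# a-b c d-e
```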

Answer 2

Answered by Cu3PO42

You could either specify the punctuation you want to remove manually, as in [._,], or supply a function instead of the replacement string:

re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
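
The function-as-replacement approach extends naturally to keeping more than one character. Below is a small sketch with the standard re module; the KEEP set and the function names are hypothetical, and [^\w\s] is an assumed stand-in for \p{P} (it also matches symbols and skips the underscore).

```python
import re

# Hypothetical set of characters that should survive the cleanup.
KEEP = {"-", "'"}

def keep_or_drop(match):
    # re.sub calls this once per match and inserts whatever it returns.
    ch = match.group(0)
    return ch if ch in KEEP else ""

def strip_punct(text):
    # Assumption: [^\w\s] approximates \p{P} in the standard re module.
    return re.sub(r"[^\w\s]", keep_or_drop, text)

print(strip_punct("don't stop - go, now!"))
# don't stop - go now
```

A callable costs one Python function call per match, so for large inputs a pattern that excludes the hyphen directly is likely to be faster.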

Answer 3

Answered by Galen Long

Here's how to do it with the re module, in case you have to stick with the standard library. Note that string.punctuation covers ASCII punctuation only:

# works in python 2 and 3
import re
import string

remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
# escape the characters so that "\" and "]" are treated as literals inside the class
pattern = r"[{}]".format(re.escape(remove))

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'

If performance matters, you may want to use str.translate, which is generally faster than a regex for plain character removal. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
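
As a runnable illustration of the str.translate route, here is a minimal Python 3 sketch. It builds the table with str.maketrans, which is equivalent to the dictionary comprehension above; the function name strip_ascii_punct is hypothetical.

```python
import string

# ASCII punctuation with the hyphen taken out, as in the re example.
remove = string.punctuation.replace("-", "")

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", remove)

def strip_ascii_punct(text):
    return text.translate(table)

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
print(strip_ascii_punct(txt))
# this - is - a - test
```

Because string.punctuation is ASCII-only, non-ASCII punctuation such as « or … passes through unchanged with this approach.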
