python Python正则表达式匹配Unicode属性
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1832893/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Python regex matching Unicode properties
提问by ThomasH
Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll}
to match an arbitrary lower-case letter, or p{Zs}
for any space separator. I don't see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
Perl 和其他一些当前的正则表达式引擎在正则表达式中支持 Unicode 属性,例如类别。例如,在 Perl 中,您可以使用\p{Ll}
匹配任意小写字母或p{Zs}
任何空格分隔符。我在 Python 的 2.x 和 3.x 行中都没有看到对此的支持(对此表示遗憾)。有人知道获得类似效果的好策略吗?欢迎使用本土解决方案。
采纳答案by joeforker
Have you tried Ponyguruma, a Python binding to the Onigurumaregular expression engine? In that engine you can simply say \p{Armenian}
to match Armenian characters. \p{Ll}
or \p{Zs}
work too.
您是否尝试过Ponyguruma,一种绑定到Oniguruma正则表达式引擎的 Python ?在该引擎中,您可以简单地说\p{Armenian}
匹配亚美尼亚字符。\p{Ll}
或\p{Zs}
工作。
回答by ronnix
回答by zellyn
You can painstakingly use unicodedata on each character:
您可以煞费苦心地在每个字符上使用 unicodedata:
import unicodedata
def strip_accents(x):
return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
回答by mgibsonbr
Speaking of homegrown solutions, some time ago I wrote a small programto do just that - convert a unicode category written as \p{...}
into a range of values, extracted from the unicode specification(v.5.0.0). Only categories are supported (ex.: L
, Zs
), and is restricted to the BMP. I'm posting it here in case someone find it useful (although that Oniguruma really seems a better option).
说到自产的解决方案,前段时间我写了一个小程序来做到这一点 - 将编写为的 unicode 类别转换为\p{...}
从 unicode规范(v.5.0.0) 中提取的一系列值。仅支持类别(例如:L
、Zs
),并且仅限于 BMP。我把它贴在这里以防有人觉得它有用(尽管 Oniguruma 看起来确实是一个更好的选择)。
Example usage:
用法示例:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'á??_1+2').group(0)
á??_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
这是来源。还有一个JavaScript 版本,使用相同的数据。
回答by Jonathan Feinberg
You're right that Unicode property classes are not supported by the Python regex parser.
没错,Python 正则表达式解析器不支持 Unicode 属性类。
If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M}
or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M}
would become [\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
, and \P{M}
would become [^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
.
如果您想做一个不错的 hack,这通常很有用,您可以创建一个预处理器,它会扫描字符串中的此类标记(\p{M}
或其他)并将它们替换为相应的字符集,例如,\p{M}
将变为[\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
,并且\P{M}
会变成[^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
.
People would thank you. :)
人们会感谢你。:)
回答by tzot
Note that while \p{Ll}
has no equivalent in Python regular expressions, \p{Zs}
should be covered by '(?u)\s'
.
The (?u)
, as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \s
means any spacing character.
请注意,虽然\p{Ll}
在 Python 正则表达式中没有等效项,但\p{Zs}
应由'(?u)\s'
. 的(?u)
,作为文档说,“使\ W,\ W,\ B,\ B,\ d,\ d,\ S和\ S依赖于Unicode字符属性数据库”。和\s
表示任何间距字符。