python Python正则表达式匹配Unicode属性

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/1832893/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-11-03 23:10:25  来源:igfitidea点击:

Python regex matching Unicode properties

pythonregexunicodeucdcharacter-properties

提问by ThomasH

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll}to match an arbitrary lower-case letter, or p{Zs}for any space separator. I don't see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

Perl 和其他一些当前的正则表达式引擎在正则表达式中支持 Unicode 属性,例如类别。例如,在 Perl 中,您可以使用\p{Ll}匹配任意小写字母或p{Zs}任何空格分隔符。我在 Python 的 2.x 和 3.x 行中都没有看到对此的支持(对此表示遗憾)。有人知道获得类似效果的好策略吗?欢迎使用本土解决方案。

采纳答案by joeforker

Have you tried Ponyguruma, a Python binding to the Onigurumaregular expression engine? In that engine you can simply say \p{Armenian}to match Armenian characters. \p{Ll}or \p{Zs}work too.

您是否尝试过Ponyguruma,一种绑定到Oniguruma正则表达式引擎的 Python ?在该引擎中,您可以简单地说\p{Armenian}匹配亚美尼亚字符。\p{Ll}\p{Zs}工作。

回答by ronnix

The regexmodule (an alternative to the standard remodule) supports Unicode codepoint properties with the \p{}syntax.

正则表达式模块(一个替代标准re模块)支持与Unicode的码点属性\p{}的语法。

回答by zellyn

You can painstakingly use unicodedata on each character:

您可以煞费苦心地在每个字符上使用 unicodedata:

import unicodedata

def strip_accents(x):
    return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')

回答by mgibsonbr

Speaking of homegrown solutions, some time ago I wrote a small programto do just that - convert a unicode category written as \p{...}into a range of values, extracted from the unicode specification(v.5.0.0). Only categories are supported (ex.: L, Zs), and is restricted to the BMP. I'm posting it here in case someone find it useful (although that Oniguruma really seems a better option).

说到自产的解决方案,前段时间我写了一个小程序来做到这一点 - 将编写为的 unicode 类别转换为\p{...}从 unicode规范(v.5.0.0) 中提取的一系列值。仅支持类别(例如:LZs),并且仅限于 BMP。我把它贴在这里以防有人觉得它有用(尽管 Oniguruma 看起来确实是一个更好的选择)。

Example usage:

用法示例:

>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'á??_1+2').group(0)
á??_1
>>>

Here's the source. There is also a JavaScript version, using the same data.

这是来源。还有一个JavaScript 版本,使用相同的数据。

回答by Jonathan Feinberg

You're right that Unicode property classes are not supported by the Python regex parser.

没错,Python 正则表达式解析器不支持 Unicode 属性类。

If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M}or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M}would become [\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F], and \P{M}would become [^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F].

如果您想做一个不错的 hack,这通常很有用,您可以创建一个预处理器,它会扫描字符串中的此类标记(\p{M}或其他)并将它们替换为相应的字符集,例如,\p{M}将变为[\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F],并且\P{M}会变成[^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F].

People would thank you. :)

人们会感谢你。:)

回答by tzot

Note that while \p{Ll}has no equivalent in Python regular expressions, \p{Zs}should be covered by '(?u)\s'. The (?u), as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \smeans any spacing character.

请注意,虽然\p{Ll}在 Python 正则表达式中没有等效项,但\p{Zs}应由'(?u)\s'. 的(?u),作为文档说,“使\ W,\ W,\ B,\ B,\ d,\ d,\ S和\ S依赖于Unicode字符属性数据库”。和\s表示任何间距字符。