Python regex - why does end of string ($ and \Z) not work with group expressions?
Original question: http://stackoverflow.com/questions/12763548/
Source: Stack Overflow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Piotr Migdal
In Python 2.6, it seems that the end-of-string markers $ and \Z are not compatible with group expressions. For example,
import re
re.findall(r"\w+[\s$]", "green pears")
returns
['green ']
(so $ effectively does not work). And using
re.findall(r"\w+[\s\Z]", "green pears")
results in an error:
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in findall(pattern, string, flags)
175
176 Empty matches are included in the result."""
--> 177 return _compile(pattern, flags).findall(string)
178
179 if sys.hexversion >= 0x02020000:
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in _compile(*key)
243 p = sre_compile.compile(pattern, flags)
244 except error, v:
--> 245 raise error, v # invalid expression
246 if len(_cache) >= _MAXCACHE:
247 _cache.clear()
error: internal: unsupported set operator
Why does it work that way, and how can I work around it?
Accepted answer by Martijn Pieters
A [..] expression is a character group, meaning it'll match any one character contained therein. You are thus matching a literal $ character. A character group always applies to one input character, and thus can never contain an anchor.
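A minimal sketch of that point, using a made-up second input string ("cost$ up") that is not from the question: inside the brackets, $ only ever matches a literal dollar sign, so a word at the end of the string is not matched.

```python
import re

# Inside [...], "$" is a literal dollar sign, not an end-of-string anchor.
pattern = re.compile(r"\w+[\s$]")

# "pears" is dropped: no whitespace or literal "$" follows it.
no_dollar = pattern.findall("green pears")
print(no_dollar)    # ['green ']

# Here the set matches the literal "$" after "cost"; "up" is still dropped.
with_dollar = pattern.findall("cost$ up")
print(with_dollar)  # ['cost$']
```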
If you wanted to match either a whitespace character or the end of the string, use a non-capturing group instead, combined with the | "or" selector:
r"\w+(?:\s|$)"
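As a quick check of the pattern above against the question's input, the following sketch runs it with both $ and \Z inside the alternation. The variable names are illustrative only.

```python
import re

text = "green pears"

# Anchors are allowed inside an alternation, unlike inside [...].
with_dollar = re.findall(r"\w+(?:\s|$)", text)
with_z = re.findall(r"\w+(?:\s|\Z)", text)

print(with_dollar)  # ['green ', 'pears']
print(with_z)       # ['green ', 'pears']
```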
Alternatively, look at the \b word boundary anchor. It'll match anywhere a \w group starts or ends (that is, at any point in the text where a \w character is preceded or followed by a \W character, or sits at the start or end of the string).
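A short sketch of the \b variant on the same input. Note that \b is zero-width, so the trailing whitespace is not part of each match, which may or may not be what you want.

```python
import re

# \b matches the position where a run of \w characters ends,
# including the end of the string, without consuming any character.
words = re.findall(r"\w+\b", "green pears")
print(words)  # ['green', 'pears']
```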
Answer by BrenBarn
Square brackets don't indicate a group; they indicate a character set, which matches one character (any one of those in the brackets). As documented, "special characters lose their special meaning inside sets" (except where indicated otherwise, as with classes like \s).
If you want to match \s or end of string, use something like \s|$.
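To illustrate the quoted rule, here is a small sketch with made-up input strings: metacharacters inside a set match themselves, while a class such as \s keeps its meaning.

```python
import re

# ".", "$", "*" and "+" are ordinary characters inside a set.
literals = re.findall(r"[.$*+]", "a.b$c*d+e")
print(literals)  # ['.', '$', '*', '+']

# \s still means "any whitespace" inside a set; "$" is still literal.
space_or_dollar = re.findall(r"[\s$]", "a b$")
print(space_or_dollar)  # [' ', '$']
```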
Answer by Junji Zhi
Martijn Pieters' answer is correct. To elaborate a bit, if you use capturing groups
r"\w+(\s|$)"
you get:
>>> re.findall(r"\w+(\s|$)", "green pears")
[' ', '']
That's because re.findall() returns the values of the captured group (\s|$) rather than the whole match.
Parentheses () serve two purposes: grouping and capturing. To group without capturing, use the (?:...) syntax:
>>> re.findall(r"\w+(?:\s|$)", "green pears")
['green ', 'pears']
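If you would rather keep the capturing group, one possible alternative (not mentioned in the answers above) is re.finditer(), which yields match objects so you can read the whole match and the group separately. A minimal sketch:

```python
import re

matches = list(re.finditer(r"\w+(\s|$)", "green pears"))

# group(0) is the entire match; group(1) is what the parentheses captured.
whole = [m.group(0) for m in matches]
captured = [m.group(1) for m in matches]

print(whole)     # ['green ', 'pears']
print(captured)  # [' ', '']
```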

