Python: Confused about backslashes in regular expressions
Original question: http://stackoverflow.com/questions/33582162/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by tobmei05
I am confused by the backslash in regular expressions. Within a regex, a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash, this special meaning gets lost. In the regex-howto one can read:
"Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\."
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I would now expect print(re.search('\\d', '\d')) to match \d, but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
Accepted answer by Tom Karzes
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
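To make the first level concrete, here is a minimal sketch that counts the characters a few literals contain once Python has processed their escapes. The variable names are illustrative only. It is also worth knowing that recent Python versions warn about unrecognized escapes such as '\d' in ordinary string literals (a DeprecationWarning since 3.6, a SyntaxWarning since 3.12), which is one more reason not to depend on the pass-through behavior.

```python
# Each literal below is processed by Python before any regex code runs.
newline = '\n'        # one character: a line feed
backslash = '\\'      # one character: a single backslash
backslash_n = '\\n'   # two characters: a backslash followed by the letter n

print(len(newline), len(backslash), len(backslash_n))  # 1 1 2
```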
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'  # \\ becomes a single backslash, \t becomes a tab
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
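The difference between the two display forms can be sketched as follows. This assumes a string holding a backslash and a tab; the name s simply mirrors the example above.

```python
s = 'a\\b\tc'    # five characters: a, backslash, b, tab, c

print(s)         # shows the characters themselves; the tab appears as whitespace
print([s])       # ['a\\b\tc']  (the list shows the quoted, escaped form)
print(repr(s))   # 'a\\b\tc'    (the same form the interactive interpreter displays)
```

When in doubt about what a string really contains, len(s) or print(s) is a more direct check than reading the quoted form.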
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
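As a rough illustration of the two levels working together, the sketch below searches for a single literal backslash. The names text and pattern are made up for this example.

```python
import re

text = 'a\\b'       # three characters: a, backslash, b
pattern = '\\\\'    # Python stores two backslashes; re reads them as one literal backslash

match = re.search(pattern, text)
print(len(pattern))   # 2
print(match.span())   # (1, 2)
```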
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
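The equivalence is easy to check directly. The sketch below also shows the same idea applied to a pattern; the variable names are illustrative.

```python
import re

raw = r'a\b'        # raw string: the backslash is kept as written
escaped = 'a\\b'    # ordinary string: \\ collapses to one backslash

print(raw == escaped)   # True
print(len(raw))         # 3

# With a raw pattern, only the regex-level escaping remains to think about.
print(re.search(r'\\', raw).span())   # (1, 2)
```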
Answer by glglgl
Python's own string parsing (partially) gets in your way.
If you want to see what re sees, type
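print('\d')    # prints \d   (\d is not a Python escape, so the backslash is kept)
print('\\d')   # prints \d   (the string parser turns \\ into a single \)
print('\\\d')  # prints \\d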
on the Python command prompt. You see that \d and \\d both result in \d, the latter one being taken care of by the Python string parser.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d being seen by the re module.
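Applied to the original question, the raw-string suggestion might look like the sketch below. It assumes the goal is to find the two-character text \d (a backslash followed by d) in the subject string; subject and pattern are hypothetical names.

```python
import re

subject = r'\d'    # two characters: backslash, d
pattern = r'\\d'   # three characters; re reads \\ as one literal backslash, then d

m = re.search(pattern, subject)
print(m.span())    # (0, 2)
print(m.group())   # \d
```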
Answer by eric.mcgregor
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t, n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise the backslash is passed through and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print("[" + s + "]")  # s holds an actual tab character after preprocessing
[	]
>>> s = '\d'
>>> print("[" + s + "]")  # s still holds '\d' after preprocessing
[\d]
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with another '\'. Now when Python sees '\\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print("[" + s + "]")  # s holds a backslash followed by t after preprocessing
[\t]
After the Python interpreter takes a pass over both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t', 'a\\t')  # match is None
Here, match is None. Why? Let's look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2, however, is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a' followed by escape-t (a tab character) in the string 'a\t', and the two do not match.
To fix this we can tell the search() method not to interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t', 'a\\t')  # match contains 'a\t'
Again, let's look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the pattern as the literal text 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter NOT to preprocess the string.
Consider this.
>>> match = re.search(r'a\\t', 'a\\t')  # match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search() interprets the '\\' as the single character '\' and not as a meta-character, and thus matches String 2.
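The three calls discussed in this answer can be run side by side. The sketch below is one way to do that, assuming the subject is the three-character text a, backslash, t; the variable names are chosen for illustration.

```python
import re

subject = 'a\\t'   # three characters: a, backslash, t (no tab involved)

# Pattern text a\t: re reads \t as a tab, and the subject contains no tab.
no_match = re.search('a\\t', subject)

# Pattern text a\\t: re reads \\ as one literal backslash, followed by t.
escaped = re.search('a\\\\t', subject)

# The same pattern text, written as a raw string.
raw = re.search(r'a\\t', subject)

print(no_match)          # None
print(escaped.group())   # a\t
print(raw.group())       # a\t
```

The second and third calls pass the identical four-character pattern to re; the raw string only changes how that pattern is typed in the source code.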