Python: What exactly is a "raw string regex" and how can you use it?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/12871066/

Time: 2020-08-18 12:03:35  Source: igfitidea

python regex python-module rawstring

Asked by temporary_user_name

From the Python documentation on regex, regarding the '\' character:

The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

What is this raw string notation? If you use a raw string format, does that mean "*" is taken as a literal character rather than a zero-or-more indicator? That obviously can't be right, or else regex would completely lose its power. But then if it's a raw string, how does it recognize newline characters if "\n" is literally a backslash and an "n"?

I don't follow.

Edit for bounty:

I'm trying to understand how a raw string regex matches newlines, tabs, and character sets, e.g. \w for words or \d for digits or all whatnot, if raw string patterns don't recognize backslashes as anything more than ordinary characters. I could really use some good examples.

Accepted answer by Jim DeLaHunt

Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.

You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representations of a particular string in Python source code, and what regular expression that string represents.

In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.

First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.

Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:

s = chr(92)+chr(110)
print len(s), s

2 \n

The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."

s = "\n"
print len(s), s

1 

(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)

To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:

s = "\\n"
print len(s), s

2 \n

What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:

s = r"\n"
print len(s), s

2 \n

So we have three different string representations, all giving the same string, or sequence of characters:

print chr(92)+chr(110) == "\\n" == r"\n"
True

Now, let's turn to regular expressions. The Python docs, 7.2. re: Regular expression operations, say, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals..."

If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:

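import re

prog = re.compile(chr(92)+chr(110))  # built from character codes
prog = re.compile("\\n")             # conventional literal, backslash escaped
prog = re.compile(r"\n")             # raw string literal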
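
As a quick check of that claim, here is a minimal sketch. It is written for Python 3 (where print is a function) rather than the Python 2.7 used in the answer, and the name patterns is only illustrative. It builds the pattern string in the three ways listed and confirms that each compiled pattern matches a newline:

```python
import re

# Three spellings of the same two-character pattern string: backslash, n.
patterns = [chr(92) + chr(110), "\\n", r"\n"]

for p in patterns:
    prog = re.compile(p)
    # Each pattern string has length 2 and matches a single newline character.
    print(len(p), prog.fullmatch("\n") is not None)

# The pattern does not match the two-character text backslash-n.
print(re.fullmatch(r"\n", "\\n"))
```

The last line prints None: the regular expression system reads backslash-n as "newline", so it does not match the literal two characters.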

So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice when the regular expression includes a backslash character.

Questions

Q: What about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.

s = r"\s\tWord"
prog = re.compile(s)

The string s contains eight characters: a backslash, an s, a backslash, a t, and then the four characters Word.
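
The character count can be checked directly. This is a small sketch in Python 3 syntax (not the answer's Python 2.7), using only built-ins:

```python
s = r"\s\tWord"

# The raw literal keeps both backslashes, so the string has 8 characters.
print(len(s))
print(list(s))

# The same string written as a conventional literal needs doubled backslashes.
print(s == "\\s\\tWord")

# At the Python level the string contains neither a tab nor a space.
print("\t" in s, " " in s)
```

The output shows a length of 8, a list that begins with a backslash and an s, True for the comparison, and False False for the membership checks.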

Q: What happens to the tab and space characters? A: At the Python language level, string s doesn't have tab and space characters. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".
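
To see the regular expression language's reading of that string, here is a short sketch in Python 3 syntax; the sample texts are made up for illustration:

```python
import re

prog = re.compile(r"\s\tWord")

# \s accepts any whitespace character; \t must be a tab.
print(bool(prog.match(" \tWord")))      # space, tab, Word
print(bool(prog.match("\n\tWord")))     # newline counts as whitespace
print(bool(prog.match("\t\tWord")))     # a tab is whitespace too
print(bool(prog.match("  Word")))       # second character is not a tab
print(bool(prog.match("\\s\\tWord")))   # literal backslashes in the text
```

The first three calls print True and the last two print False: the pattern describes whitespace and a tab, not the backslash characters it is written with.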

Q: How do you match those if that's being treated as backslash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As 'any whitespace character' and as 'tab character'.

Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as the conventional string "\\n\n", or as a raw plus a conventional string r"\n" "\n", or in other ways. The regular expression system treats the 3-character string backslash-n-newline as a pattern that matches any two consecutive newline characters.
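
A brief sketch of this case, again in Python 3 syntax with illustrative variable names:

```python
import re

conventional = "\\n\n"   # escaped backslash, n, then a real newline
mixed = r"\n" "\n"       # adjacent literals are concatenated by Python

# Both spellings give the same 3-character string.
print(conventional == mixed, len(conventional))

prog = re.compile(conventional)
print(bool(prog.fullmatch("\n\n")))    # two consecutive newlines
print(bool(prog.fullmatch("\\n\n")))   # backslash, n, newline
```

The pattern matches two newlines (True) but not its own three characters (False), because the regular expression system reads backslash-n as a newline and the trailing newline as a literal newline.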

N.B. All examples and document references are to Python 2.7.

Update: Incorporated clarifications from answers of @Vladislav Zorov and @m.buettner, and from follow-up question of @Aerovistae.

Answer by Zarkonnen

The issue with using a normal string to write regexes that contain a \ is that you end up having to write \\ for every \. So the string literals "stuff\\things" and r"stuff\things" produce the same string. Raw strings get especially useful if you want to write a regular expression that matches against backslashes.
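
A small sketch of that equivalence in Python 3 syntax. The third literal, with the backslash left undoubled, is added here only to show what goes wrong:

```python
escaped = "stuff\\things"   # backslash doubled in a normal literal
raw = r"stuff\things"       # raw literal, backslash typed once

print(escaped == raw, len(raw))

# Forgetting to double the backslash turns \t into a tab character.
forgot = "stuff\things"
print(forgot == raw, len(forgot))
```

The first line prints True 12 and the second False 11, since the tab replaces both the backslash and the t.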

Using normal strings, a regexp that matches the string \ would be "\\\\"!

Why? Because we have to escape \ twice: once for the regular expression syntax, and once for the string syntax.
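
The double escaping can be seen in a short sketch (Python 3 syntax; the sample text is made up):

```python
import re

normal = "\\\\"   # four typed backslashes give a two-character string
raw = r"\\"       # two typed backslashes give the same string
print(normal == raw, len(raw))

text = "C:\\temp"   # the 7-character text C:\temp
# The two-character pattern matches one literal backslash in the text.
print(len(text), re.findall(raw, text))
```

Both literals hand the regular expression system the two characters backslash-backslash, which it reads as "one literal backslash".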

You can use triple quotes to include newlines, like this:

r'''stuff\
things'''

Note that usually, Python would treat \-newline as a line continuation, but this is not the case in raw strings. Also note that backslashes still escape quotes in raw strings, but are themselves left in the string. So the raw string literal r"\"" produces the string \". This means you can't end a raw string literal with a backslash.
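
These rules can be checked with a few lines. This is a sketch in Python 3 syntax with illustrative variable names:

```python
# A backslash-escaped quote in a raw string keeps the backslash.
s = r"\""
print(len(s), s)

# In a raw triple-quoted string, backslash-newline stays in the string.
multi = r'''stuff\
things'''
print(len(multi), multi == "stuff\\\nthings")

# In a normal triple-quoted string, backslash-newline is a line continuation.
cooked = '''stuff\
things'''
print(cooked)
```

The raw version has 13 characters (stuff, backslash, newline, things), while the normal version collapses to stuffthings.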

See the lexical analysis section of the Python documentation for more information.

Answer by Vladislav Zorov

You seem to be struggling with the idea that a RegEx isn't part of Python, but instead a different programming language with its own parser and compiler. Raw strings help you get the "source code" of a RegEx safely to the RegEx parser, which will then assign meaning to character sequences like \d, \w, \n, etc...

The issue exists because Python and RegExps both use \ as the escape character, which is, by the way, a coincidence: there are languages with other escape characters (like "`n" for a newline, but even there you have to use "\n" in RegExps). The advantage is that you don't need to differentiate between raw and non-raw strings in these languages; the language and the RegEx engine won't both try to convert the text and butcher it, because they react to different escape sequences.

Answer by Geoff Gerrietts

Most of these questions have a lot of words in them and maybe it's hard to find the answer to your specific question.

If you use a regular string and you pass in a pattern like "\t" to the RegEx parser, Python will translate that literal into a buffer with the tab byte in it (0x09).

If you use a raw string and you pass in a pattern like r"\t" to the RegEx parser, Python does not do any interpretation, and it creates a buffer with two bytes in it: '\', and 't'. (0x5c, 0x74).

The RegEx parser knows what to do with the sequence '\t' -- it matches that against a tab. It also knows what to do with the 0x09 character -- that also matches a tab. For the most part, the results will be indistinguishable.
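
A sketch of the two cases in Python 3 syntax follows. The re.VERBOSE check at the end is an addition for illustration, not something the answer mentions: it is one situation where the two patterns differ, because verbose mode ignores unescaped whitespace in the pattern.

```python
import re

cooked = "\t"   # one character: a real tab (0x09)
raw = r"\t"     # two characters: backslash (0x5c) and t (0x74)
print([hex(ord(c)) for c in cooked])
print([hex(ord(c)) for c in raw])

# Both patterns match a tab in ordinary use.
text = "a\tb"
print(re.split(cooked, text), re.split(raw, text))

# With re.VERBOSE, the literal tab in the pattern is ignored as whitespace,
# while the backslash-t escape still means "tab".
print(re.fullmatch(cooked, "\t", re.VERBOSE))
print(re.fullmatch(raw, "\t", re.VERBOSE) is not None)
```

In the last two lines the cooked pattern gives None and the raw pattern gives True.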

So the key to understanding what's happening is recognizing that there are two parsers being employed here. The first one is the Python parser, and it translates your string literal (or raw string literal) into a sequence of bytes. The second one is Python's regular expression parser, and it converts a sequence of bytes into a compiled regular expression.

Answer by Lorenzo Gatti

The relevant Python manual section ("String and Bytes literals") has a clear explanation of raw string literals:

Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters. As a result, in string literals, '\U' and '\u' escapes in raw strings are not treated specially. Given that Python 2.x's raw unicode literals behave differently than Python 3.x's the 'ur' syntax is not supported.

New in version 3.3: The 'rb' prefix of raw bytes literals has been added as a synonym of 'br'.

New in version 3.3: Support for the unicode legacy literal (u'value') was reintroduced to simplify the maintenance of dual Python 2.x and 3.x codebases. See PEP 414 for more information.

In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A “quote” is the character used to open the string, i.e. either ' or ".)

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

Escape Sequence   Meaning                              Notes

\newline          Backslash and newline ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo       (1,3)
\xhh              Character with hex value hh          (2,3)

Escape sequences only recognized in string literals are:

Escape Sequence   Meaning                                        Notes

\N{name}          Character named name in the Unicode database   (4)
\uxxxx            Character with 16-bit hex value xxxx           (5)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx       (6)

Notes:

  1. As in Standard C, up to three octal digits are accepted.

  2. Unlike in Standard C, exactly two hex digits are required.

  3. In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.

  4. Changed in version 3.3: Support for name aliases [1] has been added.

  5. Individual code units which form parts of a surrogate pair can be encoded using this escape sequence. Exactly four hex digits are required.

  6. Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Exactly eight hex digits are required.

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences only recognized in string literals fall into the category of unrecognized escapes for bytes literals.

Even in a raw string, string quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.
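
The rule about trailing backslashes can be tested without breaking the example file itself by compiling the literal from source text. This is a sketch in Python 3 syntax; is_valid_literal is a hypothetical helper written for this illustration, not part of any library:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression (hypothetical helper)."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

# Each argument is Python source text for a raw string literal.
print(is_valid_literal('r"\\""'))    # source text: r"\""  (backslash, quote)
print(is_valid_literal('r"\\"'))     # source text: r"\"   (odd trailing backslash)
print(is_valid_literal('r"\\\\"'))   # source text: r"\\"  (even number of backslashes)
```

This prints True, False, True: only the literal that ends in a single backslash is rejected.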

Answer by deeproyalblue

\n is an Escape Sequence in Python

\w is a Special Sequence in (Python) Regex

They look like they are in the same family but they are not. Raw string notation will affect Escape Sequences but not Regex Special Sequences.
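
A short sketch of the difference, in Python 3 syntax with made-up sample text. The \b case is added here because it is both a Python escape sequence (backspace) and a regex special sequence (word boundary), so the r prefix visibly changes the result:

```python
import re

# \n is a Python escape sequence: the r prefix changes the resulting string.
print(len("\n"), len(r"\n"))

# \w only means something to the regex engine, which needs backslash + w.
word = re.compile(r"\w+")
print(word.findall("raw strings, 2 parsers"))

# Without r, Python turns \b into a backspace before the regex engine sees it.
print(re.findall("\bcat\b", "cat concat"))
# With r, the regex engine receives backslash + b and reads a word boundary.
print(re.findall(r"\bcat\b", "cat concat"))
```

The non-raw \b pattern finds nothing, while the raw one finds the standalone word cat.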

For more about Escape Sequences search for "\newline" https://docs.python.org/3/reference/lexical_analysis.html

For more about Special Sequences: search for "\number" https://docs.python.org/3/library/re.html
