Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/4020539/

Date: 2020-08-18 13:53:37  Source: igfitidea

Process escape sequences in a string in Python

Tags: python, string, escaping

Asked by dln385

Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let's say myString is defined as:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

I want a function (I'll call it process) that does this:

>>> print(process(myString))
spam
eggs

It's important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?

Accepted answer by Jerub

The correct thing to do is to use the 'string-escape' codec to decode the string ('unicode_escape' on Python 3).

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs

Don't use the AST or eval. Using the string codecs is much safer.

Answer by Greg Hewgill

The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.

Of course Python's interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc.), so you may want to wrap the user input in suitable quotes and pass it to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.

Things still might get tricky if the user types unescaped quotes of the same type you intend to wrap around the string.
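As a rough illustration of the quoting idea described above, the sketch below wraps the raw text in double quotes before handing it to ast.literal_eval. The helper name unescape_with_literal_eval is hypothetical, and the choice of double quotes is an assumption. It shows both the benefit (input that looks like a number stays a string) and the failure mode with an unescaped quote.

```python
import ast

def unescape_with_literal_eval(text):
    # Hypothetical helper: wrapping the text in double quotes makes
    # literal_eval parse it as a string literal, never as a number or tuple.
    return ast.literal_eval('"' + text + '"')

print(unescape_with_literal_eval('spam\\neggs'))   # escape is processed
print(repr(unescape_with_literal_eval('42')))      # stays a str, not an int

# An unescaped double quote in the input ends the literal early,
# so the wrapped source is no longer a valid Python literal.
try:
    unescape_with_literal_eval('say "hi"')
except (SyntaxError, ValueError) as exc:
    print('rejected:', type(exc).__name__)
```

Escaping or rejecting the wrapping quote character before the call would be one way around the last case, at the cost of more input handling.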

Answer by rspeer

unicode_escape doesn't work in general

It turns out that the string_escape or unicode_escape solution does not work in general -- in particular, it doesn't work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naÃ¯ve   test

Well, that's wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naÃ¯ve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test

But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)
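To make the Latin-1 assumption concrete, here is a small sketch of why the UTF-8 attempt above comes out mangled. It only uses the standard str.encode and bytes.decode methods; the variable names are illustrative.

```python
text = 'na\u00efve'  # 'naive' with a diaeresis on the i (U+00EF)

# UTF-8 encodes U+00EF as the two bytes 0xC3 0xAF.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)

# Reading those bytes as Latin-1 turns each byte into one character,
# so the single accented letter becomes two unrelated characters.
as_latin1 = utf8_bytes.decode('latin-1')
print(as_latin1)

# With no backslashes present, unicode_escape gives the same result
# as a plain Latin-1 decode.
print(utf8_bytes.decode('unicode_escape') == as_latin1)
```

Any character outside ASCII goes through the same byte-by-byte reinterpretation, which is why only pure-ASCII input survives the round trip through UTF-8.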

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is apply the unicode_escape decoder only to things that we are certain are ASCII text. In particular, we can make sure to apply it only to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is to find escape sequences using a regular expression, and to pass a function as the argument to re.sub that replaces them with their unescaped value.

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő     Rubik
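To check that the regular expression covers each escape family it names, the sketch below restates the pattern and helper so that it runs by itself, then feeds it a few sample inputs. The sample strings are hypothetical and chosen only for illustration.

```python
import codecs
import re

# Pattern and helper restated so that this snippet runs by itself.
ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')
    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

# Hypothetical samples: (raw text containing escapes, expected result).
samples = [
    ('Ern\u0151 \\t Rubik', 'Ern\u0151 \t Rubik'),  # literal non-ASCII plus \t
    ('\\x41\\102\\u0043', 'ABC'),                   # hex, octal, 4-digit hex
    ('\\N{BULLET} item', '\u2022 item'),            # character by name
    ('C:\\\\temp', 'C:\\temp'),                     # escaped backslash
]
for raw, expected in samples:
    result = decode_escapes(raw)
    print(repr(raw), '->', repr(result), result == expected)
```

Because only the matched escape sequences are passed to the codec, the literal non-ASCII character in the first sample is left untouched.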

Answer by user19087

The actually correct and convenient answer for Python 3:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

Details regarding codecs.escape_decode:

  • codecs.escape_decode is a bytes-to-bytes decoder
  • codecs.escape_decode decodes ASCII escape sequences, such as: b"\\n" -> b"\n", b"\\xce" -> b"\xce".
  • codecs.escape_decode does not care or need to know about the byte object's encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.
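The points above can be sketched as follows. Since codecs.escape_decode is undocumented, treat this as an observation of current CPython behavior and not as a guaranteed API; the sample byte strings are made up for illustration.

```python
import codecs

# Escaped UTF-8 bytes for an accented e (0xC3 0xA9), plus an escaped tab.
raw = b'caf\\xc3\\xa9 \\t ok'

# escape_decode returns a tuple; the first item is the unescaped bytes.
unescaped = codecs.escape_decode(raw)[0]
print(unescaped)

# The escaped bytes were UTF-8, matching the rest of the object,
# so a UTF-8 decode succeeds.
print(unescaped.decode('utf-8'))

# If the escaped byte is Latin-1 (0xE9) while the object is treated as
# UTF-8, the final decode fails: the encodings do not match.
mismatched = codecs.escape_decode(b'caf\\xe9')[0]
try:
    mismatched.decode('utf-8')
except UnicodeDecodeError:
    print('escaped byte is not valid UTF-8')
```

The decoder itself never raises in the mismatched case; the problem only surfaces at the final bytes-to-text step.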

Background:

  • @rspeer is correct: unicode_escape is the incorrect solution for Python 3. This is because unicode_escape decodes escaped bytes, then decodes bytes to a Unicode string, but receives no information regarding which codec to use for the second operation.
  • @Jerub is correct: avoid the AST or eval.
  • I first discovered codecs.escape_decode from this answer to "how do I .decode('string-escape') in Python3?". As that answer states, that function is currently not documented for Python 3.

Answer by Vignesh Ramsubbose

The code below should work when \n needs to be displayed in the string.

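# str.replace works on Python 2 and 3; the string.replace() module function exists only in Python 2
our_str = 'The String is \\n, \\n and \\n!'
# Replace the first backslash + n pair with a real newline (the final argument 1 limits the count)
new_str = our_str.replace('\\n', '\n', 1)
print(new_str)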

Answer by LimeTr33

This is a bad way of doing it, but it worked for me when trying to interpret escaped octals passed in a string argument.

import sys
input_string = eval('b"' + sys.argv[1] + '"')

It's worth mentioning that there is a difference between eval and ast.literal_eval (eval being way more unsafe). See Using python's eval() vs. ast.literal_eval()?
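As a hedged illustration of that difference, the sketch below applies the same quoting trick with ast.literal_eval in place of eval. The helper name parse_bytes_argument is hypothetical; it is not part of the answer above.

```python
import ast

def parse_bytes_argument(arg):
    # Hypothetical helper: same wrapping as the eval one-liner, but
    # literal_eval only accepts literal syntax and never calls functions.
    return ast.literal_eval('b"' + arg + '"')

# Escaped octals in the argument are interpreted as in a bytes literal.
print(parse_bytes_argument('\\101\\102'))

# An expression that eval would execute is refused by literal_eval.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    print('literal_eval refused a function call')
```

Input containing an unescaped double quote still fails to parse here, so the quoting caveat raised for literal_eval earlier on this page applies to this helper too.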
