How to remove ANSI escape sequences from a string in Python

Notice: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14693701/

Date: 2020-08-18 12:10:25  Source: igfitidea

How can I remove the ANSI escape sequences from a string in Python?

Tags: python, string, escaping, ansi-escape

Asked by SpartaSixZero

This is my string:

'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'

I was using code to retrieve the output from an SSH command, and I want my string to contain only 'examplefile.zip'.

What can I use to remove the extra escape sequences?

Accepted answer by Martijn Pieters

Delete them with a regular expression:

import re

# 7-bit C1 ANSI sequences
ansi_escape = re.compile(r'''
    \x1B  # ESC
    (?:   # 7-bit C1 Fe (except CSI)
        [@-Z\\-_]
    |     # or [ for CSI, followed by a control sequence
        \[
        [0-?]*  # Parameter bytes
        [ -/]*  # Intermediate bytes
        [@-~]   # Final byte
    )
''', re.VERBOSE)
result = ansi_escape.sub('', sometext)

or, without the VERBOSE flag, in condensed form:

ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
result = ansi_escape.sub('', sometext)

Demo:

>>> import re
>>> ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
>>> sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
>>> ansi_escape.sub('', sometext)
'ls\r\nexamplefile.zip\r\n'

The above regular expression covers all 7-bit ANSI C1 escape sequences, but not the 8-bit C1 escape sequence openers. The latter are never used in today's UTF-8 world, where the same range of bytes has a different meaning.

If you do need to cover the 8-bit codes too (and are then, presumably, working with bytes values), then the regular expression becomes a bytes pattern like this:

# 7-bit and 8-bit C1 ANSI sequences
ansi_escape_8bit = re.compile(br'''
    (?: # either 7-bit C1, two bytes, ESC Fe (omitting CSI)
        \x1B
        [@-Z\\-_]
    |   # or a single 8-bit byte Fe (omitting CSI)
        [\x80-\x9A\x9C-\x9F]
    |   # or CSI + control codes
        (?: # 7-bit CSI, ESC [ 
            \x1B\[
        |   # 8-bit CSI, 9B
            \x9B
        )
        [0-?]*  # Parameter bytes
        [ -/]*  # Intermediate bytes
        [@-~]   # Final byte
    )
''', re.VERBOSE)
result = ansi_escape_8bit.sub(b'', somebytesvalue)

which can be condensed down to

# 7-bit and 8-bit C1 ANSI sequences
ansi_escape_8bit = re.compile(
    br'(?:\x1B[@-Z\\-_]|[\x80-\x9A\x9C-\x9F]|(?:\x1B\[|\x9B)[0-?]*[ -/]*[@-~])'
)
result = ansi_escape_8bit.sub(b'', somebytesvalue)
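
As a rough usage sketch of the bytes pattern: the sample value `raw` below is made up for illustration and mixes one 8-bit CSI opener (0x9B) with one 7-bit opener (ESC [). It assumes the data really uses 8-bit C1 controls; on UTF-8 encoded bytes the same pattern would also strip continuation bytes in the 0x80-0x9F range, so it should not be applied there.

```python
import re

# Condensed 7-bit and 8-bit pattern from above
ansi_escape_8bit = re.compile(
    br'(?:\x1B[@-Z\\-_]|[\x80-\x9A\x9C-\x9F]|(?:\x1B\[|\x9B)[0-?]*[ -/]*[@-~])'
)

# Hypothetical sample: an 8-bit CSI (0x9B) followed later by a 7-bit CSI (ESC [)
raw = b'\x9b01;31mexamplefile.zip\x1b[00m\r\n'

cleaned = ansi_escape_8bit.sub(b'', raw)
print(cleaned)  # b'examplefile.zip\r\n'
```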

For more information, see the ANSI escape code overview on Wikipedia and the ECMA-48 standard.

The example you gave contains 4 CSI (Control Sequence Introducer) codes, as marked by the \x1B[ or ESC [ opening bytes, and each contains an SGR (Select Graphic Rendition) code, because they each end in m. The parameters (separated by ; semicolons) in between those tell your terminal what graphic rendition attributes to use. So for each \x1B[....m sequence, the 3 codes that are used are:

  • 0 (or 00 in this example): reset, disable all attributes
  • 1 (or 01 in the example): bold
  • 31: red (foreground)
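
To make the breakdown above concrete, here is a small sketch that pulls the SGR parameters out of the example string. The pattern `sgr` is an illustrative helper that only looks at ESC [ ... m sequences; it is not part of the regex used for stripping.

```python
import re

# Matches only CSI ... m (SGR) sequences and captures the parameter string
sgr = re.compile(r'\x1B\[([0-9;]*)m')

sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'

# Split each parameter string on ';'; an empty parameter is treated as 0 (reset)
codes = [[int(p or 0) for p in params.split(';')]
         for params in sgr.findall(sometext)]
print(codes)  # [[0], [1, 31], [0], [1, 31]]
```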

However, there is more to ANSI than just CSI SGR codes. With CSI alone you can also control the cursor, clear lines or the whole display, or scroll (provided the terminal supports this, of course). And beyond CSI, there are codes to select alternative fonts (SS2 and SS3), to send 'private messages' (think passwords), to communicate with the terminal (DCS), the OS (OSC), or the application itself (APC, a way for applications to piggy-back custom control codes on to the communication stream), and further codes to help define strings (SOS, Start of String; ST, String Terminator) or to reset everything back to a base state (RIS). The above regexes cover all of these.

Note that the above regex only removes the ANSI C1 codes, however, and not any additional data that those codes may be marking up (such as the strings sent between an OSC opener and the terminating ST code). Removing those would require additional work outside the scope of this answer.
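
As a hedged sketch of what that additional work could look like for one case: the pattern `osc_string` below is an assumption, not part of the answer. It treats an OSC string as ESC ] followed by a payload and terminated by either BEL (a common terminal convention) or ST (ESC \), and removes it before the general regex runs. The function name and sample are illustrative.

```python
import re

# Assumption: an OSC string is ESC ] + payload + terminator (BEL or ESC \)
osc_string = re.compile(r'\x1B\][^\x07\x1B]*(?:\x07|\x1B\\)')
# General 7-bit pattern from the answer
ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

def strip_ansi_with_osc(text):
    """Remove OSC strings including their payload, then the remaining codes."""
    text = osc_string.sub('', text)
    return ansi_escape.sub('', text)

# Hypothetical sample: a window-title OSC followed by colored text
sample = '\x1b]0;window title\x07\x1b[01;31mexamplefile.zip\x1b[00m'
print(strip_ansi_with_osc(sample))  # examplefile.zip
```

Other string-carrying codes (DCS, APC, SOS, PM) would need similar handling and are not covered by this sketch.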

Answer by Neodied

If you want to remove the \r\n bit, you can pass the string through this function (written by sarnold):

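def stripEscape(string):
    """ Removes all control characters (code points 0x01-0x1F) from the input string """
    # Collect every character from chr(1) up to chr(0x1F)
    delete = "".join(chr(i) for i in range(1, 0x20))
    # Python 3: str.translate takes a single table (Python 2 used translate(None, delete))
    return string.translate(str.maketrans("", "", delete))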

Careful though, this will lump together the text before and after the escape sequences. So, using Martijn's filtered string 'ls\r\nexamplefile.zip\r\n', you will get lsexamplefile.zip. Note the ls in front of the desired filename.

I would use the stripEscape function first to remove the escape sequences, then pass the output to Martijn's regular expression, which would avoid concatenating the unwanted bit.
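
One caveat worth checking before relying on that order: stripEscape also deletes the ESC character (0x1B) itself, so a regex that looks for \x1B afterwards has nothing left to match. A minimal Python 3 sketch comparing both orders follows; the names `control_chars`, `regex_first` and `strip_first` are illustrative only.

```python
import re

ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
# Translation table that deletes code points 0x01-0x1F, like stripEscape
control_chars = str.maketrans('', '', ''.join(chr(i) for i in range(1, 0x20)))

sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'

# Regex first, then strip the remaining control characters
regex_first = ansi_escape.sub('', sometext).translate(control_chars)
# Strip control characters first, then apply the regex
strip_first = ansi_escape.sub('', sometext.translate(control_chars))

print(regex_first)  # lsexamplefile.zip
print(strip_first)  # ls[00m[01;31mexamplefile.zip[00m[01;31m
```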

Answer by Jeff

The accepted answer to this question only considers color and font effects. There are a lot of sequences that do not end in 'm', such as cursor positioning, erasing, and scroll regions.

The complete regexp for Control Sequences (aka ANSI Escape Sequences) is

/(\x9B|\x1B\[)[0-?]*[ -\/]*[@-~]/

Refer to ECMA-48 Section 5.4 and ANSI escape code.
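
For illustration, the same control-sequence pattern can be written as a Python raw string. The sketch below applies it to a made-up sample that mixes sequences not ending in m (erase display, cursor positioning) with color codes; the names `csi` and `sample` are assumptions for the example.

```python
import re

# The control-sequence pattern above, as a Python raw string
csi = re.compile(r'(?:\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')

# Hypothetical sample: clear screen, position cursor, then colored text
sample = '\x1b[2J\x1b[4;3Hhello \x1b[1;32mworld\x1b[0m'
print(csi.sub('', sample))  # hello world
```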

Answer by Édouard Lopez

Function

Based on Martijn Pieters's answer with Jeff's regexp.

def escape_ansi(line):
    ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', line)

Test

def test_remove_ansi_escape_sequence(self):
    line = '\t\u001b[0;35mBlabla\u001b[0m                                  \u001b[0;36m172.18.0.2\u001b[0m'

    escaped_line = escape_ansi(line)

    self.assertEqual(escaped_line, '\tBlabla                                  172.18.0.2')


Testing

If you want to run it yourself, use python3 (better unicode support, blablabla). Here is how the test file should look:

import unittest
import re

def escape_ansi(line):
    …

class TestStringMethods(unittest.TestCase):
    def test_remove_ansi_escape_sequence(self):
        …

if __name__ == '__main__':
    unittest.main()

Answer by kfir

The suggested regex didn't do the trick for me, so I created one of my own. The following is a Python regex that I created based on the spec found on the ascii-table.com page:

ansi_regex = r'\x1b(' \
             r'(\[\??\d+[hl])|' \
             r'([=<>a-kzNM78])|' \
             r'([\(\)][a-b0-2])|' \
             r'(\[\d{0,2}[ma-dgkjqi])|' \
             r'(\[\d+;\d+[hfy]?)|' \
             r'(\[;?[hf])|' \
             r'(#[3-68])|' \
             r'([01356]n)|' \
             r'(O[mlnp-z]?)|' \
             r'(/Z)|' \
             r'(\d+)|' \
             r'(\[\?\d;\d0c)|' \
             r'(\d;\dR))'
ansi_escape = re.compile(ansi_regex, flags=re.IGNORECASE)

I tested my regex on the following snippet (basically a copy-paste from the ascii-table.com page):

\x1b[20h    Set
\x1b[?1h    Set
\x1b[?3h    Set
\x1b[?4h    Set
\x1b[?5h    Set
\x1b[?6h    Set
\x1b[?7h    Set
\x1b[?8h    Set
\x1b[?9h    Set
\x1b[20l    Set
\x1b[?1l    Set
\x1b[?2l    Set
\x1b[?3l    Set
\x1b[?4l    Set
\x1b[?5l    Set
\x1b[?6l    Set
\x1b[?7l    Reset
\x1b[?8l    Reset
\x1b[?9l    Reset
\x1b=   Set
\x1b>   Set
\x1b(A  Set
\x1b)A  Set
\x1b(B  Set
\x1b)B  Set
\x1b(0  Set
\x1b)0  Set
\x1b(1  Set
\x1b)1  Set
\x1b(2  Set
\x1b)2  Set
\x1bN   Set
\x1bO   Set
\x1b[m  Turn
\x1b[0m Turn
\x1b[1m Turn
\x1b[2m Turn
\x1b[4m Turn
\x1b[5m Turn
\x1b[7m Turn
\x1b[8m Turn
\x1b[1;2    Set
\x1b[1A Move
\x1b[2B Move
\x1b[3C Move
\x1b[4D Move
\x1b[H  Move
\x1b[;H Move
\x1b[4;3H   Move
\x1b[f  Move
\x1b[;f Move
\x1b[1;2    Move
\x1bD   Move/scroll
\x1bM   Move/scroll
\x1bE   Move
\x1b7   Save
\x1b8   Restore
\x1bH   Set
\x1b[g  Clear
\x1b[0g Clear
\x1b[3g Clear
\x1b#3  Double-height
\x1b#4  Double-height
\x1b#5  Single
\x1b#6  Double
\x1b[K  Clear
\x1b[0K Clear
\x1b[1K Clear
\x1b[2K Clear
\x1b[J  Clear
\x1b[0J Clear
\x1b[1J Clear
\x1b[2J Clear
\x1b5n  Device
\x1b0n  Response:
\x1b3n  Response:
\x1b6n  Get
\x1b[c  Identify
\x1b[0c Identify
\x1b[?1;20c Response:
\x1bc   Reset
\x1b#8  Screen
\x1b[2;1y   Confidence
\x1b[2;2y   Confidence
\x1b[2;9y   Repeat
\x1b[2;10y  Repeat
\x1b[0q Turn
\x1b[1q Turn
\x1b[2q Turn
\x1b[3q Turn
\x1b[4q Turn
\x1b<   Enter/exit
\x1b=   Enter
\x1b>   Exit
\x1bF   Use
\x1bG   Use
\x1bA   Move
\x1bB   Move
\x1bC   Move
\x1bD   Move
\x1bH   Move
\x1b12  Move
\x1bI  
\x1bK  
\x1bJ  
\x1bZ  
\x1b/Z 
\x1bOP 
\x1bOQ 
\x1bOR 
\x1bOS 
\x1bA  
\x1bB  
\x1bC  
\x1bD  
\x1bOp 
\x1bOq 
\x1bOr 
\x1bOs 
\x1bOt 
\x1bOu 
\x1bOv 
\x1bOw 
\x1bOx 
\x1bOy 
\x1bOm 
\x1bOl 
\x1bOn 
\x1bOM 
\x1b[i 
\x1b[1i
\x1b[4i
\x1b[5i

Hopefully this will help others :)

Answer by Rory

If it helps future Stack Overflowers: I was using the crayons library to give my Python output a bit more visual impact, which is advantageous as it works on both Windows and Linux platforms. I was both displaying output onscreen and appending it to log files, though, and the escape sequences were hurting the legibility of the log files, so I wanted to strip them out. However, the escape sequences inserted by crayons produced an error:

expected string or bytes-like object

The solution was to cast the parameter to a string, so only a tiny modification to the commonly accepted answer was needed:

def escape_ansi(line):
    ansi_escape = re.compile(r'(\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', str(line))
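
A self-contained sketch of the same idea, with the pattern compiled once at module level. The class `FakeColoredString` is a hypothetical stand-in for any non-str object whose str() form contains color codes; it is not the crayons API.

```python
import re

# Compile once at import time instead of on every call
ANSI_CSI = re.compile(r'(\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')

class FakeColoredString:
    """Hypothetical stand-in for an object that renders with color codes."""
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return '\x1b[31m' + self.text + '\x1b[0m'

def escape_ansi(line):
    # str() converts non-str objects before the regex sees them
    return ANSI_CSI.sub('', str(line))

print(escape_ansi(FakeColoredString('error: disk full')))  # error: disk full
```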