Removing control characters from a string in Python

Question

Asked by David

I currently have the following code

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

This does not work if more than one character has to be deleted.

Answer 1

Accepted answer by Alex Quinn

There are hundreds of control characters in Unicode. If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".

This snippet removes all control characters from a string.

import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")
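Note that the category test also drops tabs and newlines, since both are in category Cc. If those should survive, a variant along the following lines might help. This is only a sketch: the function name and the `keep` parameter are hypothetical, not part of the answer above.

```python
import unicodedata

def remove_control_characters_except(s, keep="\n\t"):
    """Drop every character whose category starts with "C", except those in `keep`.

    `keep` is an illustrative parameter (an assumption, not from the original answer).
    """
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )

sample = "a\x00b\tc\u200bd\n"
print(repr(remove_control_characters_except(sample)))           # 'ab\tcd\n'
print(repr(remove_control_characters_except(sample, keep="")))  # 'abcd'
```

With `keep=""` the behaviour matches the original snippet.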

Examples of unicode categories:

>>> from unicodedata import category
>>> category('\r')      # carriage return --> Cc : control character
'Cc'
>>> category('\0')      # null character ---> Cc : control character
'Cc'
>>> category('\t')      # tab --------------> Cc : control character
'Cc'
>>> category(' ')       # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A')       # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',')       # comma  -----------> Po : punctuation
'Po'
>>>
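Testing for a leading "C" matches more than category Cc: it also covers format characters (Cf), surrogates (Cs), private-use code points (Co) and unassigned code points (Cn). The following sketch prints one sample from each group; the sample characters are my own choice for illustration.

```python
import unicodedata

# One hand-picked sample per "C" subcategory (illustrative choices).
samples = [
    ("\x1b", "escape"),
    ("\u200b", "zero width space"),
    ("\ud800", "lone surrogate"),
    ("\ue000", "private use"),
    ("\uffff", "noncharacter"),
]

categories = [unicodedata.category(ch) for ch, _ in samples]

for (ch, label), cat in zip(samples, categories):
    # Print the code point rather than the character itself,
    # since a lone surrogate cannot be encoded for output.
    print(f"U+{ord(ch):04X} {label}: {cat}")
```

If only true control characters should be removed, comparing the category against "Cc" exactly is the narrower test.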

Answer 2

Answered by SilentGhost

You could use str.translate with the appropriate map, for example like this:

>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
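A str.translate map can also cover DEL and the C1 control range, not only code points below 32. The sketch below assumes Python 3, where mapping a code point to None deletes it; the names `CONTROL_MAP` and `strip_controls` are made up for this example.

```python
# Code points 0-31 (C0 controls), 127 (DEL) and 128-159 (C1 controls),
# each mapped to None so that str.translate deletes them.
CONTROL_MAP = dict.fromkeys(list(range(32)) + list(range(127, 160)))

def strip_controls(text):
    """Return `text` without C0/C1 control characters (hypothetical helper)."""
    return text.translate(CONTROL_MAP)

print(repr(strip_controls("abc\x02de\x7f\x85!")))  # 'abcde!'
```

Building the map once and reusing it avoids rebuilding the dictionary on every call.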

Answer 3

Answered by Mark Byers

Your implementation is wrong because the value of i is incorrect. However, that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n²) instead of O(n). Try this instead:

return ''.join(c for c in line if ord(c) >= 32)
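To make the O(n) point concrete, here is one possible way to write the whole function without tracking an index at all. It is a sketch of the idea rather than the only fix: the characters to keep are collected in a list and joined once at the end, so no intermediate strings are rebuilt.

```python
def removeControlCharacters(line):
    # Collect the characters to keep instead of slicing `line` in place;
    # appending to a list and joining once is linear in len(line).
    kept = []
    for c in line:
        if ord(c) >= 32:
            kept.append(c)
    return "".join(kept)

print(repr(removeControlCharacters("a\x01\x02b\nc")))  # 'abc'
```

Two adjacent control characters, the case that breaks the index-based loop, are handled like any other input.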

Answer 4

Answered by khachik

You modify the line while iterating over it. Use something like ''.join([x for x in line if ord(x) >= 32]) instead.

Answer 5

Answered by Kabie

import string
# ''.join is needed on Python 3, where filter returns an iterator rather than a string
''.join(filter(string.printable[:-5].__contains__, line))

Answer 6

Answered by Eric O Lebigot

And for Python 2, with the builtin translate:

import string
all_bytes = string.maketrans('', '')  # String of 256 characters with (byte) value 0 to 255

line.translate(all_bytes, all_bytes[:32])  # All bytes < 32 are deleted (the second argument lists the bytes to delete)
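The two-argument translate described for Python 2 has a close counterpart on Python 3 bytes objects: bytes.translate accepts None as the table and a second argument listing the bytes to delete. A minimal sketch, with `strip_low_bytes` as a hypothetical helper name:

```python
# All byte values below 32, i.e. the ASCII control range.
LOW_BYTES = bytes(range(32))

def strip_low_bytes(data: bytes) -> bytes:
    """Delete every byte < 32 from `data` (hypothetical helper)."""
    # A table of None means "no remapping"; the second argument lists bytes to drop.
    return data.translate(None, LOW_BYTES)

print(strip_low_bytes(b"abc\x02de\n"))  # b'abcde'
```

This works on raw bytes only; for str objects, the dictionary-based str.translate is the matching tool.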

Answer 7

Answered by AXO

Anyone interested in a regex character class that matches any Unicode control character (category Cc) may use [\x00-\x1f\x7f-\x9f].

You may test it like this:

>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True

So to remove the control characters using re, just use the following:

>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
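When the character class [\x00-\x1f\x7f-\x9f] is applied to many strings, compiling it once with the standard re module may be convenient. The sketch below assumes only the standard library; `CONTROL_RE` and `strip_cc` are names invented for this example.

```python
import re

# Matches exactly the C0 range, DEL, and the C1 range.
CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def strip_cc(s):
    """Remove characters matched by the control character class (hypothetical helper)."""
    return CONTROL_RE.sub("", s)

print(repr(strip_cc("abc\x02de\x9f")))  # 'abcde'
```

Unlike a test on the leading "C" of the category, this leaves format characters such as the zero width space untouched.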

Answer 8

Answered by cmc

This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.

pip install regex

import regex as re
def remove_control_characters(s):
    return re.sub(r'\p{C}', '', s)

\p{C} is the Unicode character property for control characters (general category C, "Other"), so you can leave it up to the Unicode Consortium to decide which of the more than one million Unicode code points should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of whitespace.
