Python: replace non-ASCII characters with a single space

Question

Asked by dotancohen

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i)<128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoding (i.e. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]',' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses character replacement as opposed to stripping, and none addresses all non-ASCII characters rather than one specific character.

Answer 1

Accepted answer by Martijn Pieters

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.

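To make the difference between the two suggestions concrete, here is a minimal sketch that puts them side by side. The function names replace_per_char and replace_per_run are made up for this illustration and do not appear in the answer itself.

```python
import re

def replace_per_char(text: str) -> str:
    """Replace each non-ASCII character with one space (hypothetical helper)."""
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_per_run(text: str) -> str:
    """Replace each run of consecutive non-ASCII characters with one space."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC马克def–ghi'
print(repr(replace_per_char(sample)))  # 'ABC  def ghi'
print(repr(replace_per_run(sample)))   # 'ABC def ghi'
```

Which one fits depends on whether the output should keep the same length as the input (per character) or read more naturally (per run).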

Answer 2

Answered by Mark Tolonen

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'

Answer 3

Answered by Alvaro Fuentes

To get the closest ASCII representation of your original string, I recommend the unidecode module:

from unidecode import unidecode

def remove_non_ascii(text):
    # On Python 3, str is already Unicode. On Python 2, decode first:
    # unidecode(unicode(text, encoding="utf-8"))
    return unidecode(text)

Then you can use it on a string:

remove_non_ascii("Ceñía")
Cenia

Answer 4

Answered by parsecer

What about this one?

def replace_trash(unicode_string):
    for i in range(len(unicode_string)):
        try:
            unicode_string[i].encode("ascii")
        except UnicodeEncodeError:
            # Non-ASCII character: replace it with a single space.
            # The length is unchanged, so the index range stays valid.
            unicode_string = unicode_string[:i] + " " + unicode_string[i + 1:]
    return unicode_string

Answer 5

Answered by AXO

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
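
If a space is required rather than '?', one option that stays on the codec path is a custom error handler registered with codecs.register_error. This is only a sketch: the handler name 'space_replace' and the helper functions are invented for the example, and no timing has been measured for it here.

```python
import codecs

def _space_replace(exc):
    # Hypothetical handler: emit one space per unencodable character
    # and resume encoding right after the failing slice.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (' ' * (exc.end - exc.start), exc.end)

# 'space_replace' is a made-up name; any unused handler name works.
codecs.register_error('space_replace', _space_replace)

def replace_with_spaces(text: str) -> str:
    return text.encode('ascii', 'space_replace').decode('ascii')

print(repr(replace_with_spaces('ABC马克def?')))  # 'ABC  def?'
```

Unlike post-processing the '?' output, this leaves question marks that were already in the text untouched.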

Answer 6

Answered by Kasramvd

As a native and efficient approach, you don't need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii',errors='ignore')

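Now if you want to make up for the deleted characters with spaces, do the following (note that this appends the spaces to the end rather than putting them where the characters were, and that the result is a bytes object):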

final_string = new_string + b' ' * (len(old_string) - len(new_string))
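
A small sketch of what those two lines produce when combined may help. The wrapper name ignore_and_pad is hypothetical, and the final decode step is an addition so that the result can be printed as a str.

```python
def ignore_and_pad(old_string: str) -> str:
    # Drop every non-ASCII character, leaving a bytes object
    new_bytes = old_string.encode('ascii', errors='ignore')
    # Pad on the right with one space per dropped character
    padded = new_bytes + b' ' * (len(old_string) - len(new_bytes))
    return padded.decode('ascii')

print(repr(ignore_and_pad('ABC马克def')))  # 'ABCdef  '
```

The length matches the input, but the spaces end up at the end of the string, so this is not a positional replacement.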

Answer 7

Answered by seaders

Potentially for a different question, but I'm providing my version of @Alvaro's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. remove whitespace characters from the beginning and end of the string, and then replace any other whitespace characters with a "regular" space, i.e.

"Ceñía?mañana????"

to

"Ceñía mañana"

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)
We first replace all non-unicode spaces with a regular space (and join it back again),

''.join((c if unidecode(c) else ' ') for c in s)

And then we split that again, with Python's normal split, and strip each "bit",

(bit.strip() for bit in s.split())

And lastly join those back again, but only if the string passes an if test,

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('????Ceñía?mañana????') correctly returns 'Ceñía mañana'.

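For comparison, when the only unusual characters are ones Python itself classifies as whitespace (a no-break space or an em space, for instance), a standard-library-only sketch may be enough. This is a different technique from the unidecode version above, the name collapse_whitespace is invented for the example, and characters that str.isspace() does not treat as whitespace, such as U+200B ZERO WIDTH SPACE, would pass through unchanged.

```python
def collapse_whitespace(s: str) -> str:
    # str.split() with no arguments splits on runs of Unicode whitespace
    # and discards leading/trailing whitespace, so joining with ' '
    # leaves exactly one regular space between the remaining pieces.
    return ' '.join(s.split())

# EM SPACE (U+2003) and NO-BREAK SPACE (U+00A0) around and between the words
messy = '\u2003\u00a0Ce\u00f1\u00eda\u00a0ma\u00f1ana\u2003\u2003'
print(repr(collapse_whitespace(messy)))  # 'Ceñía mañana'
```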
