Python: efficient method to replace accents (é to e), remove [^a-zA-Z\d\s], and lower()

Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/15261793/

Time: 2020-08-18 19:38:33  Source: igfitidea

python regex

Asked by oyra

Using Python 3.3. I want to do the following:

  • replace special alphabetical characters such as e acute (é) and o circumflex (ô) with the base character (ô to o, for example)
  • remove all characters except alphanumeric and spaces in between alphanumeric characters
  • convert to lowercase

This is what I have so far:

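import re
# Replace the two accented characters by hand, then lowercase
mystring_modified = mystring.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
# Remove every character that is not an ASCII letter, a digit, or whitespace
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)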

How can I improve this? Efficiency is a big concern, especially since I am currently performing the operations inside a loop:

# Pseudocode
for mystring in myfile:
    mystring_modified = ...  # operations described above
    mylist.append(mystring_modified)

The files in question are about 200,000 characters each.

Answered by John La Rooy

>>> import unicodedata
>>> s='éô'
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'eo'
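
As a rough sketch, the snippet above could be folded into the full pipeline from the question (strip accents, lowercase, drop everything except letters, digits, and whitespace). The function name `normalize_text` and the sample string are made up for illustration, and the regular expression simply mirrors the one in the question.

```python
import re
import unicodedata

# Compiled once, outside any loop. After lower() only a-z needs to be listed.
_NOT_ALNUM_SPACE = re.compile(r"[^a-z\d\s]")

def normalize_text(text):
    """Hypothetical helper: strip accents, lowercase, drop other characters."""
    # NFD splits 'é' into 'e' plus a combining acute accent (category Mn)
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return _NOT_ALNUM_SPACE.sub("", stripped.lower())

print(normalize_text("Café Ôtel, 123!"))  # cafe otel 123
```

Calling it once on the whole file contents, rather than once per line, keeps the per-call overhead low.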

Also check out unidecode

What Unidecode provides is a middle road: function unidecode() takes Unicode data and tries to represent it in ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F), where the compromises taken when mapping between two character sets are chosen to be near what a human with a US keyboard would choose.

The quality of resulting ASCII representation varies. For languages of western origin it should be between perfect and good. On the other hand transliteration (i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system) of languages like Chinese, Japanese or Korean is a very complex issue and this library does not even attempt to address it. It draws the line at context-free character-by-character mapping. So a good rule of thumb is that the further the script you are transliterating is from Latin alphabet, the worse the transliteration will be.

Note that this module generally produces better results than simply stripping accents from characters (which can be done in Python with built-in functions). It is based on hand-tuned character mappings that for example also contain ASCII approximations for symbols and non-Latin alphabets.
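
To illustrate where plain accent stripping falls short, here is a small standard-library sketch. The helper name `strip_accents` and the sample words are chosen for illustration only. Letters such as Ø and ß have no canonical decomposition, so the NFD approach leaves them untouched; hand-tuned mappings of the kind the quoted documentation describes are meant to cover cases like these.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: remove combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

for word in ("café", "Øresund", "straße"):
    # 'é' decomposes into 'e' + accent; 'Ø' and 'ß' do not decompose at all
    print(word, "->", strip_accents(word))
```

With the regular expression from the question applied afterwards, the untouched Ø and ß would simply be deleted rather than replaced by an ASCII approximation.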

Answered by unutbu

You could use str.translate:

import collections
import string

# Any code point that is not listed below maps to None, i.e. it is deleted
table = collections.defaultdict(lambda: None)
table.update({
    ord('é'):'e',
    ord('ô'):'o',
    ord(' '):' ',
    ord('\N{NO-BREAK SPACE}'): ' ',
    ord('\N{EN SPACE}'): ' ',
    ord('\N{EM SPACE}'): ' ',
    ord('\N{THREE-PER-EM SPACE}'): ' ',
    ord('\N{FOUR-PER-EM SPACE}'): ' ',
    ord('\N{SIX-PER-EM SPACE}'): ' ',
    ord('\N{FIGURE SPACE}'): ' ',
    ord('\N{PUNCTUATION SPACE}'): ' ',
    ord('\N{THIN SPACE}'): ' ',
    ord('\N{HAIR SPACE}'): ' ',
    ord('\N{ZERO WIDTH SPACE}'): ' ',
    ord('\N{NARROW NO-BREAK SPACE}'): ' ',
    ord('\N{MEDIUM MATHEMATICAL SPACE}'): ' ',
    ord('\N{IDEOGRAPHIC SPACE}'): ' ',
    ord('\N{IDEOGRAPHIC HALF FILL SPACE}'): ' ',
    ord('\N{ZERO WIDTH NO-BREAK SPACE}'): ' ',
    ord('\N{TAG SPACE}'): ' ',
    })
# Map uppercase ASCII letters to lowercase; keep lowercase letters and digits
table.update(dict(zip(map(ord,string.ascii_uppercase), string.ascii_lowercase)))
table.update(dict(zip(map(ord,string.ascii_lowercase), string.ascii_lowercase)))
table.update(dict(zip(map(ord,string.digits), string.digits)))

# The trailing '?' is not in the table, so it is deleted
print('123 fôé BAR?'.translate(table))

yields

123 foe bar


On the down-side, you'll have to list all the special accented characters that you wish to translate. @gnibbler's method requires less coding.
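
One way to reduce that listing burden, offered only as a sketch and not as part of the original answer, is to let the table compute its entries on demand. `str.translate` looks code points up with `table[codepoint]`, so a `dict` subclass with `__missing__` can derive each mapping from the NFD decomposition the first time a character is seen and cache it. The class name `FoldingTable` is hypothetical. Unlike the table above, this version deletes non-ASCII spaces instead of mapping them to a plain space.

```python
import unicodedata

class FoldingTable(dict):
    """Hypothetical lazy table for str.translate; entries are cached on first use."""

    def __missing__(self, codepoint):
        # The first character of the NFD form is the base letter ('é' -> 'e')
        base = unicodedata.normalize("NFD", chr(codepoint))[0].lower()
        if ord(base) < 128 and (base.isalnum() or base.isspace()):
            result = base   # keep ASCII letters, digits, and whitespace
        else:
            result = None   # None tells str.translate to delete the character
        self[codepoint] = result
        return result

table = FoldingTable()
print('123 fôé BAR?'.translate(table))  # 123 foe bar
```

Because results are cached in the dictionary itself, the decomposition runs at most once per distinct character, however long the input is.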

On the up-side, the str.translate method should be fairly fast and it can handle all your requirements (downcasing, deleting and removing accents) in one function call once the table is set up.

By the way, a file with 200K characters is not very large. So it would be more efficient to read the entire file into a single str, then translate it in one function call.
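
A minimal sketch of that idea follows, assuming a UTF-8 text file and a translation table that leaves newlines in place. The function name `translate_file` and the tiny `demo_table` are illustrative only; a table like the `defaultdict` one above deletes newlines unless `ord('\n')` is added to it.

```python
def translate_file(path, table, encoding="utf-8"):
    """Hypothetical helper: read the whole file, translate once, split into lines."""
    with open(path, encoding=encoding) as handle:
        text = handle.read()            # the entire file as a single str
    # One translate call for the whole text, then split back into lines
    return text.translate(table).splitlines()

# Small demo table: characters that are not listed stay unchanged,
# and a value of None deletes the character
demo_table = str.maketrans({"é": "e", "ô": "o", "!": None})
```

This replaces the per-line loop from the question with a single `translate` call followed by `splitlines()`.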
