在 Python 中使用 string.translate 来音译西里尔文?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14173421/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
use string.translate in Python to transliterate Cyrillic?
提问by Nemoden
I'm getting UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)exception trying to use string.maketransin Python. I'm kinda discouraged with this kind of error in following code (gist):
我得到UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)异常尝试使用string.maketrans在Python。我对以下代码(gist)中的这种错误感到沮丧:
# -*- coding: utf-8 -*-
import string
def translit1(string):
""" This function works just fine """
capital_letters = {
u'А': u'A',
u'Б': u'B',
u'В': u'V',
u'Г': u'G',
u'Д': u'D',
u'Е': u'E',
u'Ё': u'E',
u'Ж': u'Zh',
u'З': u'Z',
u'И': u'I',
u'Й': u'Y',
u'К': u'K',
u'Л': u'L',
u'М': u'M',
u'Н': u'N',
u'О': u'O',
u'П': u'P',
u'Р': u'R',
u'С': u'S',
u'Т': u'T',
u'У': u'U',
u'Ф': u'F',
u'Х': u'H',
u'Ц': u'Ts',
u'Ч': u'Ch',
u'Ш': u'Sh',
u'Щ': u'Sch',
u'Ъ': u'',
u'Ы': u'Y',
u'Ь': u'',
u'Э': u'E',
u'Ю': u'Yu',
u'Я': u'Ya'
}
lower_case_letters = {
u'а': u'a',
u'б': u'b',
u'в': u'v',
u'г': u'g',
u'д': u'd',
u'е': u'e',
u'ё': u'e',
u'ж': u'zh',
u'з': u'z',
u'и': u'i',
u'й': u'y',
u'к': u'k',
u'л': u'l',
u'м': u'm',
u'н': u'n',
u'о': u'o',
u'п': u'p',
u'р': u'r',
u'с': u's',
u'т': u't',
u'у': u'u',
u'ф': u'f',
u'х': u'h',
u'ц': u'ts',
u'ч': u'ch',
u'ш': u'sh',
u'щ': u'sch',
u'ъ': u'',
u'ы': u'y',
u'ь': u'',
u'э': u'e',
u'ю': u'yu',
u'я': u'ya'
}
translit_string = ""
for index, char in enumerate(string):
if char in lower_case_letters.keys():
char = lower_case_letters[char]
elif char in capital_letters.keys():
char = capital_letters[char]
if len(string) > index+1:
if string[index+1] not in lower_case_letters.keys():
char = char.upper()
else:
char = char.upper()
translit_string += char
return translit_string
def translit2(text):
""" This method should be more easy to grasp,
but throws exception:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
"""
symbols = string.maketrans(u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
sequence = {
u'ж':'zh',
u'ц':'ts',
u'ч':'ch',
u'ш':'sh',
u'щ':'sch',
u'ю':'ju',
u'я':'ja',
u'Ж':'Zh',
u'Ц':'Ts',
u'Ч':'Ch'
}
for char in sequence.keys():
text = text.replace(char, sequence[char])
return text.translate(symbols)
if __name__ == "__main__":
print translit1(u"Привет") # prints Privet as expected
print translit2(u"Привет") # throws exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
Original trace:
原始轨迹:
Traceback (most recent call last):
File "translit_error.py", line 124, in <module>
print translit2(u"Привет") # throws exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
File "translit_error.py", line 103, in translit2
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
I mean, why Python string.maketranstrying to use asciitable anyway? And how comes English alphabet letters are out of 0-128 range?
我的意思是,为什么 Python仍然string.maketrans尝试使用asciitable?为什么英文字母不在 0-128 范围内?
$ python -c "print ord(u'A')"
65
$ python -c "print ord(u'z')"
122
$ python -c "print ord(u\"'\")"
39
After several hours I feel like absolutely exhausted to solve this issue.
几个小时后,我觉得解决这个问题已经筋疲力尽了。
Can someone say what is happening and how to fix it?
有人可以说发生了什么以及如何解决吗?
采纳答案by georg
translatebehaves differently when used with unicode strings. Instead of a maketranstable, you have to provide a dictionary ord(search)->ord(replace):
当与 unicode 字符串一起使用时,translate 的行为有所不同。maketrans您必须提供字典而不是表格ord(search)->ord(replace):
symbols = (u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
u"abvgdeejzijklmnoprstufhzcss_y_euaABVGDEEJZIJKLMNOPRSTUFHZCSS_Y_EUA")
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
# for Python 2.*:
# tr = dict( [ (ord(a), ord(b)) for (a, b) in zip(*symbols) ] )
text = u'Добрый Ден'
print text.translate(tr) # looks good
That said, I'd second the suggestion not to reinvent the wheel and to use an established library: http://pypi.python.org/pypi/Unidecode
也就是说,我建议不要重新发明轮子并使用已建立的库:http: //pypi.python.org/pypi/Unidecode
回答by Artur Barseghyan
You can use transliterate package (https://pypi.python.org/pypi/transliterate)
您可以使用音译包(https://pypi.python.org/pypi/transliterate)
Example #1:
示例#1:
from transliterate import translit
print translit("Lorem ipsum dolor sit amet", "ru")
# Лорем ипсум долор сит амет
Example #2:
示例#2:
print translit(u"Лорем ипсум долор сит амет", "ru", reversed=True)
# Lorem ipsum dolor sit amet
回答by Georges
Check out the CyrTranslitpackage, it's specifically made to transliterate from and to Cyrillic script text. It currently supports Serbian, Montenegrin, Macedonian, and Russian.
查看CyrTranslit包,它专门用于音译西里尔文脚本文本。它目前支持塞尔维亚语、黑山语、马其顿语和俄语。
Example usage:
用法示例:
>>> import cyrtranslit
>>> cyrtranslit.supported()
['me', 'sr', 'mk', 'ru']
>>> cyrtranslit.to_latin('Моё судно на воздушной подушке полно угрей', 'ru')
'Moyo sudno na vozdushnoj podushke polno ugrej'
>>> cyrtranslit.to_cyrillic('Moyo sudno na vozdushnoj podushke polno ugrej')
'Моё судно на воздушной подушке полно угрей'

