Normalizing Unicode in Python
Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/16467479/
Normalizing Unicode
Asked by michaelmeyer
Is there a standard way, in Python, to normalize a unicode string so that it comprises only the simplest unicode entities that can be used to represent it?
I mean something that would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE'].
Here is where the problem shows up:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "a\u0301"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the characters and do manual replacements, but that is not efficient, and I'm pretty sure I would miss half of the special cases and make mistakes.
Accepted answer by Martijn Pieters
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form:
>>> unicodedata.normalize('NFC', u'\u0061\u0301')
u'\xe1'
>>> unicodedata.normalize('NFD', u'\u00e1')
u'a\u0301'
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters, that is, base characters followed by combining marks.
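As a minimal sketch of the NFC/NFD behaviour described above, assuming Python 3 (where every str is Unicode, so the u'' prefix is not needed), the sequence from the question can be composed and split again like this:

```python
import unicodedata

# LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT (two code points).
decomposed = "a\u0301"

# NFC composes the pair into the single precomposed code point U+00E1.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print([unicodedata.name(c) for c in composed])

# NFD goes the other way and splits the precomposed character again.
print(unicodedata.normalize("NFD", composed) == decomposed)
```

For this particular character the two forms convert back and forth cleanly; that is not true of every character.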
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 (ROMAN NUMERAL ONE) is really just the same thing as U+0049 (LATIN CAPITAL LETTER I) but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form:
>>> unicodedata.normalize('NFC', u'\u2167') # roman numeral VIII
u'\u2167'
>>> unicodedata.normalize('NFKC', u'\u2167') # roman numeral VIII
u'VIII'
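The same comparison can be run on a few other compatibility characters. This is only an illustrative sketch in Python 3; the sample characters and the `samples` and `results` names are choices made for this example, not part of the original answer:

```python
import unicodedata

# A few compatibility characters (chosen for illustration).
samples = {
    "roman numeral eight": "\u2167",  # ROMAN NUMERAL EIGHT
    "fi ligature": "\ufb01",          # LATIN SMALL LIGATURE FI
    "superscript two": "\u00b2",      # SUPERSCRIPT TWO
}

results = {}
for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)    # canonical only: left unchanged
    nfkc = unicodedata.normalize("NFKC", text)  # compatibility mapping applied
    results[label] = (nfc, nfkc)
    print(label, ascii(nfc), ascii(nfkc))
```

NFKC discards the distinction between, say, a superscript digit and a plain digit, so it is better suited to searching and comparison than to text that must be displayed unchanged.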
Note that composition and decomposition are not guaranteed to round-trip: decomposing a character to NFD form and then normalizing the result to NFC form does not always give back the original character. The Unicode standard maintains a list of exceptions; characters on this list are decomposed but never recomposed, for various reasons. Also see the documentation on the Composition Exclusion Table.
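A small sketch of one such exception, assuming Python 3. U+0958 (DEVANAGARI LETTER QA) is used here as an example of a character on the composition exclusion list; the variable names are illustrative:

```python
import unicodedata

original = "\u0958"  # DEVANAGARI LETTER QA, a composition-excluded character

# NFD splits it into DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA.
decomposed = unicodedata.normalize("NFD", original)

# NFC does not put the pair back together, because U+0958 is excluded
# from composition.
recomposed = unicodedata.normalize("NFC", decomposed)

print([unicodedata.name(c) for c in decomposed])
print(recomposed == original)    # the round trip does not restore the input
print(recomposed == decomposed)  # NFC leaves the decomposed pair as it is
```

In practice this means normalized strings should be compared with other strings normalized to the same form, rather than with the raw input.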
Answer by SLaks
Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.
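To see how the four forms differ, one option is to run the same string through each of them. The sketch below assumes Python 3; the sample string and the `forms` dictionary are made up for this illustration:

```python
import unicodedata

# "ﬁancé" written with the fi ligature (U+FB01) and a precomposed é (U+00E9).
text = "\ufb01anc\u00e9"

forms = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    forms[form] = unicodedata.normalize(form, text)
    # ascii() shows the exact code points instead of relying on rendering.
    print(form, len(forms[form]), ascii(forms[form]))
```

The K forms expand the ligature into "f" and "i", and the D forms split the accent off the "e", so the four results have different lengths even though they all render as roughly the same word.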

