Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2247205/

Time: 2020-11-04 00:10:19  Source: igfitidea

Python returning the wrong length of string when using special characters

python character-encoding

Asked by roflwaffle

I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

Is there any possible way in Python to have a character like ë́ be counted as 1?

I'm using UTF-8 encoding for the actual code and web page it is being outputted to.

edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a Native American language), and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (the letter itself and the surrounding letters) and other characteristics, such as accents and other diacritic markings.

Answered by tux21b

UTF-8 is a Unicode encoding that uses more than one byte for each non-ASCII character. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not the str object!).

Here are some examples:

>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified at the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
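
For readers on Python 3, the same distinction can be sketched roughly as follows. This is a minimal illustration rather than part of the original answer; it assumes Python 3, where str holds code points and bytes holds the encoded form, and the variable names are made up for the example.

```python
# Python 3 sketch: str counts code points, bytes counts encoded bytes.
# The string is written with escapes so the combining accent is explicit.
text = "\u00eb\u0301a\u00falt"   # ë + COMBINING ACUTE ACCENT + a + ú + l + t
encoded = text.encode("utf-8")   # the UTF-8 byte sequence

print(len(text))      # 6 code points
print(len(encoded))   # 9 bytes
print(encoded.decode("utf-8") == text)  # True: decoding round-trips
```

Note that the combining accent is still counted as its own code point here, so the length is 6 rather than 5.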

Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):

>>> test = u'ë́aúlt'
>>> print test[0]
ë

If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
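
The decode-on-input, encode-on-output principle might look something like the sketch below. It assumes Python 3, and the helper name first_code_points is hypothetical, chosen only for this illustration.

```python
def first_code_points(raw, count, encoding="utf-8"):
    """Hypothetical helper: keep the first `count` code points of encoded input."""
    text = raw.decode(encoding)      # decode at the input boundary
    shortened = text[:count]         # work on code points internally
    return shortened.encode(encoding)  # encode again at the output boundary

raw = b"\xc3\xab\xcc\x81a\xc3\xbalt"   # UTF-8 bytes of the question's string

print(first_code_points(raw, 3))

# Slicing the raw bytes instead can cut a multi-byte sequence in half:
try:
    raw[:3].decode("utf-8")
    sliced_ok = True
except UnicodeDecodeError:
    sliced_ok = False
print(sliced_ok)
```

Working on the decoded text keeps every slice on a code point boundary, whereas slicing the bytes directly can leave an incomplete sequence that no longer decodes.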

PS: Please note that the str and unicode datatypes have changed significantly in Python 3. In Python 3 there are only Unicode strings and plain byte strings, which can no longer be mixed. That should help to avoid common pitfalls with Unicode handling...

Regards, Christoph

Answered by bobince

The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

Yes. That's how code points are defined by Unicode. In general, you can ask Python to combine a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT into a single character using Unicode normalisation:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1'  # single character: á

However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.
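
The difference between the ‘u’ and ‘e’ cases can be checked with a short sketch like the one below. It assumes Python 3 and uses only unicodedata.normalize() and unicodedata.name(); the variable names are illustrative.

```python
import unicodedata

# Letter + COMBINING DIAERESIS + COMBINING ACUTE ACCENT, fully decomposed.
u_seq = "u\u0308\u0301"
e_seq = "e\u0308\u0301"

u_nfc = unicodedata.normalize("NFC", u_seq)
e_nfc = unicodedata.normalize("NFC", e_seq)

# 'u' composes all the way to one precomposed character (U+01D8).
print(len(u_nfc), [unicodedata.name(c) for c in u_nfc])
# 'e' stops at ë (U+00EB); the acute accent stays a separate code point.
print(len(e_nfc), [unicodedata.name(c) for c in e_nfc])
```

So NFC normalisation alone cannot bring the string in the question down to five code points.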

Working out where the ‘editable points’ are in a string of Unicode code points is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text, contextual glyph shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).

If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:

import unicodedata

def withoutcombining(s):
    # Keep only characters whose canonical combining class is 0 (non-combining)
    return u''.join(c for c in s if unicodedata.combining(c) == 0)

>>> withoutcombining(u'ë́aúlt')
u'\xeba\xfalt'  # ëaúlt
>>> len(_)
5
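
If the accents need to be kept rather than discarded (for example, for rewrite rules that depend on both position and diacritics), one possible approach is to group each base character with the combining marks that follow it. The sketch below assumes Python 3; the function name clusters is hypothetical, and the grouping is only an approximation that relies on unicodedata.combining(), not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: a character joins the previous group when its
    canonical combining class is non-zero.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # attach the mark to the previous base
        else:
            groups.append(ch)     # start a new group
    return groups

word = clusters("\u00eb\u0301a\u00falt")
print(len(word))                # 5 positions
print([len(g) for g in word])   # [2, 1, 1, 1, 1]
```

Indexing into the resulting list then gives one letter per position, with its diacritics still attached.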

Answered by Ignacio Vazquez-Abrams

The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.

Don't forget to use unicode and unicode literals in your code.
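
A minimal sketch of that decompose-and-filter idea, assuming Python 3 (where every str is already Unicode); the function name strip_accents is made up for this example:

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

plain = strip_accents("\u00eb\u0301a\u00falt")
print(plain)        # eault
print(len(plain))   # 5
```

Unlike filtering the combining characters without decomposing first, this also removes accents that were stored as part of a precomposed letter, such as the diaeresis in ë.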

Answered by Ignacio Vazquez-Abrams

Which Python version are you using? Python 3.1 doesn't have this issue.

>>> print(len("ë́aúlt"))
6

Regards Djoudi

Answered by John Machin

You said: I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.

"Exactly what is in your data": use the repr() built-in function (it is useful for lots more things apart from Unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.

You can use the unicodedata.name() function to tell you what each component is.

Here's an example:

# coding: utf8
import unicodedata
x = u"ë́aúlt"
print(repr(x))
for c in x:
    try:
        name = unicodedata.name(c)
    except ValueError:
        # unicodedata.name() raises ValueError for characters without a name
        name = "<no name>"
    print "U+%04X" % ord(c), repr(c), name

Results:

u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T

Now read @bobince's answer :-)
