Python UTF-8 comparison

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3400171/

Time: 2020-08-18 10:51:45  Source: igfitidea

Tags: python, unicode, utf-8, python-2.x

Asked by erkangur

a = {"a":"çö"}
b = "çö"
a['a']
>>> '\xc3\xa7\xc3\xb6'

b.decode('utf-8') == a['a']
>>> False

What is going on there?

Edit: I'm sorry, it was my mistake. It is still False. I'm using Python 2.6 on Ubuntu 10.04.

Accepted answer by Bolo

Possible solutions

Either write like this:

a = {"a": u"çö"}
b = "çö"
b.decode('utf-8') == a['a']

Or like this (you may also skip the .decode('utf-8') on both sides):

a = {"a": "çö"}
b = "çö"
b.decode('utf-8') == a['a'].decode('utf-8')

Or like this (my recommendation):

a = {"a": u"çö"}
b = u"çö"
b == a['a']
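
A minimal sketch of that recommendation, written for Python 3 (where str is text and bytes is the byte string) rather than the Python 2.6 from the question: decode byte strings once, at the edge of the program, and compare only text afterwards. The helper name to_text is hypothetical and not part of the original answer.

```python
# Python 3 sketch; `to_text` is a hypothetical helper, not a standard function.
def to_text(value, encoding="utf-8"):
    """Return text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b"\xc3\xa7\xc3\xb6"            # UTF-8 bytes for the two characters in the question
a = {"a": to_text(raw)}              # decoded once, on the way in
b = to_text("\u00e7\u00f6")          # already text, returned unchanged

same = (b == a["a"])                 # text compared with text
```

With both sides holding text, no implicit conversion is needed at comparison time.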

Explanation

Updated based on Tim's comment. In your original code, b.decode('utf-8') == u'çö' and a['a'] == 'çö', so you're actually making the following comparison:

u'çö' == 'çö'

One of the objects is of type unicode, the other is of type str, so in order to execute the comparison, the str is converted to unicode and then the two unicode objects are compared. It works fine in the case of purely ASCII strings, for example: u'a' == 'a', since unicode('a') == u'a'.

However, it fails in the case of u'çö' == 'çö', since unicode('çö') raises the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128). The whole comparison therefore returns False and issues the following warning: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
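
The implicit conversion described above can be imitated in Python 3, which never performs it on its own. This is only a sketch: py2_style_equal is a hypothetical helper that spells out the ASCII decode Python 2 attempts behind the scenes, and it returns False where Python 2 would issue the UnicodeWarning.

```python
# Python 3 sketch; `py2_style_equal` is a hypothetical helper for illustration.
def py2_style_equal(text, raw):
    """Mimic Python 2's comparison of a unicode object with a byte string."""
    try:
        # Python 2 implicitly decodes the byte string with the ASCII codec.
        return text == raw.decode("ascii")
    except UnicodeDecodeError:
        # Python 2 warns at this point and treats the two values as unequal.
        return False

utf8_bytes = b"\xc3\xa7\xc3\xb6"          # UTF-8 bytes from the question
text = utf8_bytes.decode("utf-8")         # the same two characters as text

ascii_case = py2_style_equal("a", b"a")             # pure ASCII: decode succeeds
non_ascii_case = py2_style_equal(text, utf8_bytes)  # byte 0xc3 is not ASCII
explicit_case = (text == utf8_bytes.decode("utf-8"))  # decode with the right codec
```

Only the explicit UTF-8 decode makes the non-ASCII comparison succeed, which is what the solutions above do.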

Answer by NullUserException

b is a string, a is a dict

You want (I believe):

b == a['a']

Answer by PaulMcG

Try b == a['a']

Answer by brennie

You are comparing a string to a dict.

>>> a = {"a":"çö"}
>>> b = "çö"
>>> a == b
False
>>> a['a'] == b
True

If you compare the string (b) to the member of a (a['a']), then you get the desired result.

Answer by Michael Dillon

UTF-8 is an encoding used to record Unicode text in files. However, in Python you are working with objects that have a fixed way to represent Unicode text, and that way is not UTF-8.

You can still compare Unicode strings in Python, but this is unrelated to UTF-8, except that if you want to put constants into these Unicode strings, then you will need to encode the text of the file containing your source code, in UTF-8. As soon as the assignment operator is executed, the string is no longer UTF-8, but is now the Python internal representation.
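
To illustrate the difference between the text object and its encodings, here is a small Python 3 sketch (an assumption on my part, since the thread is about Python 2). It encodes the same two characters with two codecs and shows that the byte lengths differ while the decoded text is identical.

```python
# Python 3 sketch: text is a sequence of code points; UTF-8 is one byte encoding of it.
text = "\u00e7\u00f6"                 # the two characters from the question

as_utf8 = text.encode("utf-8")        # two bytes per character here
as_latin1 = text.encode("latin-1")    # one byte per character

lengths = (len(text), len(as_utf8), len(as_latin1))

# Decoding with the matching codec recovers the same text either way.
round_trip_ok = as_utf8.decode("utf-8") == as_latin1.decode("latin-1") == text
```

The comparison of text objects does not depend on which encoding the bytes originally came in.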

By the way, if you are doing comparisons with Unicode, you probably will want to use the unicodedata module and normalize the strings before comparisons are done.
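
A short sketch of why normalization matters, using the standard unicodedata module in Python 3. The two strings below render the same way but are built from different code points; the variable names are only for illustration.

```python
import unicodedata

composed = "\u00e7\u00f6"             # precomposed c-cedilla and o-diaeresis
decomposed = "c\u0327o\u0308"         # base letters followed by combining marks

raw_equal = (composed == decomposed)  # code points differ, so this is False

# NFC composes each base letter and combining mark into the precomposed form.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Normalizing both sides to the same form (NFC here; NFD would work as well) before comparing avoids this kind of mismatch.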

Answer by Jason Scheirer

Make sure your code is in UTF-8 (NOT Latin-1) and/or use a coding line like so:

#! /usr/bin/python
# -*- coding: utf-8 -*-
# Both values are plain byte strings (str in Python 2), so they compare equal.
a = {"a": "çö"}
b = "çö"
assert b == a['a']
assert b.decode('utf-8') == a['a'].decode('utf-8')

If you're using unicode across the board, you can import unicode_literals from __future__ and cut back on encoding heartaches:

#! /usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
a = {"a": u"çö"}
b = "çö"
# Both literals are unicode objects now, so they compare equal directly.
assert b == a['a']
# An encoded byte string does not equal the unicode object...
assert b.encode('utf-8') != a['a']
# ...but it equals the other side once that is encoded the same way.
assert b.encode('utf-8') == a['a'].encode('utf-8')

If a file uses unicode_literals, all "strings" are now u"unicode" objects (per the coding of the file) if they're not b"prepended" with a b (to emulate the string/bytes split in Python 3.X).
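
For comparison, this is roughly the Python 3 behaviour that unicode_literals emulates. The sketch below assumes Python 3, so it needs no __future__ import; in Python 2 that import would have to be the first statement of the file.

```python
# Python 3 sketch: unprefixed literals are text, b-prefixed literals are bytes.
plain = "\u00e7\u00f6"                # unprefixed literal: text
prefixed = u"\u00e7\u00f6"            # the u prefix is accepted and means the same
raw = b"\xc3\xa7\xc3\xb6"             # b prefix: the UTF-8 bytes

literal_types = (type(plain).__name__, type(prefixed).__name__, type(raw).__name__)
text_vs_bytes = (plain == raw)                    # text and bytes are never equal
encoded_match = (plain.encode("utf-8") == raw)    # equal once both are bytes
```

The text and the bytes only compare equal after an explicit encode or decode.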

Answer by chryss

NullUserException is right that this should be correct:

b == a['a']

You're still getting "False" because you're decoding one side as UTF-8 (creating a Unicode string) while the other side remains a UTF-8 encoded byte string.