How can I compare a unicode type to a string in Python?
Source: http://stackoverflow.com/questions/16471332/
This content comes from Stack Overflow and is licensed under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.
Asked by rGil
I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:
us = u'MyString' # is the utf-8 string
Part one of my question: why does this return False?
us.encode('utf-8') == "MyString" ## False
Part two - how can I compare within a list comprehension?
myComp = [utfString for utfString in jsonLoadsObj
          if utfString.encode('utf-8') == "MyString"]  # wrapped to read on S.O.
EDIT: I'm using Google App Engine, which uses Python 2.7
Here's a more complete example of the problem:
import json

# JSON coming from a remote server;
# the response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()
# I need something like:
myList = [item for item in k if item == "number1"]
# I thought this would work:
myList = [item for item in k if item.encode('utf-8') == "number1"]
Accepted answer by Martijn Pieters
You must be looping over the wrong data set. Just loop directly over the JSON-loaded dictionary; there is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
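For readers on Python 3, where the u prefix is optional and json.loads returns str keys, the same filter can be sketched as follows. The response value is a made-up payload mirroring the one in the question, not real server output.

```python
import json

# Hypothetical payload, shaped like the response described in the question.
response = '{"number1": "first", "number2": "second"}'

data = json.loads(response)

# Iterating a dict yields its keys, so .keys() is not needed.
matches = [key for key in data if key == "number1"]

print(matches)  # ['number1']
```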
Note that in your first example, us is not a UTF-8 string; it is unicode data, which the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
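To make the distinction between text and encoded bytes concrete, here is a small Python 3 sketch. The sample word is an arbitrary choice for illustration; any non-ASCII text would show the same effect.

```python
# Text: a sequence of Unicode code points (str in Python 3, unicode in Python 2).
text = "caf\u00e9"

# Bytes: the same text after encoding with a specific codec.
encoded = text.encode("utf-8")

print(len(text))     # 4 code points
print(len(encoded))  # 5 bytes, because U+00E9 takes two bytes in UTF-8
print(encoded)       # b'caf\xc3\xa9'

# Decoding with the same codec recovers the original text.
roundtrip = encoded.decode("utf-8")
print(roundtrip == text)  # True
```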
On Python 2, your expectation that your test returns True would be correct; you must be doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
Answer by MattDMo
I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function returns a bytes object:
In [2]: us.encode('utf-8')
Out[2]: b'MyString'
In Python 3, strings are already Unicode, so the u prefix in u'MyString' is superfluous.
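A short sketch of the Python 3 behavior described in this answer: a bytes object never compares equal to a str, so the comparison has to be made bytes-to-bytes or text-to-text. The variable names are illustrative only.

```python
us = "MyString"                # already Unicode text on Python 3
encoded = us.encode("utf-8")   # bytes object

print(type(encoded).__name__)  # bytes

# bytes vs str: never equal on Python 3, even for pure ASCII content.
mixed = encoded == "MyString"
print(mixed)                   # False

# Compare like with like instead.
same_bytes = encoded == b"MyString"
same_text = encoded.decode("utf-8") == "MyString"
print(same_bytes, same_text)   # True True
```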
Answer by wberry
You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:
>>> u'MyString' == 'MyString' # in my opinion should be False
True
It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:
a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b # True
I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte strings can be decoded to Unicode that way.
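The asymmetry between encoding and decoding can be sketched on Python 3 as below. The lone surrogate is only one guess at what a "bizarre case" might look like, and it reflects Python 3 behavior, not necessarily Python 2.

```python
# Encoding ordinary text to UTF-8 succeeds.
encoded = "MyString".encode("utf-8")

# Decoding arbitrary bytes as UTF-8 can fail: 0x97 is not a valid start byte.
raw = b"Em dashes\x97are cool"
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# One possible "bizarre case" on Python 3: a lone surrogate cannot be
# encoded as UTF-8.
try:
    "\ud800".encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(decode_failed, encode_failed)  # True True
```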
But if you choose to UTF-8 encode the Unicode strings before comparing, the comparison will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you use .encode('Windows-1252') instead, it succeeds. That's why it's an apples and oranges comparison.
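The codec mismatch above can be reproduced on Python 3 with explicit bytes literals. This is a minimal sketch; the variable names are made up for illustration.

```python
text = "Em dashes\u2014are cool"
windows_bytes = b"Em dashes\x97are cool"  # em dash as stored in Windows-1252

# The same text yields different bytes under different codecs.
utf8_match = text.encode("utf-8") == windows_bytes
cp1252_match = text.encode("windows-1252") == windows_bytes

print(utf8_match)    # False: UTF-8 encodes U+2014 as b'\xe2\x80\x94'
print(cp1252_match)  # True: Windows-1252 encodes U+2014 as b'\x97'

# Decoding with the codec the bytes were written in recovers the text.
print(windows_bytes.decode("windows-1252") == text)  # True
```

The comparison is only meaningful once both sides agree on the codec, which is why the bytes have to be decoded (or the text encoded) with the encoding the data was actually written in.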

