How can I compare a unicode type to a string in python?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must keep it under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/16471332/

Time: 2020-08-18 22:43:53  Source: igfitidea

python unicode python-2.7 list-comprehension

Asked by rGil

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two - how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I'm using Google App Engine, which uses Python 2.7

Here's a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

I need something like:
myList = [item for item in k if item=="number1"]  

#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]

Accepted answer by Martijn Pieters

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]  

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]  

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is unicode data, and the json library has already decoded it for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference.
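
A minimal sketch of that difference, using an illustrative value (u'caf\xe9' is not from the question): the unicode object counts code points, while its UTF-8 encoding is a longer sequence of bytes that has to be decoded before it is text again.

```python
# Illustrative value only; any text with a non-ASCII character would do.
text = u'caf\xe9'               # unicode text: 4 code points
encoded = text.encode('utf-8')  # UTF-8 bytes: the e-acute takes 2 bytes

print(len(text))                        # 4
print(len(encoded))                     # 5
print(encoded.decode('utf-8') == text)  # True
```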

On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]

Answered by MattDMo

I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function is returning a bytes object:

In [2]: us.encode('utf-8')
Out[2]: b'MyString'

In Python 3, strings are already Unicode, so the u prefix in u'MyString' is superfluous.
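
If the code really were running on Python 3 (the question itself states Python 2.7), the comparison could be repaired from either side. A small sketch, assuming Python 3:

```python
# Assumes Python 3, where str is text and bytes is a separate type.
us = 'MyString'
encoded = us.encode('utf-8')

print(encoded == 'MyString')                  # False: bytes never equal str
print(encoded == b'MyString')                 # True: bytes compared with bytes
print(encoded.decode('utf-8') == 'MyString')  # True: text compared with text
```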

Answered by wberry

You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
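
The asymmetry can be sketched as follows. The byte string here is only an illustrative input: the lone byte \x97 is not valid UTF-8, so decoding raises an error while encoding the text goes through.

```python
# Encoding ordinary unicode text to UTF-8 succeeds ...
encoded = u'Em dashes\u2014are cool'.encode('utf-8')

# ... but arbitrary bytes are not necessarily valid UTF-8.
raw = b'Em dashes\x97are cool'
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)  # False
```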

But if you choose to UTF-8 encode the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you .encode('Windows-1252') instead, it would succeed. That's why it's an apples and oranges comparison.
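
A short sketch of that example, with the byte string written as a b'' literal so it behaves the same on Python 2.7 and Python 3. The name cp1252 is Python's codec name for Windows-1252.

```python
text = u'Em dashes\u2014are cool'
raw = b'Em dashes\x97are cool'   # the same sentence as Windows-1252 bytes

print(text.encode('utf-8') == raw)    # False: UTF-8 encodes the dash as 3 bytes
print(text.encode('cp1252') == raw)   # True: Windows-1252 encodes it as \x97
print(raw.decode('cp1252') == text)   # True: decoding with the right codec
```

Which codec is correct depends on where the bytes came from, so the choice has to be made by whoever knows the data source.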
