How can I compare a unicode type to a string in Python?

Question

Asked by rGil

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two - how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I'm using Google App Engine, which uses Python 2.7

Here's a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

# I need something like:
myList = [item for item in k if item == "number1"]

# I thought this would work:
myList = [item for item in k if item.encode('utf-8') == "number1"]

Answer 1

Accepted answer by Martijn Pieters

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary. There is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
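
The same check can also be written as a small script rather than an interactive session. This is only a minimal sketch: the names `response` and `matches` are illustrative, the payload is the sample from the question, and the `u` prefix is accepted by Python 2.7 as well as Python 3.3+.

```python
import json

# Sample payload standing in for the remote server's response (assumed).
response = '{"number1": "first", "number2": "second"}'

data = json.loads(response)  # keys and values come back as text, not bytes

# Iterating over a dict yields its keys, so .keys() is not needed.
matches = [key for key in data if key == u"number1"]
print(matches)
```

Because both sides of the comparison are text, no encode or decode step is involved.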

Note that in your first example, us is not a UTF-8 string; it is unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
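
To make the distinction between text and its UTF-8 encoding concrete, here is a short sketch. The sample word is an arbitrary choice, not something from the question; any string with a non-ASCII character would show the same thing.

```python
text = u"na\u00efve"             # five code points; \u00ef is i with diaeresis
encoded = text.encode("utf-8")   # bytes: the non-ASCII character takes two bytes

print(len(text))                        # counts code points
print(len(encoded))                     # counts bytes
print(encoded.decode("utf-8") == text)  # decoding restores the original text
```

The two lengths differ because one object holds characters and the other holds the bytes that represent them.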

On Python 2, your expectation that your test returns True would be correct; you must be doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]

Answer 2

Answer by MattDMo

I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() method returns a bytes object:

In [2]: us.encode('utf-8')
Out[2]: b'MyString'

In Python 3, strings are already Unicode, so the u prefix in u'MyString' is superfluous.
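
A short sketch of the Python 3 behavior described above, assuming a Python 3 interpreter run with default warning settings. The variable name `encoded` is illustrative.

```python
us = u'MyString'                # the u prefix is optional on Python 3.3+
encoded = us.encode('utf-8')    # a bytes object on Python 3

print(type(encoded))
print(encoded == 'MyString')    # bytes compared with text: not equal on Python 3
print(encoded == b'MyString')   # bytes compared with bytes: equal
print(us == 'MyString')         # text compared with text: equal
```

On Python 3 the fix is therefore to compare like with like: either skip the encode step or compare against a bytes literal.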

Answer 3

Answer by wberry

You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u''-style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte strings can be decoded to Unicode that way.
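
The asymmetry can be sketched as follows. The Latin-1 input is an assumed example chosen for illustration; it is not taken from the question.

```python
text = u'caf\u00e9'
utf8_bytes = text.encode('utf-8')   # encoding ordinary text as UTF-8 succeeds

latin1_bytes = b'caf\xe9'           # the same word encoded as Latin-1 (assumed input)
try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # 0xE9 opens a multi-byte UTF-8 sequence that the data never completes
    decoded_ok = False

print(utf8_bytes == b'caf\xc3\xa9')
print(decoded_ok)
```

Encoding the text side avoids this exception, although it still only gives the right answer when the bytes really are UTF-8.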

But if you choose to UTF-8 encode the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you use .encode('Windows-1252') instead, it succeeds. That's why it's an apples and oranges comparison.
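
The Windows-1252 case can be reproduced with a few lines. This sketch uses a `b''` literal for the byte string so that it behaves the same on Python 2.7 and Python 3; the variable names are illustrative.

```python
text = u'Em dashes\u2014are cool'
windows_bytes = b'Em dashes\x97are cool'   # the same sentence stored as Windows-1252

print(text.encode('utf-8') == windows_bytes)          # encodings differ, so not equal
print(text.encode('windows-1252') == windows_bytes)   # same encoding, so equal
print(windows_bytes.decode('windows-1252') == text)   # decoding with the right codec also works
```

The comparison is only meaningful once you know which encoding produced the bytes.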
