How to check whether a string is Unicode or ASCII in Python?

Question

Asked by TIMEX

What do I have to do in Python to figure out which encoding a string has?

Answer 1

Accepted answer by Greg Hewgill

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
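
For Python 3, the same type check could be sketched roughly as follows. The function name describe and the returned labels are illustrative assumptions, not part of the original answer; treating bytearray as a byte string is also a choice made here.

```python
def describe(value):
    """Return a short label for the Python 3 type of value."""
    if isinstance(value, str):
        # str always holds Unicode text in Python 3
        return "text string (str)"
    if isinstance(value, (bytes, bytearray)):
        # raw bytes: may be ASCII, encoded text, or non-textual data
        return "byte string"
    return "not a string"


print(describe("abc"))
print(describe(b"abc"))
print(describe(42))
```

As with the Python 2 version, this only reports the Python type; it says nothing about which encoding a byte string uses.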

Answer 2

Answered by Mikel

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.

How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
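
The try-and-see approach above can be wrapped in a small helper. This is a minimal Python 3 sketch; the name can_decode is an assumption introduced here for illustration.

```python
def can_decode(data, encoding):
    """Return True if the byte string decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 bytes for 'Ü'
print(can_decode(u_umlaut, "utf-8"))  # True
print(can_decode(u_umlaut, "ascii"))  # False
```

A successful decode only shows that the bytes are valid in that encoding, not that they were written in it. For instance, latin-1 accepts any byte sequence, so can_decode(data, "latin-1") is always True.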

Answer 3

Answered by Seb

You could use Universal Encoding Detector, but be aware that it will only give you a best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc". You will need to get the encoding information from elsewhere; for example, HTTP uses the Content-Type header for that.
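
To see why the bytes alone cannot settle the question, here is a small Python 3 sketch using only the standard library. The list of candidate encodings is an arbitrary choice for illustration.

```python
data = b"abc"
candidates = ["ascii", "utf-8", "latin-1", "cp1252"]

# Every candidate decodes b"abc" to the same text, so the bytes
# carry no evidence about which encoding produced them.
decoded = {enc: data.decode(enc) for enc in candidates}
print(decoded)

# With non-ASCII bytes, several encodings may still succeed,
# but they disagree about the resulting text.
ambiguous = b"\xc3\x9c"
print(ambiguous.decode("utf-8"))   # Ü
print(ambiguous.decode("cp1252"))  # Ãœ
```

Both decodes of the second byte string succeed without error, which is why a detector can only rank guesses and why out-of-band information such as a Content-Type header is needed.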

Answer 4

Answered by Alex Dean

Unicode is not an encoding - to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...
...then Unicode is "text-ness";
it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.

Answer 5

Answered by Dave Burton

If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3 and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there's probably a better way.
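
One possibly tidier option is to resolve the version difference once, at import time, with try/except NameError, and then use ordinary isinstance checks. This is only a sketch: the names text_type, binary_type, and to_native_str are assumptions introduced here, and the "ascii" default simply mirrors the snippets above.

```python
try:
    text_type = unicode  # Python 2: the separate Unicode text type
except NameError:
    text_type = str      # Python 3: str is already Unicode text

binary_type = bytes      # Python 3 bytes; an alias for str on Python 2.6+


def to_native_str(s, encoding="ascii"):
    """Convert bytes (Python 3) or unicode (Python 2) to the native str type."""
    if isinstance(s, str):
        return s                   # already the native string type
    if isinstance(s, text_type):
        return s.encode(encoding)  # only reachable on Python 2 (unicode)
    if isinstance(s, binary_type):
        return s.decode(encoding)  # only reachable on Python 3 (bytes)
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b"abc"))
print(to_native_str("abc"))
```

The version-specific name lookup happens in one place, so the rest of the code never mentions unicode directly.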

Answer 6

Answered by ThinkBonobo

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check for str (which means Unicode string by default) should suffice.

isinstance(x, str)

With regard to Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check whether you have a 'string-like' object with a single statement, though, you can do the following (Python 2 only, since basestring does not exist in Python 3):

isinstance(x, basestring)
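
Python 3 has no basestring, so a rough equivalent there is to pass a tuple of types to isinstance. The helper name is_string_like is an assumption for illustration, and whether bytes should count as "string-like" depends on the application.

```python
def is_string_like(x):
    """Return True for Python 3 text (str) or byte (bytes) strings."""
    return isinstance(x, (str, bytes))


print(is_string_like("abc"))   # True
print(is_string_like(b"abc"))  # True
print(is_string_like(123))     # False
```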

Answer 7

Answered by Veedrac

Note that on Python 3, it's not really fair to say any of:

strs are UTFx for any x (e.g. UTF-8)
strs are Unicode
strs are ordered collections of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, it's not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.
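
Wrapped as a predicate, the attempted encode might look like the sketch below. The name is_ascii is an assumption introduced here; on Python 3.7 and later the built-in str.isascii() method gives the same answer without the try/except.

```python
def is_ascii(text):
    """Return True if every code point in text is in the ASCII range."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("Hello there!"))            # True
print(is_ascii("Hello there... \u2603!"))  # False (U+2603 is a snowman)

# Python 3.7+ only: the built-in equivalent
print("Hello there!".isascii())            # True
```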

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method to distinguish them is used.
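
One common way such strings arise is the surrogateescape error handler, which smuggles undecodable bytes into a str as lone surrogates. The sketch below shows that round trip; the helper name is_encodable_utf8 is an assumption for illustration.

```python
def is_encodable_utf8(text):
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = b"\xc3"  # a lone lead byte: not valid UTF-8 on its own
text = raw.decode("utf-8", errors="surrogateescape")

print(ascii(text))              # '\udcc3'
print(is_encodable_utf8(text))  # False

# Encoding with the same handler restores the original byte.
print(text.encode("utf-8", errors="surrogateescape") == raw)  # True
```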

Answer 8

Answered by jfl

This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend for it to be Python-version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8
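
If only Python 3 needs to be supported, the same idea can be written with explicit type checks. This is a simplified sketch rather than a drop-in replacement for return_utf; the name to_utf8_bytes is an assumption, and bytes input is passed through without verifying that it really is UTF-8.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (Python 3 only)."""
    if isinstance(value, bytes):
        return value                   # assumed to be UTF-8 already; not checked
    if isinstance(value, str):
        return value.encode("utf-8")
    return str(value).encode("utf-8")  # numbers and other objects via str()


print(to_utf8_bytes("\u00dc"))  # b'\xc3\x9c'
print(to_utf8_bytes(3.5))       # b'3.5'
print(to_utf8_bytes(b"raw"))    # b'raw'
```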

Answer 9

Answered by madjardi

Use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string

Inside the six library, it is defined as:

if PY3:
    string_types = str,
    text_type = str
else:
    string_types = basestring,
    text_type = unicode

Answer 10

Answered by Vishvajit Pathak

For py2/py3 compatibility, simply use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string
