Getting bytes from a unicode string in Python

Question

Asked by altunyurt

I have a 16-bit big-endian unicode string represented as u'\u4132'.

How can I split it into the integers 41 and 32 in Python?

Answer 1

Accepted answer by Chris Morgan

Here are several ways to do it, depending on the form you want.

Python 2:

>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']

Python 3:

>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars  # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']
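
As a compact follow-up to the Python 3 session above, here is a minimal sketch that wraps the same encode('utf-16be') call in a helper. The name utf16be_bytes is hypothetical and not part of the answer. Note that the 41 and 32 in the question are hexadecimal digits, so the corresponding integer values are 65 and 50.

```python
def utf16be_bytes(text):
    """Return the UTF-16BE bytes of text as a list of integers (Python 3)."""
    # Iterating over a bytes object in Python 3 yields ints directly.
    return list(text.encode('utf-16be'))


values = utf16be_bytes('\u4132')
print(values)                              # [65, 50]
print([format(b, '02x') for b in values])  # ['41', '32']
```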

Answer 2

Answer by Roland Illig

Java: "\u4132".getBytes("UTF-16BE")
Python 2: u'\u4132'.encode('utf-16be')
Python 3: '\u4132'.encode('utf-16be')

These methods return a byte array, which you can easily convert to an int array. Note, however, that code points above U+FFFF are encoded using two code units (so in UTF-16BE they take 32 bits, or 4 bytes).
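
To illustrate the point about code units, the following sketch (Python 3 assumed) unpacks the UTF-16BE bytes into 16-bit integers with the standard struct module. The helper name utf16_code_units is made up for this example. A BMP character yields one code unit, while a character above U+FFFF yields a surrogate pair.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as integers."""
    data = text.encode('utf-16be')
    # Each code unit is one big-endian unsigned 16-bit value ('H').
    return list(struct.unpack('>%dH' % (len(data) // 2), data))


bmp = utf16_code_units('\u4132')          # one code unit
astral = utf16_code_units('\U0001F600')   # surrogate pair, two code units
print([hex(u) for u in bmp])      # ['0x4132']
print([hex(u) for u in astral])   # ['0xd83d', '0xde00']
```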

Answer 3

Answer by Ivo Wetzel

"Those" aren't integers; 4132 is a hexadecimal number that represents the code point.

If you want an integer representation of the code point, use ord(u'\u4132'). To convert that integer back to the unicode character, use unichr() (chr() in Python 3), which returns a unicode string.
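
A minimal sketch of this approach, assuming Python 3 (where chr() plays the role of Python 2's unichr()): take the code point with ord() and split it into a high and a low byte with divmod(). The variable names are illustrative only, and the two-byte split is only meaningful for code points up to U+FFFF.

```python
code_point = ord('\u4132')           # 16690, i.e. 0x4132
high, low = divmod(code_point, 256)  # split into high and low byte
print(code_point)                    # 16690
print(hex(high), hex(low))           # 0x41 0x32
print(chr(code_point) == '\u4132')   # True
```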

Answer 4

Answer by seriyPS

Dirty hack: repr(u'\u4132') will return "u'\\u4132'"

Answer 5

Answer by jfs

>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'

Answer 6

Answer by Danilo Souza Morães

Pass the unicode character to ord() to get its code point, break that code point into individual bytes with int.to_bytes(), and then format the output however you want:

list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, encoding to UTF-16 will not work as expected for code points outside the BMP (Basic Multilingual Plane), since UTF-16 needs a surrogate pair to encode those code points.
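
One possible variation, sketched here under the assumption of Python 3.5 or later: rather than always padding to 4 bytes, compute the smallest byte length from int.bit_length(). The helper name code_point_bytes is hypothetical. The last line shows, for comparison, the surrogate pair that UTF-16BE produces for the same character.

```python
def code_point_bytes(char):
    """Return the big-endian bytes of char's code point, without padding."""
    cp = ord(char)
    # Smallest byte count that holds the value; at least one byte for U+0000.
    length = max(1, (cp.bit_length() + 7) // 8)
    return cp.to_bytes(length, 'big')


poo = '\N{PILE OF POO}'                         # U+1F4A9
print(code_point_bytes('\u4132').hex())         # 4132
print(code_point_bytes(poo).hex())              # 01f4a9
print(poo.encode('utf-16be').hex())             # d83ddca9 (surrogate pair)
```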
