Original question: http://stackoverflow.com/questions/4239666/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
getting bytes from unicode string in python
Asked by altunyurt
I have a 16-bit big-endian Unicode string represented as u'\u4132'. How can I split it into the integers 41 and 32 in Python?
Accepted answer by Chris Morgan
Here are a variety of ways you might want it.
Python 2:
>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']
Python 3:
>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']
Answer by Roland Illig
- Java: "\u4132".getBytes("UTF-16BE")
- Python 2: u'\u4132'.encode('utf-16be')
- Python 3: '\u4132'.encode('utf-16be')
These methods return a byte array, which you can easily convert to an int array. Note, however, that code points above U+FFFF are encoded using two code units (so with UTF-16BE this means 32 bits, or 4 bytes).
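The byte-to-int conversion and the four-byte case can be sketched as follows. This is a minimal illustration assuming Python 3, where iterating over a bytes object already yields integers; the helper name utf16be_bytes is hypothetical and does not come from the original answer.

```python
def utf16be_bytes(text):
    """Return the UTF-16BE encoding of text as a list of integers (0-255)."""
    return list(text.encode('utf-16be'))

bmp = utf16be_bytes('\u4132')         # inside the BMP: one 16-bit code unit
astral = utf16be_bytes('\U0001F4A9')  # above U+FFFF: two 16-bit code units

print([hex(b) for b in bmp])   # ['0x41', '0x32']
print(len(bmp), len(astral))   # 2 4
```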
Answer by Ivo Wetzel
"Those" aren't integers; 4132 is a hexadecimal number that represents the code point.
If you want an integer representation of the code point, use ord(u'\u4132'). To convert that integer back to the Unicode character, use unichr() (chr() in Python 3), which returns a unicode string.
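A minimal sketch of that round trip, assuming Python 3, where the inverse of ord() is chr() (Python 2 spells it unichr()). The divmod step is an added illustration of how the two byte values from the question fall out of the integer; it only makes sense for code points up to U+FFFF.

```python
code_point = ord('\u4132')
print(code_point)        # 16690
print(hex(code_point))   # 0x4132

# Split the 16-bit value into its high and low bytes.
high, low = divmod(code_point, 0x100)
print(hex(high), hex(low))   # 0x41 0x32

# chr() turns the integer back into the one-character string.
restored = chr(code_point)
print(restored == '\u4132')  # True
```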
Answer by seriyPS
Dirty hack: repr(u'\u4132') will return "u'\\u4132'"
Answer by jfs
>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'
Answer by Danilo Souza Morães
Pass the Unicode character to ord() to get its code point, break that code point into individual bytes with int.to_bytes(), and then format the output however you want:
list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))
returns: ['0', '0', '41', '32']
list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))
returns: ['0', '1', 'f4', 'a9']
As I mentioned in another comment, encoding the code point to UTF-16 will not work as expected for code points outside the BMP (Basic Multilingual Plane), since UTF-16 needs a surrogate pair to encode those code points.
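To make the difference concrete, here is a small sketch, assuming Python 3.5 or later (for bytes.hex()), that compares the two approaches for a character outside the BMP. The use of struct.unpack to read the 16-bit code units is an illustrative choice, not part of the original answer.

```python
import struct

ch = '\N{PILE OF POO}'  # U+1F4A9, outside the BMP

# UTF-16BE yields a surrogate pair: two big-endian 16-bit code units.
encoded = ch.encode('utf-16be')
units = struct.unpack('>2H', encoded)
print([hex(u) for u in units])   # ['0xd83d', '0xdca9']

# int.to_bytes() yields the bytes of the code point itself.
raw = ord(ch).to_bytes(4, 'big')
print(raw.hex())                 # 0001f4a9
```

The surrogate pair bytes (d8 3d dc a9) are not the bytes of the code point (00 01 f4 a9), which is why the ord() route is the safer choice when the goal is the code point's own value.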

