Decoding double-encoded UTF-8 in Python
Original question: http://stackoverflow.com/questions/1177316/
Source: Stack Overflow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Chris Ciesielski
I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me utf8 strings that are encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now until he fixes it.
Raw string from tcp dump:
<string>Rafa\xc3\x85\xc2\x82</string>
This is converted into:
u'Rafa\xc5\x82'
The best we get is:
eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8")
This results in the correct string, which is:
u'Rafa\u0142'
This works, but it is ugly as hell and cannot be used in production code. If anyone knows how to fix this problem in a more suitable way, please write. Thanks, Chris
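For readers who want to reproduce the symptom, here is a minimal Python 3 sketch of one way the double encoding could arise. It assumes the client's UTF-8 bytes were mistaken for Latin-1 text and then encoded again; the thread does not confirm that this is what the client actually does, and the helper name `double_encode` is hypothetical.

```python
# Python 3 sketch. ASSUMPTION: the sender misreads its UTF-8 bytes as
# Latin-1 text before encoding a second time. double_encode is a
# hypothetical helper, not something from the original thread.
def double_encode(text: str) -> bytes:
    once = text.encode("utf-8")        # correct UTF-8 bytes
    misread = once.decode("latin-1")   # each byte becomes one character
    return misread.encode("utf-8")     # encoded a second time


wire = double_encode("Rafa\u0142")
print(wire)                      # b'Rafa\xc3\x85\xc2\x82'

# A single UTF-8 decode on the receiving side leaves the mojibake:
received = wire.decode("utf-8")
print(ascii(received))           # 'Rafa\xc5\x82'
```

Under that assumption the bytes match the tcp dump in the question, and the once-decoded value matches the unicode object the asker ends up with.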
Answered by Ivan Baldin
>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>
Answered by RichieHindle
Yow, that was fun!
>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'
So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 string, which you then decode as normal.
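The session above is Python 2. A rough Python 3 equivalent of the same steps might look like the sketch below; the function name `undo_double_encoding` is made up for illustration, and the input is assumed to arrive as raw bytes.

```python
# Python 3 sketch of the decode / ord() / decode steps described above.
# undo_double_encoding is a hypothetical name, not from the thread.
def undo_double_encoding(raw: bytes) -> str:
    # First decode: yields a str whose characters are really byte values.
    first_decode = raw.decode("utf-8")
    # Go via the integer value of each character to rebuild real bytes.
    # bytes() raises ValueError if any code point is above 255.
    as_bytes = bytes(ord(ch) for ch in first_decode)
    # Second decode: the genuine UTF-8 decode.
    return as_bytes.decode("utf-8")


result = undo_double_encoding(b"Rafa\xc3\x85\xc2\x82")
print(ascii(result))   # 'Rafa\u0142'
```

The ValueError on code points above 255 is a useful side effect: input that was not double encoded in this way fails loudly instead of being silently mangled.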
Answered by John Machin
>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>
latin1 is just an abbreviation for Richie's nuts'n'bolts method.
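To illustrate why the two approaches agree, here is a small Python 3 sketch (the variable names are illustrative only). Latin-1 maps each code point from 0 to 255 to the single byte with the same value, which is what the chr(ord(x)) loop does by hand.

```python
# Python 3 sketch: Latin-1 encoding is the same as mapping every
# character to the byte with the same integer value.
for code_point in range(256):
    assert chr(code_point).encode("latin-1") == bytes([code_point])

weird = "Rafa\xc5\x82"
via_ord = bytes(ord(ch) for ch in weird)   # the nuts-and-bolts route
via_latin1 = weird.encode("latin-1")       # the shorthand
assert via_ord == via_latin1 == b"Rafa\xc5\x82"

print(ascii(via_latin1.decode("utf-8")))   # 'Rafa\u0142'
```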
It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82' and then to reverse that process exactly; otherwise we might come unstuck if different data crops up before the double encoding is fixed.
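As a partial answer to the question above, the following Python 3 sketch suggests the two codecs do not always agree. As far as I can tell from CPython's behavior, they produce identical bytes for code points up to 255, but above that latin1 raises an error while raw_unicode_escape writes a literal \uXXXX escape. The sample strings are chosen purely for illustration.

```python
# Python 3 sketch comparing the two codecs.
# Code points 0..255: both codecs emit the same single byte.
low = "Rafa\xc5\x82"
assert low.encode("raw_unicode_escape") == low.encode("latin-1")

# Code points above 255: the codecs diverge.
high = "Rafa\u0142"
print(high.encode("raw_unicode_escape"))   # b'Rafa\\u0142'
try:
    high.encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 refuses:", exc.reason)
```

So if a string that is already correct slips through, the latin1 route fails loudly, whereas the raw_unicode_escape route would quietly turn the character into backslash-escape text. That may be a reason to prefer latin1 for this workaround.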