Decoding double-encoded utf8 in Python

Question

Asked by Chris Ciesielski

I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me utf8 strings that are encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now until he fixes it.

Raw string from tcp dump:

<string>Rafa\xc3\x85\xc2\x82</string>

This is converted into:

u'Rafa\xc5\x82'

The best we get is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8")

This results in the correct string, which is:

u'Rafa\u0142'

This works, but it is ugly as hell and cannot be used in production code. If anyone knows how to fix this problem in a more suitable way, please write. Thanks, Chris

Answer 1

Answered by Ivan Baldin

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>

Answer 2

Answered by RichieHindle

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 string, which you then decode as normal.
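
For readers on Python 3, the same steps might look roughly like the sketch below. It assumes the wire data arrives as a `bytes` object; the variable names are only illustrative, and the answer above was written for Python 2.

```python
# Python 3 sketch of the step-by-step approach (the original is Python 2).
raw = b"Rafa\xc3\x85\xc2\x82"  # bytes as they appear in the TCP dump

# First decode: a str in which every character is really a UTF-8 byte value.
first_decode = raw.decode("utf-8")

# Go via the integer value of each character to rebuild the genuine UTF-8 bytes.
# bytes() raises ValueError if any character is above 255.
as_bytes = bytes(ord(ch) for ch in first_decode)

# Second decode: the text the client meant to send.
result = as_bytes.decode("utf-8")

print(ascii(first_decode))  # 'Rafa\xc5\x82'
print(ascii(result))        # 'Rafa\u0142'
```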

Answer 3

Answered by John Machin

>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>

latin1 is just an abbreviation for Richie's nuts'n'bolts method.
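
A small Python 3 sketch of that equivalence, for illustration only: it rebuilds the bytes by hand and via the `latin1` codec and compares the two. `fix_double_utf8` is a hypothetical helper name, and returning the input unchanged on failure is an assumption about what a quick workaround might want, not something the answer prescribes.

```python
# Python 3 sketch: latin1 maps code points 0-255 straight to byte values 0-255.
weird = "Rafa\xc5\x82"

by_hand = bytes(ord(ch) for ch in weird)  # the nuts'n'bolts route
via_codec = weird.encode("latin1")        # the one-call route

print(by_hand == via_codec)               # True
print(ascii(via_codec.decode("utf-8")))   # 'Rafa\u0142'


def fix_double_utf8(text):
    """Hypothetical helper: undo one extra UTF-8 layer, else return text as-is."""
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Assumption: text that does not fit the pattern was not double encoded.
        return text
```

With this sketch, `fix_double_utf8("Rafa\xc5\x82")` gives `'Rafa\u0142'`, while a string that already contains `\u0142` comes back untouched because it cannot be encoded as latin1.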

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82' and then to reverse that process exactly; otherwise we might come unstuck if different data crops up before the double encoding is fixed.
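
As a partial answer to the question above, here is a Python 3 sketch (the thread itself uses Python 2) suggesting the two codecs agree only for code points up to U+00FF. Above that, `latin1` refuses to encode, while `raw_unicode_escape` writes a literal `\uXXXX` escape. The variable names are illustrative.

```python
# Python 3 sketch comparing latin1 and raw_unicode_escape when encoding.

# Code points 0-255: both codecs emit exactly one byte per character.
low = "".join(chr(i) for i in range(256))
same_below_256 = low.encode("latin1") == low.encode("raw_unicode_escape")

# Above U+00FF the behaviour diverges.
ch = "\u0142"
escaped = ch.encode("raw_unicode_escape")  # six ASCII bytes spelling \u0142

try:
    ch.encode("latin1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(same_below_256)  # True
print(escaped)         # b'\\u0142'
print(latin1_failed)   # True
```

So for data that is not purely double-encoded, `latin1` fails loudly, whereas `raw_unicode_escape` quietly produces escape text that then decodes as ordinary ASCII.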
