UTF-8 latin-1 conversion issues, python django
Original question: http://stackoverflow.com/questions/274361/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by jacob
OK, so my issue is that I have the string '\222\222\223\225', which is stored as latin-1 in the db. What I get from django (by printing it) is the following string, 'ââââ¢', which I assume is the UTF conversion of it. Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error:
chr() arg not in range(256)
If I try to encode the string as latin-1 first, I get this error:
'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
I have read a bunch on how character encoding works, and there is something I am missing, because I just don't get it!
Answered by Vinko Vrsalovic
Your first error 'chr() arg not in range(256)' probably means you have underflowed the value, because chr cannot take negative numbers. I don't know what the encryption algorithm is supposed to do when the inputcounter + 33 is more than the actual character representation; you'll have to check what to do in that case.
About the second error: you must decode(), not encode(), a regular string object to get a proper representation of your data. encode() takes a unicode object (those starting with u') and generates a regular string to be output or written to a file. decode() takes a string object and generates a unicode object with the corresponding code points. This is done with the unicode() call when generating from a string object; you could also call a.decode('latin-1') instead.
>>> a = '\222\222\223\225'
>>> u = unicode(a,'latin-1')
>>> u
u'\x92\x92\x93\x95'
>>> print u.encode('utf-8')
????
>>> print u.encode('utf-16')
?t
>>> print u.encode('latin-1')
>>> for c in u:
...     print chr(ord(c) - 3 - 0 -30)
...
q
q
r
t
>>> for c in u:
...     print chr(ord(c) - 3 -200 -30)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: chr() arg not in range(256)
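For readers on Python 3 (an assumption; the session above is Python 2, where unicode() and the print statement exist), the same decode/encode distinction can be sketched roughly as follows. The string containing characters above U+00FF is hypothetical, chosen only to reproduce the kind of error quoted in the question; whether the asker's data actually looked like this is not known.

```python
# Python 3 sketch: bytes play the role of Python 2's str,
# and str plays the role of Python 2's unicode.
raw = b'\x92\x92\x93\x95'  # the same bytes as '\222\222\223\225'

# decode(): bytes -> text. Latin-1 maps every byte to the code point
# with the same number, so decoding as Latin-1 never fails.
text = raw.decode('latin-1')
print([hex(ord(c)) for c in text])  # ['0x92', '0x92', '0x93', '0x95']

# encode(): text -> bytes. The Latin-1 round trip is lossless.
assert text.encode('latin-1') == raw

# Encoding to Latin-1 fails once the text holds code points above 255.
# This sample string is hypothetical, not taken from the question.
wide = '\u2019\u2019\u201c\u2022'
try:
    wide.encode('latin-1')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
    print(exc)
```

If the text handed over by the framework already contains code points above 255, encoding it as Latin-1 cannot succeed, no matter how the bytes were stored originally.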
Answered by Jonathan Leffler
As Vinko notes, Latin-1 or ISO 8859-1 doesn't have printable characters for the octal string you quote. According to my notes for 8859-1, "C1 Controls (0x80 - 0x9F) are from ISO/IEC 6429:1992. It does not define names for 80, 81, or 99". The code point names are as Vinko lists them:
\222 = 0x92 => PRIVATE USE TWO
\223 = 0x93 => SET TRANSMIT STATE
\225 = 0x95 => MESSAGE WAITING
The correct UTF-8 encoding of those is (Unicode, binary, hex):
U+0092 = %11000010 %10010010 = 0xC2 0x92
U+0093 = %11000010 %10010011 = 0xC2 0x93
U+0095 = %11000010 %10010101 = 0xC2 0x95
The LATIN SMALL LETTER A WITH CIRCUMFLEX is ISO 8859-1 code 0xE2 and hence Unicode U+00E2; in UTF-8, that is %11000011 %10100010 or 0xC3 0xA2.
The CENT SIGN is ISO 8859-1 code 0xA2 and hence Unicode U+00A2; in UTF-8, that is %11000010 %10100010 or 0xC2 0xA2.
So, whatever else you are seeing, you do not seem to be seeing a UTF-8 encoding of ISO 8859-1. All else apart, you are seeing only 5 bytes where you would have to see 8.
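The byte sequences discussed above can be checked directly. This is a minimal sketch assuming Python 3 (the thread itself uses Python 2); the variable names are illustrative only.

```python
# Python 3 sketch: print the UTF-8 bytes for each code point discussed.
c1_controls = '\x92\x92\x93\x95'  # the question's string, read as Latin-1

for ch in ['\x92', '\x93', '\x95', '\xe2', '\xa2']:
    encoded = ch.encode('utf-8')
    hex_bytes = ' '.join('0x%02X' % b for b in encoded)
    print('U+%04X = %s' % (ord(ch), hex_bytes))

# Every code point from U+0080 to U+07FF takes two bytes in UTF-8,
# so the four-character string becomes eight bytes.
utf8_bytes = c1_controls.encode('utf-8')
print(len(utf8_bytes))  # 8
```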
Added: The previous part of the answer addresses the 'UTF-8 encoding' claim, but ignores the rest of the question, which says:
Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error: chr() arg not in range(256). If I try to encode the
string as Latin-1 first I get this error: 'latin-1' codec can't encode
characters in position 0-3: ordinal not in range(256).
You don't actually show us how intCounter is defined, but if it increments gently per character, sooner or later 'ord(c) - 3 - intCounter - 30' is going to be negative (and, by the way, why not combine the constants and use 'ord(c) - intCounter - 33'?), at which point chr() is likely to complain. You would need to add 256 if the value is negative, or use a modulus operation to ensure you have a value between 0 and 255 to pass to chr(). Since we can't see how intCounter is incremented, we can't tell if it cycles from 0 to 255 or whether it increases monotonically. If the latter, then you need an expression such as:
chr(mod(ord(c) - mod(intCounter, 256) + 479, 256))
where 256 - 33 = 223, of course, and 479 = 256 + 223. This guarantees that the value passed to chr() is non-negative and in the range 0..255 for any input character c and any value of intCounter (and, because the mod() function never gets a negative argument, it also works regardless of how mod() behaves when its arguments are negative).
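As a rough Python 3 sketch of that wrap-around idea: the function name decrypt_char and the use of the character index as the counter are assumptions for illustration, since the question never shows how intCounter is maintained. The modulus is taken as 256 so that every value in 0..255 stays reachable.

```python
def decrypt_char(c, int_counter):
    """Shift one character back by int_counter + 33, wrapping within 0..255."""
    # 479 = 256 + 223, where 223 = 256 - 33. Adding it keeps the left
    # operand of the final % non-negative for any ord(c) in 0..255.
    return chr((ord(c) - (int_counter % 256) + 479) % 256)

# Without wrapping, the value can drop below zero and chr() raises ValueError.
try:
    chr(ord('\x92') - 3 - 200 - 30)
    underflow_error = None
except ValueError as exc:
    underflow_error = exc

# Hypothetical usage: the counter is simply the character's position.
decrypted = ''.join(
    decrypt_char(c, i) for i, c in enumerate('\x92\x92\x93\x95')
)
```

In Python specifically, % with a positive modulus already returns a non-negative result, so (ord(c) - int_counter - 33) % 256 would behave the same; the added constant only matters in languages where the remainder can be negative.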
Answered by jacob
Well, it's because it's been encrypted with some terrible scheme that just changes the ord() of the character by some request, so the string coming out of the database has been encrypted and this decrypts it. What you supplied above does not seem to work. In the database it is latin-1, and django converts it to unicode, but I cannot pass it to the function as unicode, and when I try to encode it to latin-1 I see that error.