Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/26619801/

Time: 2020-10-21 01:40:13  Source: igfitidea

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

python postgresql python-2.7 encoding utf

Asked by user3422637

I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.

When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.

I figured out that my Postgres database has UTF8 encoding.

The file/StringIO object I am writing my data into shows its encoding as the following: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

I tried to encode every string that I am writing to the intermediate file/StringIO object as UTF-8. To do this, I used .encode(encoding='UTF-8', errors='strict') on every string.

This is the error I got now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

What does it mean? How do I fix it?

EDIT: I am using Python 2.7. Some pieces of my code:

I read from a MySQL database whose data is encoded in UTF-8, according to MySQL Workbench. These are a few lines of code that write my data (obtained from the MySQL db) to the StringIO object:

# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :

    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")

    col_num=0
    for cell in row:    
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t") 

        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1

    row_num = row_num+1   

This is the code that writes to Postgres database from my StringIO object table_data:

cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)

Answered by abarnert

The problem is that you're calling encode on a str object.

In Python 2, a str is a byte string, usually representing text encoded in some way like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
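
To make that implicit step visible, here is a minimal sketch that spells out the decode-then-encode sequence by hand. The helper name encode_like_python2 and the sample bytes are made up for illustration; the byte 0x92 is placed at index 47 only to mirror the error message in the question.

```python
# Hypothetical cell value: 47 ASCII bytes followed by the non-ASCII byte 0x92.
raw = b"x" * 47 + b"\x92"


def encode_like_python2(byte_string, default_encoding="ascii"):
    """Mimic what Python 2 does for str.encode(): decode first, then encode."""
    # Step 1 (implicit in Python 2): bytes -> text using the default codec.
    text = byte_string.decode(default_encoding)
    # Step 2 (the call that was actually asked for): text -> UTF-8 bytes.
    return text.encode("utf-8")


try:
    encode_like_python2(raw)
    message = None
except UnicodeDecodeError as exc:
    # The failure comes from step 1, which is why an *encode* call
    # ends up reporting a *decode* error.
    message = str(exc)

print(message)
```

The failure happens in the hidden decode step, before any UTF-8 encoding is attempted.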

So, you're taking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.

The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.
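
A minimal sketch of that explicit round trip follows. The source encoding 'cp1252' is purely an assumption for illustration (0x92 is a right single quotation mark in that code page); substitute whatever encoding your data actually uses. The sample bytes are hypothetical.

```python
# Hypothetical byte string; we ASSUME it was produced with Windows-1252.
raw = b"It\x92s"
source_encoding = "cp1252"  # assumption: replace with the real source encoding

# Explicit decode: bytes -> Unicode text, with the codec named by us
# rather than picked by Python's default.
text = raw.decode(source_encoding)

# Explicit encode: Unicode text -> UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

print(repr(utf8_bytes))
```

Naming the codec in both directions means nothing depends on the interpreter's default encoding.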

But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.
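
As a rough sketch of that shortcut, the loop from the question could write the cells unchanged. This assumes every cell already arrives as a UTF-8 byte string; the sample rows are invented, and io.BytesIO stands in for the StringIO object so the snippet behaves the same on Python 2.7 and Python 3.

```python
import io

# Hypothetical rows, as a MySQL driver might return them when every
# column comes back as an already UTF-8-encoded byte string.
rows = [
    (b"1", b"caf\xc3\xa9"),
    (b"2", b"na\xc3\xafve"),
]

table_data = io.BytesIO()

# Columns separated by tabs, rows separated by newlines,
# with no decode/encode round trip on the cells.
table_data.write(b"\n".join(b"\t".join(row) for row in rows))

# Rewind so that a reader (for example copy_from) starts at the beginning.
table_data.seek(0)

print(repr(table_data.getvalue()))
```

If some columns are not strings (numbers, NULLs), they would need their own conversion before joining; that case is not covered here.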

Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.

In general, the best way to deal with Unicode problems is the last one: decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.
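
One way to picture the "decode early, encode late" pattern is the sketch below. The three helper names, the sample cells, and the strip/upper processing step are all hypothetical; only the shape of the pipeline matters.

```python
def decode_early(raw_cells, source_encoding="utf-8"):
    """Turn incoming byte strings into Unicode text at the boundary."""
    return [cell.decode(source_encoding) for cell in raw_cells]


def process(cells):
    """All intermediate work happens on Unicode text only."""
    return [cell.strip().upper() for cell in cells]


def encode_late(cells, target_encoding="utf-8"):
    """Convert back to bytes only when the data leaves the program."""
    return u"\t".join(cells).encode(target_encoding)


# Hypothetical input: UTF-8 bytes as they might arrive from a driver.
raw_cells = [b" caf\xc3\xa9 ", b"na\xc3\xafve"]

line = encode_late(process(decode_early(raw_cells)))
print(repr(line))
```

Because the middle step never sees bytes, there is no place for an implicit conversion with the default encoding to sneak in.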

As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.
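
The short sketch below illustrates both behaviours. It assumes a Python 3 interpreter; the helper error_name is made up for this example and simply reports which exception type an action raises.

```python
def error_name(action):
    """Run a zero-argument callable and return the raised exception's type name."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return None


# In Python 3, encoding text yields a bytes object.
encoded = "caf\u00e9".encode("utf-8")

# bytes has no encode() method, so there is no silent decode/re-encode.
encode_on_bytes = error_name(lambda: encoded.encode("utf-8"))

# Concatenating text and bytes is rejected rather than implicitly converted.
mixing_types = error_name(lambda: "text: " + encoded)

print(encode_on_bytes, mixing_types)
```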
