Python UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

Question

Asked by AP257

I have an Excel spreadsheet that I'm reading in that contains some £ signs.

When I try to read it in using the xlrd module, I get the following error:

x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.

How can I fix this, and read the £ signs in correctly?

--- UPDATE ---

Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.

If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):

for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
 cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)

I get the same error even if I uncomment the Latin-1 line.

Answer 1

Accepted answer by Alex Martelli

Your code snippet says x.decode, but you're getting an encode error, meaning x is Unicode already, so to "decode" it, it must first be turned into a string of bytes (and that's where the default ascii codec comes in and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.

So what IS it you're doing, and what is it you mean to be doing: encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?

I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me ;-).

If, as it seems, x is unicode, then you never want to "decode" it. You may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purpose. For your own internal program use I recommend sticking with unicode all the time: only encode/decode if and when you absolutely need, or receive, coded byte strings for input/output purposes.
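
The "text inside, bytes at the edges" advice above can be sketched as follows. This is a minimal illustration written for Python 3, where the thread's unicode type is called str and its byte strings are bytes; the sample value and the in-memory buffer are assumptions standing in for the spreadsheet cell and the output file.

```python
import io

# Hypothetical sample value; in the thread this would come from xlrd.
text = "\u00a3100"

# Encode exactly once, at the output boundary.
out = io.BytesIO()
out.write(text.encode("latin-1"))
raw = out.getvalue()

# Decode exactly once, at the input boundary.
round_tripped = raw.decode("latin-1")
```

Everything between those two boundaries works on text only, so no implicit codec ever gets a chance to run.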

Answer 2

Answer by Katriel

xlrd works with Unicode, so the string you get back is a Unicode string. The £ sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.

When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.

>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£

# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'

# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>

Answer 3

Answer by John Machin

For what it's worth: I'm the author of xlrd.

Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of the xlrd doc: "This module presents all text strings as Python unicode objects."
Option 2: print type(text), repr(text)
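
Option 2 can be wrapped in a tiny helper. The sketch below is a Python 3 rendering (the original one-liner is Python 2 syntax), and the function name describe is made up for illustration; it is not part of xlrd.

```python
def describe(value):
    """Return the type name and repr of a value, e.g. a cell read from a sheet."""
    return type(value).__name__, repr(value)


# Text and bytes are easy to tell apart this way.
text_kind = describe("\u00a3")
bytes_kind = describe("\u00a3".encode("latin-1"))
```

If the first element says the value is already text, there is nothing left to decode.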

You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What did you expect?
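
The garbling described above is easy to reproduce. A minimal Python 3 sketch, assuming a single pound sign as the data: the UTF-8 encoding is two bytes, and a latin1 reader shows each byte as its own character.

```python
pound = "\u00a3"

# UTF-8 needs two bytes for U+00A3.
utf8_bytes = pound.encode("utf-8")

# A consumer that expects latin1 turns those two bytes into two characters.
garbled = utf8_bytes.decode("latin-1")
```

The result is the familiar capital A with circumflex followed by the pound sign, which is what "garbled" usually looks like in this situation.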

You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.

Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.

It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.

Answer 4

Answer by dan04

x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

Look closely: You got a Unicode***Encode***Error when calling the decode method.

The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.

In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as

x = x.encode('ascii').decode("ISO-8859-1")

which fails because x contains a non-ASCII character.
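
The hidden step can be made visible by writing it out. The sketch below is Python 3, where text has no decode method, so the implicit ASCII encode that Python 2 performs is spelled explicitly; the variable names are illustrative only.

```python
x = "\u00a3"  # already text, like the value xlrd returns

try:
    # Python 2 performs the .encode("ascii") step silently before decoding.
    x.encode("ascii").decode("iso-8859-1")
    error_name = None
    message = ""
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    message = str(exc)
```

The exception comes from the encode half of the chain, which is why a call to decode reports an encode error.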

Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.

for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)

This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:

Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
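
The four options above can be compared side by side. This is a small Python 3 sketch on a made-up sample string containing both a bullet and a pound sign; only the standard error handlers of str.encode are used.

```python
text = "\u2022 Rent: \u00a3100"  # hypothetical cell value: bullet plus pound sign

# Option 1: silently drop anything Latin-1 cannot represent.
ignored = text.encode("latin-1", "ignore")

# Option 2: substitute a question mark for each unencodable character.
replaced = text.encode("latin-1", "replace")

# Option 3: swap the bullet for a Latin-1 character before encoding.
swapped = text.replace("\u2022", "*").encode("latin-1")

# Option 4: pick an encoding that covers every character.
utf8 = text.encode("utf-8")
```

Only the last option is lossless; the first three change the data in order to fit the encoding.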

These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
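
For readers on Python 3, writing the CSV as UTF-8 might look like the sketch below. This is not the Python 2 code from the question: in Python 3 the csv module accepts text directly and the encoding is chosen when the file is opened. The rows and the file name are invented for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows containing the two troublesome characters.
rows = [["item", "price"], ["\u2022 rent", "\u00a3100"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean.csv")

    # The encoding is a property of the file, not of each string.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

    # Reading back with the same encoding recovers the original text.
    with open(path, "r", encoding="utf-8", newline="") as f:
        read_back = list(csv.reader(f))
```

Passing newline="" follows the csv module's documented recommendation for files handed to a reader or writer.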

Answer 5

Answer by jturnbull

A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to instead use unicodecsv, a drop-in replacement for csvwriter.

Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:

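import unicodecsv

# Open in binary mode: unicodecsv encodes each row to bytes (UTF-8 by default).
with open('users.csv', 'wb') as f:
    w = unicodecsv.writer(f)
    # User appears to be a Django model from the answerer's own project.
    for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
        w.writerow(user)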

Answer 6

Answer by user1866080

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So, I suppose using strings as arguments in certain commands with xlrd has specific encoding problems.
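
A likely explanation, given the other answers, is that str() on a Python 2 unicode value triggers the same implicit ASCII encode. The Python 3 sketch below mimics that with an explicit encode; the German sample text is made up, chosen so that the ß sits at position 3 as in the reported error.

```python
xl_data = "Viel Spa\u00df"   # hypothetical sheet text
cell_value = "Spa\u00df"     # hypothetical cell value, already text

# Searching text within text needs no conversion at all.
index = xl_data.find(cell_value)

# What Python 2's str(cell_value) attempts behind the scenes.
try:
    cell_value.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start
```

Dropping the str() call avoids the conversion entirely, which matches the behaviour reported above.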
