Python: Convert Unicode to ASCII without errors for CSV file

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4650639/

Tags: python, unicode, csv, ascii, diacritics

Asked by Sergi

I've been reading all the questions here on StackOverflow about converting from Unicode to CSV in Python, and I'm still lost. Every time, I receive a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)".

buffer=cStringIO.StringIO()
writer=csv.writer(buffer, csv.excel)
cr.execute(query, query_param)
while (1):
    row = cr.fetchone()
    writer.writerow([s.encode('ascii','ignore') for s in row])

The value of row is

(56, u"LIMPIADOR BA\xd1O 1'5 L")

where the value of \xd1 in the database is Ñ, an N with a diacritical tilde used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I'm now only trying to ignore those characters (I suppose I'd have the same problem with accented vowels).

I'd like to save the value to the CSV, preferably with the Ñ ("LIMPIADOR BAÑO 1'5 L"), but if that is not possible, at least be able to save it ("LIMPIADOR BAO 1'5 L").

Accepted answer by Lennart Regebro

Correct, Ñ is not a valid ASCII character, so you can't encode it to ASCII. You can ignore such characters, as your code above does. Another option is to remove the accents; see the question "What is the best way to remove accents in a Python unicode string?"
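
As a rough illustration of the two options, here is a minimal sketch. It assumes Python 3 (where every str is Unicode, unlike the Python 2 code in the question), and strip_accents is a hypothetical helper name, not something taken from the linked question.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


value = "LIMPIADOR BA\xd1O 1'5 L"

# Option 1: silently drop anything that is not ASCII.
ignored = value.encode("ascii", "ignore").decode("ascii")

# Option 2: keep the base letter and drop only the diacritic.
stripped = strip_accents(value).encode("ascii", "ignore").decode("ascii")

print(ignored)   # LIMPIADOR BAO 1'5 L
print(stripped)  # LIMPIADOR BANO 1'5 L
```

Neither result is the original word: "BAO" loses a letter, and "BANO" swaps Ñ for a different letter.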

But note that both techniques can have bad effects, such as making words mean something different. So the best option is to keep the accents. Then you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1 (ISO-8859-1) is a common one, but it includes only Western European characters. CP-1252 is common on Windows, and so on.

So just switch "ascii" for whatever encoding you want.
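
To make the effect of that switch concrete, the sketch below writes the row from the question with a few different encodings. It assumes Python 3, where the csv module works on text and the encoding is applied afterwards; row_to_csv_bytes is a hypothetical helper used only for this illustration.

```python
import csv
import io

row = (56, "LIMPIADOR BA\xd1O 1'5 L")


def row_to_csv_bytes(row, encoding):
    """Format one row as CSV text, then encode it with the given codec."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, csv.excel).writerow(row)
    return buffer.getvalue().encode(encoding)


utf8_bytes = row_to_csv_bytes(row, "utf-8")      # Ñ becomes two bytes
latin1_bytes = row_to_csv_bytes(row, "latin-1")  # Ñ becomes one byte

print(utf8_bytes)    # b"56,LIMPIADOR BA\xc3\x91O 1'5 L\r\n"
print(latin1_bytes)  # b"56,LIMPIADOR BA\xd1O 1'5 L\r\n"

try:
    row_to_csv_bytes(row, "ascii")
except UnicodeEncodeError as exc:
    # ASCII has no code for Ñ, so this is the error from the question.
    print("ascii fails:", exc.reason)
```

Whichever encoding is chosen, whoever reads the file has to decode it with the same one.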

Your actual code, according to your comment, is:

writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row]) 

where

row = (56, u"LIMPIADOR BA\xd1O 1'5 L")

Now, I believe that should work, but apparently it doesn't. I think unicode gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:

col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row) 

Now your real error will not be hidden by the fact that you stick everything on the same line. This could probably also have been avoided if you had included a proper traceback.
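
The same idea can be taken one step further by encoding each cell on its own, so that a failure names the column it came from. This is only a sketch under assumptions: it is written for Python 3 (str instead of the Python 2 unicode type), and encode_cells is a hypothetical helper, not part of the csv module.

```python
def encode_cells(row, encoding="utf-8"):
    """Encode each text cell separately so a failure names its column."""
    encoded = []
    for index, cell in enumerate(row):
        if isinstance(cell, str):
            try:
                cell = cell.encode(encoding)
            except UnicodeEncodeError as exc:
                # Re-raise with the column position; the original error
                # stays attached, so the full traceback is preserved.
                raise ValueError(
                    "column %d (%r) cannot be encoded as %s"
                    % (index, cell, encoding)
                ) from exc
        encoded.append(cell)  # non-text cells such as 56 pass through as-is
    return encoded


row = (56, "LIMPIADOR BA\xd1O 1'5 L")
utf8_cells = encode_cells(row, "utf-8")

try:
    encode_cells(row, "ascii")
    failure = None
except ValueError as exc:
    failure = str(exc)

print(utf8_cells)
print(failure)
```

With "utf-8" every cell encodes cleanly; with "ascii" the message points at column 1 instead of leaving you to guess which value was at fault.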