
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/18171739/

Date: 2020-08-19 10:02:28  Source: igfitidea

UnicodeDecodeError when reading CSV file in Pandas with Python

Tags: python, pandas, csv, dataframe, unicode

Asked by TravisVOX

I'm running a program which is processing 30,000 similar files. A random number of them stop and produce this error:


   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

These files are all created by and come from the same place. What's the best way to correct this so the import can proceed?


Accepted answer by Stefan

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.


You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see the Python docs, which also list numerous other encodings you may encounter).


See the relevant Pandas documentation, the Python docs examples on CSV files, and the many related questions here on SO. A good background resource is What every developer should know about unicode and character sets.


To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see its man page), or file -i (Linux) / file -I (OS X) (see the man page).

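Those command-line tools aside, a rough in-Python sketch of the same idea (purely illustrative; the candidate list here is just the encodings mentioned above, not an exhaustive detector) is to try candidates in order and keep the first that decodes cleanly:

```python
# Try candidate encodings in order; return the first that decodes cleanly.
# Note that ISO-8859-1/latin-1 maps every possible byte to a character, so
# as a final candidate it always "succeeds" -- treat it as a fallback,
# not as real detection.
def guess_encoding(raw_bytes, candidates=("utf-8", "cp1252", "iso-8859-1")):
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding(b"caf\xe9"))   # b"\xe9" is not valid UTF-8 here -> cp1252
```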

Answer by Gil Baggio

Simplest of all Solutions:


import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Alternate Solution:


  • Open the csv file in Sublime text editor.
  • Save the file in utf-8 format.

In Sublime, click File -> Save with Encoding -> UTF-8


Then, you can read your file as usual:


import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

Other encodings that are often worth trying are:


encoding = "cp1252"
encoding = "ISO-8859-1"

Answer by J. Ternent

Struggled with this for a while and thought I'd post on this question as it's the first search result. Adding encoding="iso-8859-1" to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.


If you're passing a file handle to pd.read_csv(), you need to put the encoding argument on the open call, not in read_csv. Obvious in hindsight, but a subtle error to track down.


Answer by Serge Ballesta

Pandas lets you specify the encoding, but it does not let you ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.


  1. You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
    
  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code point):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
    
  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and contains some lines with a different encoding. Pandas has no provision for special error processing, but Python's open function has one (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)
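For intuition, here is a small illustrative demo (using the stray 0xda byte from the question's traceback) of what the two errors modes do with invalid bytes:

```python
# A byte string containing 0xda, which is not valid UTF-8 in this position.
bad = b"abc\xdadef"

# 'ignore' silently drops the offending byte:
print(bad.decode("utf-8", errors="ignore"))            # -> abcdef

# 'backslashreplace' keeps it visible as an escape sequence:
print(bad.decode("utf-8", errors="backslashreplace"))  # -> abc\xdadef
```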
    

Answer by bhavesh

with open('filename.csv') as f:
    print(f)

After executing this code, the printed file-object repr shows an encoding attribute; then read the file with that encoding:


data = pd.read_csv('filename.csv', encoding="encoding as you found earlier")

There you go.

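One caveat worth noting: the repr printed above shows the encoding Python chose when opening the file (the locale default, unless you passed encoding=), not an encoding detected from the file's bytes. A self-contained sketch, using a hypothetical temporary file:

```python
import tempfile

# Write a small CSV to a temporary file (delete=False so it can be reopened).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,name\n1,a\n")

with open(tmp.name) as f:
    print(f)           # repr includes something like encoding='UTF-8'
    print(f.encoding)  # the encoding open() is using, not one read from the file
```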

Answer by nbwoodward

This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:


>>> from csv import DictReader
>>> f = open(filename, "r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:


Python read csv - BOM embedded into the first key


The solution is to load the CSV with encoding="utf-8-sig":


>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])
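The same fix works when reading with pandas directly. A minimal sketch with illustrative in-memory CSV bytes (a UTF-8 BOM followed by the data):

```python
import io

import pandas as pd

# b"\xef\xbb\xbf" is the UTF-8 encoding of the BOM; "utf-8-sig" strips it,
# so the first column name comes back as 'id' rather than '\ufeffid'.
raw = b"\xef\xbb\xbfid,name\n1,alice\n"
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")
print(df.columns[0])   # -> id
```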

Hopefully this helps someone.


Answer by Vodyanikov Andrew Anatolevich

In my case, a file had UCS-2 LE BOM encoding, according to Notepad++. For Python, that is encoding="utf_16_le".

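For intuition, a small sketch (on illustrative bytes with a little-endian BOM) of the difference between Python's "utf-16" and "utf_16_le" codecs:

```python
text = "id,name\n1,a\n"
raw = b"\xff\xfe" + text.encode("utf_16_le")   # LE BOM + UTF-16-LE payload

# "utf-16" reads the BOM to pick the byte order and consumes it:
assert raw.decode("utf-16") == text

# "utf_16_le" assumes little-endian and leaves the BOM as U+FEFF:
assert raw.decode("utf_16_le") == "\ufeff" + text
```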

Hope it helps someone find an answer a bit faster.


Answer by tshirtdr1

I am posting an update to this old thread. I found one solution that worked, but it requires opening each file. I opened my csv file in LibreOffice and chose Save As > Edit filter settings. In the drop-down menu I chose UTF-8 encoding. Then I added encoding="utf-8-sig" to data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep=',', encoding="utf-8-sig").


Hope this helps someone.


Answer by Jan33

Try specifying engine='python'. It worked for me, but I'm still trying to figure out why.


df = pd.read_csv(input_file_path, ..., engine='python')

Answer by Himanshu Sharma

I am using Jupyter Notebook. In my case, it was showing the file in the wrong format, and the 'encoding' option was not working. So I saved the csv in utf-8 format, and it works.
