Python UnicodeDecodeError: ('utf-8' codec) while reading a csv file
Original question: http://stackoverflow.com/questions/33819557/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
UnicodeDecodeError: ('utf-8' codec) while reading a csv file
Asked by Satya
What I am trying to do is read a csv into a dataframe, change a column, write the changed values back to the same csv (to_csv), and then read that csv again into another dataframe. At that point I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
My code is:
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns  # output: Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2  # change one column of type float
df.to_csv("D:\ss.csv")  # write the changes back to the same .csv
df1 = pd.read_csv("D:\ss.csv")  # try to read that csv again
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how I can avoid the error and read that csv back into a dataframe.
I know that somewhere I am missing "encode = some codec type" or "decode = some type" while reading and writing the csv.
But I don't know exactly what should be changed, so I need help.
Answered by Mangu Singh Rajpurohit
Yes, you'll get this error. I worked around this problem by opening the csv file in Notepad++ and changing the encoding through the Encoding menu -> Convert to UTF-8, then saving the file and running the Python program on it again.
Another solution is to use the codecs module in Python to encode and decode files. I haven't used that.
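The same conversion can probably be done from Python without an editor. The sketch below is not from the original answer. It uses the built-in open() with an explicit encoding rather than the codecs module, and it assumes the source file is cp1252, which is only a guess. The helper name convert_to_utf8 and the demo file names are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file to UTF-8. src_encoding is an assumption."""
    # newline="" keeps the line endings exactly as they are in the source.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a throwaway file that contains a cp1252-encoded c with cedilla.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "ss_cp1252.csv")
dst = os.path.join(tmp_dir, "ss_utf8.csv")
with open(src, "wb") as f:
    f.write("name,value\nFran\u00e7ois,1.5\n".encode("cp1252"))

convert_to_utf8(src, dst)

with open(dst, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8"))
```

Writing to a separate destination file keeps the original intact in case the guessed source encoding turns out to be wrong.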
Answered by rmunn
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
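To see why a lone 0xE7 byte trips the UTF-8 decoder, here is a small sketch using only built-in bytes methods. The sample bytes are made up for illustration and are not the asker's data.

```python
# Made-up sample: "François" with the cedilla stored as the single byte 0xE7.
raw = b"Fran\xe7ois"

# In UTF-8, 0xE7 announces a multi-byte sequence, so the plain ASCII byte
# that follows it is not a valid continuation byte and decoding fails.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(utf8_error)

# In cp1252 and Latin-1 the same byte maps directly to c with cedilla.
print(raw.decode("cp1252"))
print(raw.decode("latin-1"))
```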
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
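A quick way to check the raw-string point is to compare the spellings side by side. This is a minimal sketch; the variable names are only for illustration.

```python
plain = "D:\new-ss.csv"      # "\n" is interpreted as a newline character
raw = r"D:\new-ss.csv"       # raw string: the backslash is kept literally
escaped = "D:\\new-ss.csv"   # doubling the backslash gives the same result

print(repr(plain))
print(repr(raw))
print(raw == escaped)
```

The plain version is one character shorter than the raw one, because the two characters \ and n collapse into a single newline.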
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
Answered by MaxNoe
Known encoding
If you know the encoding of the file you want to read in, you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try to use chardet; however, this is not guaranteed to work. It is more of a guess.
import chardet
import pandas as pd
# Read the raw bytes so chardet can guess the encoding
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
# Pass the guessed encoding on to pandas
pd.read_csv('filename.csv', encoding=result['encoding'])
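If chardet is not available, a cruder guess can be made with the standard library alone by trying a short list of candidate encodings in order. This is only a sketch and not part of the original answer. The helper name guess_encoding and the candidate list are assumptions, and a decode that succeeds does not prove the encoding is correct (Latin-1 in particular accepts any byte sequence).

```python
def guess_encoding(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample bytes, encoded as cp1252 for the demonstration.
sample = "CUSTOMER_MAILID,True\nfran\u00e7ois@example.com,2.0\n".encode("cp1252")
guessed = guess_encoding(sample)
print(guessed)
```

The returned name could then be passed as the encoding argument of read_csv. Putting utf-8 first matters, because the single-byte encodings further down the list will accept almost anything.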
Answered by Krishnaa
One simple solution is to open the csv file in an editor like Sublime Text and save it with 'utf-8' encoding. Then you can easily read the file with pandas.
Answered by Matt
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, when I opened the Excel file and saved it as a csv file instead, it seemed to work.
Answered by Abhishek
The method above, importing chardet and then detecting the file's encoding, works:
import pandas as pd
import chardet
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])