Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43786852/
Python pandas load csv ANSI Format as UTF-8
Asked by MBUser
I want to load a CSV file with pandas in Jupyter Notebooks which contains characters like ä, ö, ü, ß.
When I open the CSV file with Notepad++, here is one example row which causes trouble in ANSI format:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
Now, when I load the CSV data in Python 3.6 pandas on Windows with the following code:
df_a = pd.read_csv('file.csv', sep=';', encoding='utf-8')
I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
Position 'xy' is the position of the character that causes the error message.
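A quick way to inspect that offending byte, assuming the file contains bytes like the sample row above (the byte string here is synthetic, standing in for the real file contents):

```python
# Diagnostic sketch: find the byte UTF-8 rejects, then see what that byte
# means in a few candidate single-byte codecs.
data = b"Empf\x84ngerStra\xe1e"  # plausible cp850 bytes, not the real file

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    bad = data[e.start]
    print(hex(bad))  # the offending byte at position e.start
    for codec in ("cp1252", "cp850", "latin1"):
        # how each candidate codec would render that single byte
        print(codec, bytes([bad]).decode(codec, "replace"))
```

The codec in which the byte decodes to a plausible German character is a strong hint at the file's real encoding.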
When I use the ANSI format to load my CSV file, it works, but the umlauts are displayed incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
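This mojibake can be reproduced under the assumption (made explicit in the last answer below) that the bytes were actually written as cp850, while "ANSI" on this system means cp1252:

```python
# The same bytes decoded two ways: cp1252 (what the "ANSI" view shows)
# versus cp850 (the presumed real encoding of the file).
raw = "Empfänger".encode("cp850")
print(raw.decode("cp1252"))  # Empf„nger  (the broken rendering)
print(raw.decode("cp850"))   # Empfänger  (the intended text)
```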
Note: I have tried converting the file to UTF-8 in Notepad++ and loading it afterwards with the pandas module, but I still get the same error.
I have searched online for a solution, but the suggested fixes, such as "change the format in Notepad++ to UTF-8", or "use encoding='UTF-8'", or 'latin1' (which gives me the same result as the ANSI format), or
import chardet
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())  # note: a single line may be too little data for reliable detection
df_a = pd.read_csv('afile.csv', sep=';', encoding=result['encoding'])
didn't work for me.
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
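That particular failure is expected: cp1252 leaves five byte values undefined, whereas latin1 maps all 256 bytes, which is why latin1 "works" (never errors) even when it produces the wrong characters:

```python
# cp1252 has holes at these byte values; decoding any of them strictly fails.
for b in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
    try:
        bytes([b]).decode("cp1252")
        print(hex(b), "mapped")
    except UnicodeDecodeError:
        print(hex(b), "undefined in cp1252")

# latin1 maps every byte, so it can never raise (0x81 becomes a control char).
print(repr(bytes([0x81]).decode("latin1")))
```

(Incidentally, 0x81 is exactly the cp850 code for ü, which fits the report above that ü-bearing rows are the troublesome ones.)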
I also tried to replace strings afterwards with the x.replace() method, but the character ü disappears completely after being loaded into a pandas DataFrame.
Answered by Quoc Truong
You could use the encoding value UTF-16LE to solve the problem:
pd.read_csv("./file.csv", encoding="UTF-16LE")
The file.csv should be saved with encoding UTF-16LE by Notepad++, using the option UCS-2 LE BOM.
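A minimal in-memory round-trip of this suggestion (synthetic data; the real file would be read from disk the same way):

```python
import io
import pandas as pd

# Encode a semicolon-separated table as UTF-16LE, then read it back
# with the matching encoding, as the answer proposes.
csv_text = "Empfänger;EmpfängerStraße\nMüller;Hauptstraße 1\n"
buf = io.BytesIO(csv_text.encode("UTF-16LE"))

df = pd.read_csv(buf, sep=";", encoding="UTF-16LE")
print(df.columns.tolist())  # ['Empfänger', 'EmpfängerStraße']
```

Note this only works if the file really is saved as UTF-16LE first; it does not repair a file whose bytes are in some single-byte codepage.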
Best,
Answered by MBUser
I couldn't find a proper solution after trying all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and from Windows-1250 to Windows-1258; nothing worked properly. So my guess is that the text encoding got corrupted during the export. My own solution is to load the text file into a DataFrame with Windows-1251, since that does not cut out the special characters in my text file, and then replace all broken characters with the corresponding correct ones. It's a rather unsatisfying solution that takes a lot of time to compute, but it's better than nothing.
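A short sketch of that replace-afterwards approach (the column name and the broken-to-correct mapping here are hypothetical; the real mapping has to be discovered by inspecting the loaded data):

```python
import pandas as pd

# Broken text as it might look after loading with a codec that keeps
# every byte; the fixes dict maps each mojibake character to its intended one.
df = pd.DataFrame({"name": ["Empf„nger", "Straáe"]})
fixes = {"„": "ä", "á": "ß"}

for broken, good in fixes.items():
    df["name"] = df["name"].str.replace(broken, good, regex=False)
print(df["name"].tolist())  # ['Empfänger', 'Straße']
```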
Answered by BlackHyman
When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (or, more precisely, cp1250 in this case), then the actual encoding of the data is most likely cp850:
print('Empf„ngerStraáe'.encode('cp1250').decode('cp850'))  # EmpfängerStraße
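Following that diagnosis, reading the data directly as cp850 should recover the text without any repair step; a synthetic in-memory check (the real file would simply be passed by path):

```python
import io
import pandas as pd

# If the file's bytes really are cp850, read_csv with that encoding
# decodes the umlauts and ß correctly on the first pass.
raw = "Empfänger;EmpfängerStraße\nMüller;Hauptstraße 1\n".encode("cp850")
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp850")
print(df.columns.tolist())  # ['Empfänger', 'EmpfängerStraße']
```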