Python pandas load csv ANSI Format as UTF-8

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43786852/

Date: 2020-09-14 03:32:08  Source: igfitidea


Tags: python, csv, pandas, decode

Asked by MBUser

I want to load a CSV file with pandas in Jupyter Notebooks which contains characters like ä, ö, ü, ß.


When I open the CSV file with Notepad++, here is an example row that causes trouble in ANSI format:


Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand

The correct UTF-8 outcome for Empf„nger should be: Empfänger


Now when I load the CSV data in Python 3.6 pandas on Windows with the following code:


df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')

I get an error message:


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte

Position 'xy' is the position of the character that triggers the error.

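As a side note, byte 0xE1 is exactly what you would see if the file were cp850-encoded (where 0xE1 is "ß") but decoded as UTF-8. A minimal sketch with a made-up byte string:

```python
# Hypothetical byte sequence: "Straße" encoded in cp850 (0xE1 = "ß").
raw = b"Stra\xe1e"

try:
    raw.decode("utf-8")          # fails: 0xE1 starts a multi-byte UTF-8
except UnicodeDecodeError as e:  # sequence, but "e" is no continuation byte
    print(e)

print(raw.decode("cp850"))       # Straße
```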

When I use the ANSI encoding to load my CSV file, it works, but the umlauts are displayed incorrectly.


Example code:


df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')

Empfänger is represented as: Empf„nger


Note: I have tried to convert the file to UTF-8 in Notepad++ and load it afterwards with pandas, but I still get the same error.


I have searched online for a solution, but the suggested fixes, such as "change the format in Notepad++ to UTF-8", "use encoding='UTF-8'" or 'latin1' (which gives me the same result as the ANSI format), or


import chardet

with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())

df_a = pd.read_csv('afile.csv',sep=';',encoding=result['encoding'])

didn't work for me.


encoding='cp1252'

throws the following exception:


UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>

I also tried to replace strings afterwards with the str.replace() method, but the character ü disappears completely after the file is loaded into a pandas DataFrame.


Answered by Quoc Truong

You could use the encoding value UTF-16LE to solve the problem:


pd.read_csv("./file.csv", encoding="UTF-16LE")

The file.csv should be saved with encoding UTF-16LE in Notepad++ (option "UCS-2 LE BOM").

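A minimal end-to-end sketch of this workflow (file name and sample content are made up; the file is written without a BOM, which is what Python's UTF-16LE codec produces):

```python
# Save sample data as UTF-16LE, then read it back with pandas.
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.gettempdir(), "file_utf16.csv")
with open(path, "w", encoding="UTF-16LE", newline="") as f:
    f.write("Empfänger;Stadt\nMüller;Köln\n")

df = pd.read_csv(path, sep=";", encoding="UTF-16LE")
print(df["Empfänger"].iloc[0])  # Müller
```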

Best,


Answered by MBUser

I couldn't find a proper solution after trying all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and from Windows-1250 to Windows-1258; nothing worked properly. So my guess is that the text encoding was corrupted during the export. My own solution is to load the text file into a DataFrame with Windows-1251, as it does not strip the special characters from my text file, and then replace all broken characters with the correct ones. It's a rather unsatisfying solution that takes a lot of time to compute, but it's better than nothing.

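The replace-afterwards approach could be sketched like this; the mapping below is an assumption based on the broken characters seen in this question (cp850 text displayed as cp1250/cp1252):

```python
# Map each broken sequence back to the intended character.
fixes = {"„": "ä", "á": "ß"}  # extend as more broken characters turn up

def repair(text: str) -> str:
    for broken, correct in fixes.items():
        text = text.replace(broken, correct)
    return text

print(repair("Empf„ngerStraáe"))  # EmpfängerStraße
```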

Answered by BlackHyman

When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (more precisely, cp1250 in this case), then the actual encoding of the data is most likely cp850:


# Python 3 equivalent of the original Python 2 one-liner:
print('Empf„ngerStraáe'.encode('cp1250').decode('cp850'))  # EmpfängerStraße
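If that diagnosis is right, the file can simply be read with encoding='cp850'. A sketch that simulates such an export in memory (sample content is made up):

```python
# Simulate a cp850-encoded export and read it directly with pandas.
import io

import pandas as pd

raw = "Empfänger;EmpfängerStraße\nMüller;Hauptstraße 1\n".encode("cp850")
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp850")
print(df.columns.tolist())  # ['Empfänger', 'EmpfängerStraße']
```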