Python pandas load csv ANSI Format as UTF-8

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43786852/

Date: 2020-09-14 03:32:08  Source: igfitidea


Tags: python, csv, pandas, decode

Asked by MBUser

I want to load a CSV file with pandas in Jupyter Notebooks which contains characters like ä, ö, ü, ß.


When I open the CSV file with Notepad++, here is an example row that causes trouble in ANSI format:


Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand

The correct UTF-8 outcome for Empf„nger should be: Empfänger


Now when I load the CSV data in Python 3.6 pandas on Windows with the following code:


df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')

I get an error message:


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte

Position 'xy' is the position of the character that triggers the error.

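As a side note, byte 0xE1 is exactly what you would see if the file were cp850-encoded (where 0xE1 is "ß") but decoded as UTF-8. A minimal sketch with a made-up byte string:

```python
# Hypothetical byte sequence: "Straße" encoded in cp850 (0xE1 = "ß").
raw = b"Stra\xe1e"

try:
    raw.decode("utf-8")          # fails: 0xE1 starts a multi-byte UTF-8
except UnicodeDecodeError as e:  # sequence, but "e" is no continuation byte
    print(e)

print(raw.decode("cp850"))       # Straße
```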

When I use the ANSI encoding to load my CSV file, it works, but the umlauts are displayed incorrectly.


Example code:


df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')

Empfänger is represented as: Empf„nger


Note: I have tried to convert the file to UTF-8 in Notepad++ and load it afterwards with pandas, but I still get the same error.


I have searched online for a solution, but the suggested fixes, such as "change the format in Notepad++ to UTF-8", "use encoding='UTF-8'" or 'latin1' (which gives me the same result as the ANSI format), or


import chardet

with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())

df_a = pd.read_csv('afile.csv',sep=';',encoding=result['encoding'])

didn't work for me.


encoding='cp1252'

throws the following exception:


UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>

I also tried to replace strings afterwards with the str.replace() method, but the character ü disappears completely after the file is loaded into a pandas DataFrame.


Answered by Quoc Truong

You could use the encoding value UTF-16LE to solve the problem:


pd.read_csv("./file.csv", encoding="UTF-16LE")

The file.csv should be saved with encoding UTF-16LE in Notepad++ (option "UCS-2 LE BOM").

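A minimal end-to-end sketch of this workflow (file name and sample content are made up; the file is written without a BOM, which is what Python's UTF-16LE codec produces):

```python
# Save sample data as UTF-16LE, then read it back with pandas.
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.gettempdir(), "file_utf16.csv")
with open(path, "w", encoding="UTF-16LE", newline="") as f:
    f.write("Empfänger;Stadt\nMüller;Köln\n")

df = pd.read_csv(path, sep=";", encoding="UTF-16LE")
print(df["Empfänger"].iloc[0])  # Müller
```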

Best,


Answered by MBUser

I couldn't find a proper solution after trying all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and from Windows-1250 to Windows-1258; nothing worked properly. So my guess is that the text encoding was corrupted during the export. My own solution is to load the text file into a DataFrame with Windows-1251, as it does not strip the special characters from my text file, and then replace all broken characters with the correct ones. It's a rather unsatisfying solution that takes a lot of time to compute, but it's better than nothing.

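The replace-afterwards approach could be sketched like this; the mapping below is an assumption based on the broken characters seen in this question (cp850 text displayed as cp1250/cp1252):

```python
# Map each broken sequence back to the intended character.
fixes = {"„": "ä", "á": "ß"}  # extend as more broken characters turn up

def repair(text: str) -> str:
    for broken, correct in fixes.items():
        text = text.replace(broken, correct)
    return text

print(repair("Empf„ngerStraáe"))  # EmpfängerStraße
```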

Answered by BlackHyman

When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (more precisely, cp1250 in this case), then the actual encoding of the data is most likely cp850:


# Python 3 equivalent of the original Python 2 one-liner:
print('Empf„ngerStraáe'.encode('cp1250').decode('cp850'))  # EmpfängerStraße
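If that diagnosis is right, the file can simply be read with encoding='cp850'. A sketch that simulates such an export in memory (sample content is made up):

```python
# Simulate a cp850-encoded export and read it directly with pandas.
import io

import pandas as pd

raw = "Empfänger;EmpfängerStraße\nMüller;Hauptstraße 1\n".encode("cp850")
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp850")
print(df.columns.tolist())  # ['Empfänger', 'EmpfängerStraße']
```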