Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/23594878/

Date: 2020-08-19 03:13:12  Source: igfitidea

Pandas dataframe and character encoding when reading excel file

Tags: python, excel, character-encoding, pandas

Asked by Luis Miguel

I am reading an Excel file that contains several numerical and categorical columns. The column name_string contains characters in a foreign language. When I look at the contents of the name_string column, I get the values I expect, but the foreign characters (which are displayed correctly in the Excel spreadsheet) appear with the wrong encoding. Here is what I have:

import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string

Producing the following:

array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)

In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?

Accepted answer by unutbu

Actually, the data is being parsed correctly into unicode, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of the items in the sequence. So instead of seeing the printed version of the unicode, you see the repr:

In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner
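
As a small illustration of the same point, here is a sketch for Python 3, where every str is already Unicode and the u prefix is optional. It checks that the escaped form and the accented form are one and the same string. Python 3's repr shows the accented character directly, so the built-in ascii() is used to reproduce the escaped display seen in the question; the variable names are only for illustration.

```python
# Python 3 sketch: the escape \xe1 and the character á are the same code point.
name = 'Cristina Fern\xe1ndez de Kirchner'

# The string already holds the correct character; nothing is mis-encoded.
same = (name == 'Cristina Fernández de Kirchner')

# ascii() gives an escaped, repr-like display (similar to Python 2's repr).
escaped = ascii(name)
print(escaped)

# Printing elements one by one shows the readable text instead of the repr.
names = ['Chuck Norris', name]
for item in names:
    print(item)
```

The escape sequence is only a way of displaying the character, so no re-encoding or decoding step is needed to "fix" the data.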

The purpose of the repr is to provide an unambiguous string representation for each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
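
To make the ambiguity concrete, here is a minimal Python 3 sketch with two strings that usually look identical when printed but are not equal. The zero-width space (U+200B) is an example chosen for illustration, not something taken from the question's data.

```python
# Two strings that look identical when printed but are not equal.
plain = 'Abab'
hidden = 'Abab\u200b'   # ends with a zero-width space (invisible when printed)

print(plain)
print(hidden)           # typically renders the same as the line above

# repr makes the difference visible by escaping the invisible character.
print(repr(plain))
print(repr(hidden))

different = (plain != hidden)
```

This is why containers such as lists and NumPy arrays show the repr of their items: a stray invisible character is easy to spot there and hard to spot in printed output.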

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:

In [157]: df = pd.DataFrame({'foo':np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
   .....:        u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
   .....:        u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
   .....:        u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})

In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]
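
Applied to the question's workflow, one way to see the sorted unique names as readable text is to print them as strings rather than displaying the array. The following is a sketch for Python 3 with pandas installed; the small DataFrame is a hypothetical stand-in for the spreadsheet, which is not available here.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet column in the question.
df = pd.DataFrame({'name_string': ['Chuck Norris', '4th of July',
                                   'Cristina Fern\xe1ndez de Kirchner',
                                   'Chuck Norris']})

# Sorted unique values as plain Python strings.
names = sorted(df['name_string'].unique())

# Printing a joined string (rather than the array) shows readable text.
print('\n'.join(names))

# The DataFrame's own text rendering is readable as well.
table_text = df.to_string()
print(table_text)
```

Either approach only changes how the values are displayed; the underlying strings are the same as the ones in the array shown in the question.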