Original URL: http://stackoverflow.com/questions/23594878/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Pandas dataframe and character encoding when reading excel file
Asked by Luis Miguel
I am reading an Excel file that contains several numerical and categorical columns. The column name_string contains characters in a foreign language. When I try to view the contents of the name_string column, I get the results I want, but the foreign characters (which are displayed correctly in the Excel spreadsheet) are displayed with the wrong encoding. Here is what I have:
import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string
Producing the following:
array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)
In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?
Accepted answer by unutbu
Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of the items in the sequence. So instead of seeing the printed version of the unicode, you see the repr:
In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\xe1ndez de Kirchner'"
In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner
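The same distinction can be reproduced without pandas. The following is a minimal sketch, assuming Python 3, where every str is already unicode and repr no longer escapes printable non-ASCII characters. The built-in ascii() is used here only to mimic the escaped repr that the Python 2 session above shows; the variable names are illustrative and not taken from the question.

```python
# Assumes Python 3; `names` is an illustrative list, not data from the question.
names = [u'Chuck Norris', u'Cristina Fern\xe1ndez de Kirchner']

# Converting a container to text uses the repr of each item,
# which is why the quotes appear inside the brackets.
container_text = str(names)

# ascii() escapes non-ASCII characters, similar to what repr shows in Python 2.
escaped = ascii(names[1])

# Converting a single item to text gives the readable characters.
item_text = str(names[1])

print(container_text)
print(escaped)
print(item_text)
```

The first two prints show quoted items (the second with the \xe1 escape), while the last one shows the name as plain text.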
The purpose of the repr is to provide an unambiguous string representation for each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
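To illustrate the ambiguity, here is a small sketch, assuming Python 3, with two made-up strings that look alike when printed but differ by one hard-to-see character (a no-break space, U+00A0, in place of an ordinary space).

```python
# Two strings that look alike when printed; the values are made up for illustration.
plain = 'Chuck Norris'        # ordinary space (U+0020)
tricky = 'Chuck\xa0Norris'    # no-break space (U+00A0)

print(plain)
print(tricky)

# The strings are not equal, and repr makes the difference visible.
print(plain == tricky)
print(repr(plain))
print(repr(tricky))
```

The printed forms are hard to tell apart, but the repr of the second string spells out the \xa0 escape.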
If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:
In [157]: df = pd.DataFrame({'foo':np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
   .....: u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
   .....: u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
   .....: u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})
In [158]: df
Out[158]:
foo
0 4th of July
1 911
2 Abab
3 Abass
4 Abcar
5 Abced
6 Ceded
7 Cedes
8 Cedfus
9 Ceding
10 Cedtim
11 Cedtol
12 Cedxer
13 Chevrolet Corvette
14 Chuck Norris
15 Cristina Fernández de Kirchner
[16 rows x 1 columns]
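If you only want to inspect the values of a plain sequence, such as the array returned by unique(), one option that does not require building a DataFrame is to print the items one by one. This is a minimal sketch, assuming Python 3; the shortened list below is an illustrative stand-in for name_string, not the actual data.

```python
# Stand-in for the array from the question (shortened for illustration).
name_string = [u'911', u'Chuck Norris', u'Cristina Fern\xe1ndez de Kirchner']

# Printing each item separately shows its text rather than its repr.
for name in name_string:
    print(name)

# Joining the items first gives the same readable text as a single string.
readable = '\n'.join(name_string)
```

Either way the items are converted to text individually, so the container's repr is never involved.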