Python: Pandas DataFrame and character encoding when reading an Excel file

Question

Asked by Luis Miguel

I am reading an Excel file that has several numerical and categorical columns. The column name_string contains characters in a foreign language. When I look at the content of the name_string column, I get the values I expect, but the foreign characters (which are displayed correctly in the Excel spreadsheet) are displayed with the wrong encoding. Here is what I have:

import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string

Producing the following:

array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)

In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?

Answer 1

Accepted answer by unutbu

Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of the items in the sequence. So instead of seeing the printed version of the unicode, you see the repr:

In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner

The purpose of the repr is to provide an unambiguous string representation for each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
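To make the ambiguity concrete, here is a minimal sketch written for Python 3 (an assumption; the session above is Python 2). The variable names and the choice of a zero-width space (U+200B) are illustrative only. The two strings look identical when printed, but their reprs differ:

```python
# Two strings that look the same on screen but are not equal.
plain = 'Abab'
hidden = 'Abab\u200b'  # same letters plus an invisible zero-width space

# The printed versions are visually indistinguishable.
print(plain)
print(hidden)

# The strings are nevertheless different objects with different lengths.
print(plain == hidden)           # False
print(len(plain), len(hidden))   # 4 5

# repr escapes the unprintable character, so the difference becomes visible.
print(repr(plain))   # 'Abab'
print(repr(hidden))  # 'Abab\u200b'
```

This is why containers show the repr of their items: a stray invisible character in a name would otherwise go unnoticed.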

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:

In [157]: df = pd.DataFrame({'foo':np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})
In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]
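The session above is Python 2. As a hedged sketch for readers on Python 3, where every str is already Unicode and repr leaves printable non-ASCII characters as they are, the built-in ascii() reproduces the escaped form, and printing items one by one shows the text itself. The plain list below is only a stand-in for the array returned by unique(); it is not the asker's data set.

```python
# Stand-in for the last few values of df.name_string.unique() (illustrative).
names = ['Chevrolet Corvette', 'Chuck Norris',
         'Cristina Fern\xe1ndez de Kirchner']
name = names[-1]

# ascii() escapes non-ASCII characters, much like Python 2's repr of a unicode.
escaped = ascii(name)
print(escaped)  # 'Cristina Fern\xe1ndez de Kirchner'

# The string itself holds the real character; only its display was escaped.
print(name == 'Cristina Fernández de Kirchner')  # True

# str() of a container is built from the repr of each item.
as_list = str(names)

# To see the text rather than the reprs, print the items individually.
for item in names:
    print(item)
```

The same loop works on the sorted array from the question, since iterating over it yields the individual strings.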
