Reading an Excel file into a Pandas DataFrame with read_excel and converters produces a numeric column with object dtype

Question

Asked by Krzysztof Słowiński

I am reading the United Nations Energy Indicators Excel file using this code snippet:

def convert_energy(energy):
    if isinstance(energy, float):
        return energy*1000000
    else:
        return energy

def energy_df():
    return pd.read_excel(
        "Energy Indicators.xls",
        skiprows=17,
        skip_footer=38,
        usecols=[2, 3, 4, 5],
        na_values=['...'],
        names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
        converters={1: convert_energy},
    ).set_index('Country')

This results in the Energy Supply column having the object dtype instead of float. Why is that the case?

energy = energy_df()
print(energy.dtypes)

Energy Supply                object
Energy Supply per Capita    float64
% Renewable                 float64

Answer 1

Answered by cs95

Let's remove the converters argument for a moment:

c = ['Energy Supply', 'Energy Supply per Capita', '% Renewable']
df = pd.read_excel("Energy Indicators.xls", 
                   skiprows=17, 
                   skip_footer=38, 
                   usecols=[2,3,4,5], 
                   na_values=['...'], 
                   names=c,
                   index_col=[0])

df.index.name = 'Country'

df.head()    
                Energy Supply  Energy Supply per Capita  % Renewable
Country                                                             
Afghanistan             321.0                      10.0    78.669280
Albania                 102.0                      35.0   100.000000
Algeria                1959.0                      51.0     0.551010
American Samoa            NaN                       NaN     0.641026
Andorra                   9.0                     121.0    88.695650

df.dtypes

Energy Supply               float64
Energy Supply per Capita    float64
% Renewable                 float64
dtype: object

Your data loads just fine without a converter. There's a trick to understanding why this happens.

By default, pandas will read in the column and try to "interpret" your data. By specifying your own converter, you override pandas' conversion for that column, so this inference does not happen.

pandas passes integer and string values to convert_energy, so isinstance(energy, float) never evaluates to True. Instead, the else branch runs and these values are returned as is, so your resultant column is a mixture of strings and integers. If you put a print(type(energy)) inside your function, this becomes obvious.

Since you have mixtures of types, the resultant type is object. However, if you do not use a converter, pandas will attempt to interpret your data, and will successfully parse it to numeric.
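A rough way to see this without the Excel file is to call the question's converter directly on a few hand-picked values. The `cells` list below is an assumption made for illustration: it mimics what this answer says pandas passes in (whole numbers as int, the placeholder as str) and is not read from the real spreadsheet.

```python
def convert_energy(energy):
    # Same logic as the converter in the question.
    if isinstance(energy, float):
        return energy * 1000000
    else:
        return energy


# Hypothetical cell values: whole numbers arrive as int and the
# missing-value placeholder arrives as str.
cells = [321, 102, 1959, "...", 9]

converted = [convert_energy(cell) for cell in cells]

# None of the inputs is a float, so every value falls through to the
# else branch and comes back unchanged and unscaled.
print(converted)

# The result holds two different Python types, which is why pandas
# can only store the column as object.
print({type(value).__name__ for value in converted})
```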

So, just doing -

df['Energy Supply'] *= 1000000

Would be more than enough.
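As a small sketch of why the vectorized multiplication is safe, the frame below is built by hand from the first rows of the df.head() output shown earlier rather than loaded from the Excel file. It assumes the "..." placeholder has already become NaN, which is what na_values=['...'] is meant to do.

```python
import numpy as np
import pandas as pd

# First rows of the 'Energy Supply' column as shown in df.head();
# NaN stands in for the "..." placeholder.
df = pd.DataFrame(
    {"Energy Supply": [321.0, 102.0, 1959.0, np.nan, 9.0]},
    index=pd.Index(
        ["Afghanistan", "Albania", "Algeria", "American Samoa", "Andorra"],
        name="Country",
    ),
)

# Scale the whole column at once; missing values stay NaN and the
# column keeps its float64 dtype.
df["Energy Supply"] *= 1000000

print(df["Energy Supply"].dtype)
print(df)
```

Because the scaling happens after pandas has inferred the dtype, no per-cell type checks are needed.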

Answer 2

Answered by Scott Boston

One of the values for energy in your Excel file is the string "...", and your converter function returns energy unchanged when it is a string.

Therefore you are getting a string returned along with your numbers, which then changes the dtype of your column to 'object'.

You could try something like this:

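import numpy as np
import pandas as pd

def convert_energy(energy):
    # Map the "..." placeholder to NaN so the column stays numeric.
    if energy == "...":
        return np.nan
    elif isinstance(energy, float):
        return float(energy*1000000)
    else:
        # Non-float values (e.g. ints) are cast to float without scaling.
        return float(energy)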

df = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls', 
                   skiprows=17, skip_footer=38, 
                   usecols=[2,3,4,5], na_values=['...'], 
                   names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
                   converters={1: convert_energy}).set_index('Country')

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 227 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
Energy Supply               222 non-null float64
Energy Supply per Capita    222 non-null float64
% Renewable                 227 non-null float64
dtypes: float64(3)
memory usage: 6.2+ KB
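Note that the converter above multiplies only values that arrive as float; anything else is cast to float without scaling. If every numeric value should be scaled regardless of how it arrives, a variant along these lines might work. The name `convert_energy_scaled` and the sample `cells` are hypothetical, and this is a sketch rather than the answer's own code.

```python
import math


def convert_energy_scaled(energy):
    # Hypothetical variant: treat the "..." placeholder as missing and
    # scale every other value, whether it arrives as int or float.
    if energy == "...":
        return math.nan
    return float(energy) * 1000000


# Hand-picked sample values mixing int, float, and the placeholder.
cells = [321, 102.5, "...", 9]
converted = [convert_energy_scaled(cell) for cell in cells]

# Every result is a float, so a column built from them can be float64.
print(converted)
```

Passed as converters={1: convert_energy_scaled}, this would return only floats for the column, though dropping the converter and multiplying afterwards remains the simpler route.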
