Pandas: Why is the default column type for numeric values float?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38003406/

Pandas: Why is default column type for numeric float?

Tags: python, csv, pandas, nan, na

Asked by user4979733

I am using Pandas 0.18.1 with Python 2.7.x. I start by reading an empty dataframe. The columns all have type object, which is OK. When I assign one row of data, the type of the numeric columns changes to float64. I was expecting int or int64. Why does this happen?

Is there a way to set a global option that tells Pandas to treat numeric values as int by default unless the data contains a decimal point (.)? For example, for [0, 1.0, 2.], the first column would be int but the other two would be float64.

For example:

>>> df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)
>>> print df.dtypes
bbox_id_seqno    object
type             object
layer            object
ll_x             object
ll_y             object
ur_x             object
ur_y             object
polygon_count    object
dtype: object
>>> df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]
>>> print df.dtypes
bbox_id_seqno     object
type              object
layer             object
ll_x             float64
ll_y             float64
ur_x             float64
ur_y             float64
polygon_count    float64
dtype: object

Accepted answer by Stephan

It's not possible for Pandas to store NaN values in integer columns.

This makes float the obvious default choice for data storage: if the column were an integer type, Pandas would have to change the data type of the entire column as soon as a missing value arose. And missing values arise very often in practice.

As for why this is, it's a restriction inherited from NumPy. Basically, Pandas needs to set aside a particular bit pattern to represent NaN. This is straightforward for floating point numbers, where NaN is defined in the IEEE 754 standard. It's more awkward and less efficient to do this for a fixed-width integer.
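A minimal sketch of the effect described above, using only standard pandas calls. The variable names are illustrative and not part of the original answer. It shows that an integer Series is upcast to float64 as soon as a missing value appears.

```python
import pandas as pd

# A Series built from integers only gets an integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# The same data with one missing value is stored as float64,
# because NaN is a floating point value.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)  # float64

# Reindexing onto a label that does not exist introduces NaN
# and upcasts the whole column.
reindexed = ints.reindex([0, 1, 2, 3])
print(reindexed.dtype)  # float64
```

The integer values themselves are unchanged. Only the storage type of the column moves to float so that the gap can be represented.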

Update

Exciting news in pandas 0.24: IntegerArray is an experimental feature, but it might render my original answer obsolete. So if you're reading this on or after 27 Feb 2019, check out the docs for that feature.
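As a rough illustration of that feature, assuming pandas 0.24 or later: the nullable integer type is requested with the string "Int64" (capital I). The sample values below are made up for the sketch.

```python
import pandas as pd

# "Int64" (capital I) selects the nullable integer extension type,
# which is backed by IntegerArray instead of a plain NumPy array.
s = pd.Series([1, 2, None], dtype="Int64")

print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, False, True]

# The present values stay integers; the missing one is skipped by default.
print(s.sum())
```

The dtype has to be requested explicitly. Without the dtype argument, the same list would still be stored as float64.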

Answer by Batman

The reason almost certainly has to do with flexibility and speed. Just because Pandas has only seen an integer in that column so far doesn't mean that you're not going to try to add a float later, which would require Pandas to go back and change the type of the entire column. A float is the most robust/flexible numeric type.

There's no global way to override that behaviour (that I'm aware of), but you can use the astype method to modify an individual DataFrame.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
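One way to approximate the "int unless there is a decimal point" behaviour with astype is sketched below. The helper name ints_where_possible and the sample data are hypothetical, not part of pandas or of the original answer. It converts a float64 column to int64 only when every value is present and whole.

```python
import pandas as pd

def ints_where_possible(df):
    """Return a copy in which whole-valued float64 columns become int64.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = df.copy()
    for col in out.select_dtypes(include=["float64"]).columns:
        values = out[col]
        # Convert only if nothing is missing and every value is a whole number.
        if values.notna().all() and (values % 1 == 0).all():
            out[col] = values.astype("int64")
    return out

df = pd.DataFrame({"layer": ["c"], "ll_x": [1.0], "ll_y": [2.5]})
converted = ints_where_possible(df)
print(converted.dtypes)
```

Columns that contain NaN or fractional values are left as float64, so the conversion never discards information.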

Answer by Alexander

If you are reading an empty dataframe, you can explicitly cast the types for each column after reading it.

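import pandas as pd

# Target dtype for each column of the (empty) dataframe.
dtypes = {
    'bbox_id_seqno': object,
    'type': object,
    'layer': object,
    'll_x': int,
    'll_y': int,
    'ur_x': int,
    'ur_y': int,
    'polygon_count': int
}

df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)

# Cast each column explicitly. .items() works on both Python 2 and 3,
# whereas .iteritems() exists only on Python 2.
for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)

df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]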

>>> df.dtypes
bbox_id_seqno    object
type             object
layer            object
ll_x              int64
ll_y              int64
ur_x              int64
ur_y              int64
polygon_count     int64
dtype: object

If you don't know the column names in your empty dataframe, you can initially cast every column to int and then let Pandas sort it out.

for col in df:
    df[col] = df[col].astype(int)
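
A related option that the answers here do not mention is to pass the dtype argument of read_csv, so the columns get their types at read time instead of being cast afterwards. The sketch below assumes a reasonably recent pandas and uses an in-memory, header-only CSV as a stand-in for foo.csv. The column subset is chosen only for illustration.

```python
import io

import pandas as pd

# Header-only CSV standing in for the empty foo.csv from the question.
csv_text = "layer,ll_x,ll_y\n"

# Assumed target types; columns not listed here keep the default inference.
dtypes = {"ll_x": "int64", "ll_y": "int64"}

df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

Whether the integer types survive a later row assignment depends on the values assigned and on the pandas version, so it is worth checking df.dtypes again after adding data.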