Preserve DataFrame column data types after an outer merge in pandas

Note: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/36743563/


Preserve Dataframe column data type after outer merge

python, pandas

Asked by Jeff

When you merge two indexed dataframes on certain values using 'outer' merge, python/pandas automatically adds Null (NaN) values to the fields it could not match on. This is normal behaviour, but it changes the data type and you have to restate what data types the columns should have.


fillna() or dropna() do not seem to preserve data types immediately after the merge. Do I need a table structure in place?


Typically I would run numpy's np.where(field.isnull() etc), but that means running it for all columns.


Is there a workaround to this?

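The dtype change described in the question can be reproduced with a minimal example (the column and key names here are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3], 'a': [10, 20, 30]})
df2 = pd.DataFrame({'key': [1, 2], 'b': [100, 200]})

# the outer merge introduces NaN for the unmatched key, so 'b'
# is upcast from int64 to float64 to hold the missing value
merged = df.merge(df2, on='key', how='outer')
print(merged.dtypes)
# key      int64
# a        int64
# b      float64
# dtype: object
```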

Answered by ALollz

This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type.


Because of this, I'd recommend using the new Int64 type for your integer or bool columns, which is capable of storing NaN. Booleans need to be converted to 1 or 0 instead of True or False, then to Int64. You should do this for all int and bool columns before the join, but I'll just illustrate on df2, whose columns get NaN rows after the join:


import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})

df2 = df2.astype('int').astype('Int64')
df2.dtypes
#d    Int64
#e    Int64
#dtype: object

df.join(df2)
#   a  b  c     d     e
#0  1  1  0     1     1
#1  1  2  1     2     0
#2  1  1  2  <NA>  <NA>
#3  1  2  3  <NA>  <NA>
#4  1  1  4  <NA>  <NA>
#5  1  2  5  <NA>  <NA>

#a    int64
#b    int64
#c    int64
#d    Int64
#e    Int64
#dtype: object


The benefit here is that nothing will be upcast until it needs to be. For instance, in the other solutions, if you do .fillna(-1.72) you may get an unwanted answer, because the subsequent cast to int coerces the fill value to -1 (int(-1.72) == -1). This could be useful in some situations, but dangerous in others.


With Int64, the fill value remains true to what you specify, and the column is only upcast if you fill with a non-int. It also will not throw an error if you do something like .fillna('Missing'), as it never tries to typecast a string to an int.

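The truncation pitfall and the Int64 behaviour described above can be checked directly (a small sketch; the -1.72 fill value is only for illustration):

```python
import pandas as pd

# a float64 column, as produced when a join introduces NaN
s = pd.Series([1, 2, None])

# restoring int64 via astype silently truncates the fill value
print(s.fillna(-1.72).astype('int64').tolist())
# [1, 2, -1]

# with the nullable Int64 dtype, an integer fill keeps both the
# exact value you specified and the Int64 dtype
s2 = pd.Series([1, 2, None], dtype='Int64')
print(s2.fillna(-1).dtype)
# Int64
```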

Answered by hume

I don't think there's any really elegant/efficient way to do it. You could do it by tracking the original datatypes and then casting the columns after the merge, like this:


import pandas as pd

# all types are originally ints
df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})

# track the original dtypes
orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())

# join the dataframe
joined = df.join(df2, how='outer')

# columns with nans are now float dtype
print(joined.dtypes)

# replace nans with suitable int value
joined.fillna(-1, inplace=True)

# re-cast the columns as their original dtype
joined_orig_types = joined.apply(lambda x: x.astype(orig[x.name]))

print(joined_orig_types.dtypes)

Answered by anky

Or you can just concat/append the dtypes of both DataFrames and apply astype():


joined = df.join(df2, how='outer').fillna(-1).astype(pd.concat([df.dtypes, df2.dtypes]))
#or joined = df.join(df2, how='outer').fillna(-1).astype(df.dtypes.append(df2.dtypes))
#(Series.append was removed in pandas 2.0, so prefer the pd.concat form)
print(joined)

   a  b  c  e  d
0  1  1  0  1  1
1  1  2  1  1  2
2  1  1  2 -1 -1
3  1  2  3 -1 -1
4  1  1  4 -1 -1
5  1  2  5 -1 -1
6  1  1  6 -1 -1
7  1  2  7 -1 -1
8  1  1  8 -1 -1
9  1  2  9 -1 -1

Answered by totalhack

As of pandas 1.0.0 you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN. It preserves the bool values as well, unlike the first answer above, which converts them to integers.


...

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})
df = df.convert_dtypes()
df2 = df2.convert_dtypes()
print(df.join(df2))

#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
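For reference, the dtypes convert_dtypes produces on df2 can be inspected directly (a quick sketch, assuming pandas >= 1.0):

```python
import pandas as pd

df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]})

# integer columns become the nullable Int64 dtype, bools become the
# nullable boolean dtype; both can hold pd.NA after a join
print(df2.convert_dtypes().dtypes)
# d      Int64
# e    boolean
# dtype: object
```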