Preserve DataFrame column data types after an outer merge in pandas

Note: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/36743563/


Preserve Dataframe column data type after outer merge

python, pandas

Asked by Jeff

When you merge two indexed dataframes on certain values using 'outer' merge, python/pandas automatically adds Null (NaN) values to the fields it could not match on. This is normal behaviour, but it changes the data type and you have to restate what data types the columns should have.


fillna() or dropna() do not seem to preserve data types immediately after the merge. Do I need a table structure in place?


Typically I would run numpy's np.where(field.isnull() etc), but that means running it for all columns.


Is there a workaround to this?

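The dtype change described in the question can be reproduced with a minimal example (the column and key names here are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3], 'a': [10, 20, 30]})
df2 = pd.DataFrame({'key': [1, 2], 'b': [100, 200]})

# the outer merge introduces NaN for the unmatched key, so 'b'
# is upcast from int64 to float64 to hold the missing value
merged = df.merge(df2, on='key', how='outer')
print(merged.dtypes)
# key      int64
# a        int64
# b      float64
# dtype: object
```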

Answered by ALollz

This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type.


Because of this, I'd recommend using the new Int64 type for your integer or bool columns, which is capable of storing NaN. Booleans need to be converted to 1 or 0 instead of True or False, then to Int64. You should do this for all int and bool columns before the join, but I'll just illustrate on df2, whose columns get NaN rows after the join:


import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})

df2 = df2.astype('int').astype('Int64')
df2.dtypes
#d    Int64
#e    Int64
#dtype: object

df.join(df2)
#   a  b  c     d     e
#0  1  1  0     1     1
#1  1  2  1     2     0
#2  1  1  2  <NA>  <NA>
#3  1  2  3  <NA>  <NA>
#4  1  1  4  <NA>  <NA>
#5  1  2  5  <NA>  <NA>

#a    int64
#b    int64
#c    int64
#d    Int64
#e    Int64
#dtype: object


The benefit here is that nothing will be upcast until it needs to be. For instance, in the other solutions, if you do .fillna(-1.72) you may get an unwanted answer, because the subsequent cast to int coerces the fill value to -1 (int(-1.72) == -1). This could be useful in some situations, but dangerous in others.


With Int64, the fill value remains true to what you specify, and the column is only upcast if you fill with a non-int. It also will not throw an error if you do something like .fillna('Missing'), as it never tries to typecast a string to an int.

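The truncation pitfall and the Int64 behaviour described above can be checked directly (a small sketch; the -1.72 fill value is only for illustration):

```python
import pandas as pd

# a float64 column, as produced when a join introduces NaN
s = pd.Series([1, 2, None])

# restoring int64 via astype silently truncates the fill value
print(s.fillna(-1.72).astype('int64').tolist())
# [1, 2, -1]

# with the nullable Int64 dtype, an integer fill keeps both the
# exact value you specified and the Int64 dtype
s2 = pd.Series([1, 2, None], dtype='Int64')
print(s2.fillna(-1).dtype)
# Int64
```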

Answered by hume

I don't think there's any really elegant/efficient way to do it. You could do it by tracking the original datatypes and then casting the columns after the merge, like this:


import pandas as pd

# all types are originally ints
df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})

# track the original dtypes
orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())

# join the dataframe
joined = df.join(df2, how='outer')

# columns with nans are now float dtype
print(joined.dtypes)

# replace nans with suitable int value
joined.fillna(-1, inplace=True)

# re-cast the columns as their original dtype
joined_orig_types = joined.apply(lambda x: x.astype(orig[x.name]))

print(joined_orig_types.dtypes)

Answered by anky

Or you can just concat/append the dtypes of both DataFrames and apply astype():


joined = df.join(df2, how='outer').fillna(-1).astype(pd.concat([df.dtypes, df2.dtypes]))
#or joined = df.join(df2, how='outer').fillna(-1).astype(df.dtypes.append(df2.dtypes))
#(Series.append was removed in pandas 2.0, so prefer the pd.concat form)
print(joined)

   a  b  c  e  d
0  1  1  0  1  1
1  1  2  1  1  2
2  1  1  2 -1 -1
3  1  2  3 -1 -1
4  1  1  4 -1 -1
5  1  2  5 -1 -1
6  1  1  6 -1 -1
7  1  2  7 -1 -1
8  1  1  8 -1 -1
9  1  2  9 -1 -1

Answered by totalhack

As of pandas 1.0.0 you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN. It preserves the bool values as well, unlike the first answer above, which converts them to integers.


...

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})
df = df.convert_dtypes()
df2 = df2.convert_dtypes()
print(df.join(df2))

#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
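For reference, the dtypes convert_dtypes produces on df2 can be inspected directly (a quick sketch, assuming pandas >= 1.0):

```python
import pandas as pd

df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]})

# integer columns become the nullable Int64 dtype, bools become the
# nullable boolean dtype; both can hold pd.NA after a join
print(df2.convert_dtypes().dtypes)
# d      Int64
# e    boolean
# dtype: object
```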