pandas - merging with missing values

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/23940181/

Date: 2020-09-13 22:05:52  Source: igfitidea

pandas - merging with missing values

python merge pandas missing-data

Asked by aensm

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:

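>>> import numpy as np
>>> import pandas as pd

>>> foo = pd.DataFrame([
...     ['a', 1, 2],
...     ['b', 4, 5],
...     ['c', 7, 8],
...     [np.nan, 10, 11]
... ], columns=['id', 'x', 'y'])

>>> bar = pd.DataFrame([
...     ['a', 3],
...     ['c', 9],
...     [np.nan, 12]
... ], columns=['id', 'z'])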

>>> pd.merge(foo, bar, how='left', on='id')
Out[428]: 
    id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11  12

[4 rows x 4 columns]

This is unlike any RDB I've seen. Normally, missing values are treated agnostically and are not merged together as if they were equal. This is especially problematic for datasets with sparse data (every NaN will be merged with every other NaN, resulting in a huge DataFrame!)
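To make the size concern concrete, here is a minimal sketch with made-up data (the frames `left` and `right` are illustrative and not from the question). It assumes a pandas version where `merge` matches null keys against each other, which is the behavior described above.

```python
import numpy as np
import pandas as pd

# Illustrative frames: three missing keys on the left, two on the right.
left = pd.DataFrame({"id": ["a", np.nan, np.nan, np.nan], "x": [1, 2, 3, 4]})
right = pd.DataFrame({"id": ["a", np.nan, np.nan], "z": [10, 20, 30]})

merged = pd.merge(left, right, how="left", on="id")

# Every NaN key on the left pairs with every NaN key on the right,
# so the NaN rows alone contribute 3 * 2 = 6 rows to the result.
nan_rows = int(merged["id"].isna().sum())
print(len(merged), nan_rows)
```

With n missing keys on one side and m on the other, the NaN pairings grow as n * m, which is why sparse keys can inflate the result so quickly.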

Is there a way to ignore missing values during a merge without first slicing them out?


Answered by meloncholy

You could exclude rows from bar (and indeed from foo, if you wanted) where id is null during the merge. Not sure it's what you're after, though, as they are sliced out.

(I've assumed from your left join that you're interested in retaining all of foo, but only want to merge the parts of bar that match and are not null.)

foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

Out[11]: 
    id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11 NaN
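The same idea can be wrapped in a small helper. This is only a sketch: `merge_ignoring_null_keys` is a hypothetical name, not a pandas function, and it assumes a single key column.

```python
import numpy as np
import pandas as pd

def merge_ignoring_null_keys(left, right, key, how="left"):
    """Merge left with right, skipping rows of right whose key is null."""
    right_non_null = right[right[key].notna()]
    return left.merge(right_non_null, how=how, on=key)

foo = pd.DataFrame(
    [["a", 1, 2], ["b", 4, 5], ["c", 7, 8], [np.nan, 10, 11]],
    columns=["id", "x", "y"],
)
bar = pd.DataFrame([["a", 3], ["c", 9], [np.nan, 12]], columns=["id", "z"])

# All four rows of foo survive; the NaN-keyed row no longer picks up z=12.
result = merge_ignoring_null_keys(foo, bar, key="id")
print(result)
```

Only the right-hand frame is filtered, so a left join still returns every row of foo, including the one whose id is missing.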

Answered by Liang

If you do not need the NaN rows from either the left or the right DF, use

pd.merge(foo.dropna(), bar.dropna(), how='left', on='id')


Otherwise, if you need to keep the NaN rows in the left DF, use

pd.merge(foo, bar.dropna(), how='left', on='id')
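One caveat worth checking before relying on a bare `dropna()`: it drops rows that have a NaN in any column, not only in the join key. The sketch below uses a made-up variant of bar (not the data from the question) to show the difference from `dropna(subset=['id'])`.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of bar: row "c" has a valid key but a missing z.
bar = pd.DataFrame(
    [["a", 3.0], ["c", np.nan], [np.nan, 12.0]],
    columns=["id", "z"],
)

any_null_dropped = bar.dropna()               # drops rows with a NaN in any column
key_null_dropped = bar.dropna(subset=["id"])  # drops only rows whose id is NaN

print(len(any_null_dropped), len(key_null_dropped))
```

If other columns can legitimately contain missing values, restricting the drop to the key column is probably the safer choice.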

Answered by yosemite_k

If you want to preserve the NaNs from both tables without slicing them out, you could use the outer join method as follows:

pd.merge(foo, bar.dropna(), how='outer', on='id')

It basically returns the union of foo and bar, except for the NaN rows that dropna() removes from bar.
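To see which table each row of the outer join came from, one option is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column. This is a sketch on the question's data; using `dropna(subset=['id'])` instead of a bare `dropna()` is a substitution made here so that only the key column is checked.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    [["a", 1, 2], ["b", 4, 5], ["c", 7, 8], [np.nan, 10, 11]],
    columns=["id", "x", "y"],
)
bar = pd.DataFrame([["a", 3], ["c", 9], [np.nan, 12]], columns=["id", "z"])

# _merge is "both", "left_only" or "right_only" for each output row.
outer = pd.merge(
    foo, bar.dropna(subset=["id"]), how="outer", on="id", indicator=True
)
print(outer[["id", "z", "_merge"]])
```

On this data the outer join yields the same four rows as the left join, because every non-null id in bar also appears in foo; the NaN-keyed row of bar (z = 12) is removed before the merge and does not appear in the result.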