Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute the content to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/29357379/

Date: 2020-09-13 23:08:38  Source: igfitidea

Pandas fill missing values in dataframe from another dataframe

Tags: python, pandas

Asked by user308827

I cannot find a pandas function (which I had seen before) to substitute the NaN's in a dataframe with values from another dataframe (assuming a common index which can be specified). Any help?

Answered by Jonathan Eunice

If you have two DataFrames of the same shape, then:

df[df.isnull()] = d2

Will do the trick.

[Image: visual representation]

Only locations where df.isnull() evaluates to True (highlighted in green) will be eligible for assignment.

In practice, the DataFrames aren't always the same size / shape, and transforming methods (especially .shift()) are useful.

Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.

Answered by Anaphory

As I just learned, there is a DataFrame.combine_first() method, which does precisely this, with the additional property that if your updating data frame d2 is bigger than your original df, the additional rows and columns are added as well.

df = df.combine_first(d2)
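
A minimal sketch of that behaviour, using two small made-up frames (the data and the column names x, y, z are illustrative assumptions, not from the original answer). Here d2 carries one extra row and one extra column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: d2 has an extra row ("c") and an extra column ("z").
df = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]}, index=["a", "b"])
d2 = pd.DataFrame(
    {"x": [10.0, 20.0, 30.0], "y": [40.0, 50.0, 60.0], "z": [7.0, 8.0, 9.0]},
    index=["a", "b", "c"],
)

# Values already present in df are kept; d2 only fills the gaps and
# contributes the row and column labels that df does not have.
combined = df.combine_first(d2)
print(combined)
```

The result is a new frame covering the union of both indexes and both sets of columns; df itself is left untouched unless you assign the result back to it.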

Answered by piRSquared

This should be as simple as

df.fillna(d2)
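
As a rough illustration (the frames below are made up for the example), fillna matches values by index and column label rather than by position, and it returns a new frame instead of changing df:

```python
import numpy as np
import pandas as pd

# Hypothetical data: d2 lists its rows in a different order than df.
df = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]}, index=["a", "b"])
d2 = pd.DataFrame({"x": [10.0, 20.0], "y": [30.0, 40.0]}, index=["b", "a"])

# Each NaN in df is filled from the cell of d2 with the same labels.
filled = df.fillna(d2)
print(filled)

# df still contains its NaNs; assign the result back to keep it.
print(df.isna().sum().sum())
```

Unlike combine_first, this keeps the shape of df: rows or columns that exist only in d2 are not added.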

Answered by John Prior

DataFrame.combine_first() answers this question exactly.

However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask().

A = B.mask(condition, A)

When condition is true, the values from A will be used, otherwise B's values will be used.

For example, you could solve the OP's original question with mask such that when an element from A is non-NaN, it is used; otherwise the corresponding element from B is used.
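
One way to write that, sketched with made-up frames A and B of the same shape (the data is an assumption for illustration), is to use A.notna() as the condition:

```python
import numpy as np
import pandas as pd

# Hypothetical data: A has gaps, B is complete.
A = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]})
B = pd.DataFrame({"x": [10.0, 20.0], "y": [30.0, 40.0]})

# Where A is non-NaN, take A's value; everywhere else keep B's value.
result = B.mask(A.notna(), A)
print(result)
```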

But using DataFrame.mask() you could replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worth mentioning (I needed it to solve my problem).
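
A small sketch of the arbitrary-criteria case, assuming "valid" means between 0 and 100 inclusive (the threshold, the name in_range, and the data are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data: some values of A fall outside the range [0, 100].
A = pd.DataFrame({"x": [5.0, -1.0, 250.0], "y": [-3.0, 50.0, 99.0]})
B = pd.DataFrame({"x": [0.0, 0.0, 100.0], "y": [0.0, 0.0, 0.0]})

# True where A's value is acceptable and should be kept.
in_range = (A >= 0) & (A <= 100)

# Keep A where it is in range; fall back to B's value otherwise.
result = B.mask(in_range, A)
print(result)
```

Any boolean frame with the same labels can serve as the condition, so the same pattern covers other validity rules.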

It's also important to note that B could be a numpy array instead of a DataFrame. DataFrame.combine_first() requires that B be a DataFrame, but DataFrame.mask() just requires that B is an NDFrame and its dimensions match A's dimensions.

Answered by Erfan

A dedicated method for this is DataFrame.update:

Quoted from the documentation:

Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.

Note that this method modifies your data in place, so it overwrites the DataFrame you call it on.

Example:

print(df1)
       A    B     C
aaa  NaN  1.0   NaN
bbb  NaN  NaN  10.0
ccc  3.0  NaN   6.0
ddd  NaN  NaN   NaN
eee  NaN  NaN   NaN

print(df2)
         A    B     C
index                
aaa    1.0  1.0   NaN
bbb    NaN  NaN  10.0
eee    NaN  1.0   NaN

# update df1 NaN where there are values in df2
df1.update(df2)
print(df1)
       A    B     C
aaa  1.0  1.0   NaN
bbb  NaN  NaN  10.0
ccc  3.0  NaN   6.0
ddd  NaN  NaN   NaN
eee  NaN  1.0   NaN

Notice the updated NaN values at the intersections (aaa, A) and (eee, B).
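
One caveat worth sketching: with its default overwrite=True, update also replaces existing non-NaN values wherever the other frame has a non-NA value. To fill only the gaps, pass overwrite=False. The frames below are made up for illustration and are not the ones from the example above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: df2 disagrees with df1 at (ccc, A) and (aaa, B).
df1 = pd.DataFrame({"A": [np.nan, 3.0], "B": [1.0, np.nan]}, index=["aaa", "ccc"])
df2 = pd.DataFrame({"A": [1.0, 9.0], "B": [5.0, np.nan]}, index=["aaa", "ccc"])

# overwrite=False: only the NaN cells of the caller are filled.
only_gaps = df1.copy()
only_gaps.update(df2, overwrite=False)
print(only_gaps)

# Default (overwrite=True): every non-NA value in df2 replaces the caller's value.
overwritten = df1.copy()
overwritten.update(df2)
print(overwritten)
```

Working on a copy, as here, is one way to keep the original frame around, since update returns nothing and changes its caller.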