Original question: http://stackoverflow.com/questions/26620647/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Replace values in a dataframe column based on condition
Asked by ozhogin
I have a seemingly easy task: a DataFrame with two columns, A and B. Where a value in B is larger than the corresponding value in A, I want to replace it with the value from A. I used to do this with df.B[df.B > df.A] = df.A, but after a recent pandas upgrade this chained assignment started raising a SettingWithCopyWarning. The official documentation recommends using .loc instead.
Okay, I said, and did it through df.loc[df.B > df.A, 'B'] = df.A, and it all works fine unless column B has all values of NaN. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
   A   B
0  1 NaN
1  2 NaN
2  3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
   A                    B
0  1 -9223372036854775808
1  2 -9223372036854775808
2  3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
   A   B
0  1 NaN
1  2   4
2  3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
   A   B
0  1 NaN
1  2   2
2  3 NaN
But if none of B's elements satisfies the condition, then all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
   A   B
0  1 NaN
1  2   1
2  3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
   A                    B
0  1 -9223372036854775808
1  2                    1
2  3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
Answered by Jeff
This is a bug, and it has been fixed (the original answer linked to the fix).
Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be: list, array, scalar; and lhs could be: slice, tuple, scalar, array;
and a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated). For example say you don't set all of the elements on the lhs and it was integer, then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer then it needs to be coerced BACK to integer.
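The integer/float round trip described above can be seen without involving loc at all. The following is only a minimal sketch of the dtype behaviour, not the internal pandas logic; the variable names are illustrative, and reindex is used here simply as a convenient way to introduce a missing value.

```python
import pandas as pd

# An int64 column has no way to store NaN.
ints = pd.Series([1, 2, 3])
print(ints.dtype)  # int64

# Introducing a missing value (label 3 has no data) forces a coercion to float64.
with_gap = ints.reindex([0, 1, 2, 3])
print(with_gap.dtype)  # float64

# Once every element holds a whole number again, casting back to integer is safe.
filled = with_gap.fillna(0).astype("int64")
print(filled.tolist())  # [1, 2, 3, 0]
```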
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (int -> float).
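As a side note on the strange number in the question: -9223372036854775808 is exactly the smallest value an int64 can hold, which suggests the NaN entries were pushed through an integer cast. The sketch below only checks that arithmetic and, as an assumption about reasonably recent pandas versions, shows that an explicit cast of NaN to int64 is rejected rather than performed silently.

```python
import numpy as np
import pandas as pd

# The value from the question matches the int64 minimum, -(2 ** 63).
int64_min = np.iinfo(np.int64).min
print(int64_min)  # -9223372036854775808

# Assumption: recent pandas versions refuse to cast NaN to an integer dtype.
try:
    pd.Series([np.nan]).astype("int64")
    cast_rejected = False
except ValueError:
    cast_rejected = True
print(cast_rejected)
```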
Suffice it to say, this was a missing edge case.
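For readers who want the replacement itself without going through setitem, one possible alternative (not part of the original answer) is Series.mask, which returns a new column instead of assigning into an existing one. This is a sketch using the same A and B columns as the question; on a pandas version that includes the fix, the loc form from the question should behave the same way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, 4, np.nan]})

# Where B > A, take the value from A; elsewhere (including NaN rows) keep B.
df["B"] = df["B"].mask(df["B"] > df["A"], df["A"])
print(df["B"].tolist())  # [nan, 2.0, nan]

# The all-NaN case from the question: no row satisfies the condition.
all_nan = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, np.nan, np.nan]})
all_nan["B"] = all_nan["B"].mask(all_nan["B"] > all_nan["A"], all_nan["A"])
print(all_nan["B"].dtype)  # float64
```

Comparisons against NaN evaluate to False, so the NaN rows never match the condition and are left as they are.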

