pandas: replace values in a DataFrame column based on a condition

Question

Asked by ozhogin

I have a seemingly easy task. I have a DataFrame with 2 columns: A and B. Wherever a value in B is larger than the corresponding value in A, I want to replace it with the value from A. I used to do this with df.B[df.B > df.A] = df.A, but after a recent pandas upgrade this chained assignment started raising a SettingWithCopyWarning. The official documentation recommends using .loc instead.

Okay, I said, and did it through df.loc[df.B > df.A, 'B'] = df.A and it all works fine, unless column B consists entirely of NaN values. Then something weird happens:

In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2 NaN
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A                    B
0  1 -9223372036854775808
1  2 -9223372036854775808
2  3 -9223372036854775808

Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:

In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2   4
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A   B
0  1 NaN
1  2   2
2  3 NaN

But if none of B's elements satisfy the condition, then all NaNs get replaced with -9223372036854775808:

In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2   1
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A                    B
0  1 -9223372036854775808
1  2                    1
2  3 -9223372036854775808

Is this a bug or a feature? How should I have done this replacement?

Thank you!

Answer 1

Answered by Jeff

This is a bug, fixed here.

Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:

df.loc[lhs, column] = rhs

where rhs could be: list, array, scalar, and lhs could be: slice, tuple, scalar, array

and a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated). For example say you don't set all of the elements on the lhs and it was integer, then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer then it needs to be coerced BACK to integer.
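
To make the dtype point concrete, here is a minimal sketch of a partial assignment through .loc. It is an illustration, not the code path from the fix: the column name "C" is made up for this example, and the exact dtype rules have changed between pandas versions. Only some rows receive integers, so the untouched row has to hold NaN and the column ends up as float.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})  # A is an integer column

# Assign integers to only the rows where A > 1, in a column "C"
# (hypothetical name) that does not exist yet.
df.loc[df["A"] > 1, "C"] = df["A"]

# Row 0 was never set, so it must hold NaN, which forces a float column
# even though every value on the right-hand side was an integer.
print(df["A"].dtype)
print(df["C"].dtype)
print(df)
```
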

In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (float -> int), which is what happens to the NaN values in B when the rhs is the integer column A.
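
As a side note on the number seen in the question: -9223372036854775808 is the smallest value an int64 can hold, which is consistent with NaN having gone through a raw float-to-integer cast (what such a cast produces is platform-dependent, so that part is an inference, not something the answer states). The sketch below checks the constant and shows that an explicit astype on a Series refuses the same conversion. The helper name nan_to_int64_is_refused is made up for this example.

```python
import numpy as np
import pandas as pd

# The value from the question is the minimum of the int64 range.
INT64_MIN = np.iinfo(np.int64).min
print(INT64_MIN)

def nan_to_int64_is_refused(values):
    """Return True if pandas refuses to cast the given floats to int64."""
    try:
        pd.Series(values, dtype="float64").astype("int64")
    except ValueError:
        # pandas raises a ValueError subclass when NaN or inf is present.
        return True
    return False

print(nan_to_int64_is_refused([np.nan, 1.0]))
print(nan_to_int64_is_refused([1.0, 2.0]))
```
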

Suffice to say this was a missing edge case.
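
For readers stuck on an affected version, one possible way to sidestep the .loc edge case is Series.mask, which builds a new column instead of assigning into part of an existing one. This is a suggestion layered on top of the answer, not part of the original fix, and the helper name cap_b_at_a is made up for this sketch.

```python
import numpy as np
import pandas as pd

def cap_b_at_a(df):
    """Return a copy of df where B is replaced by A wherever B > A."""
    out = df.copy()
    # Series.mask replaces values where the condition is True.
    # NaN > x evaluates to False, so NaN rows are left untouched.
    out["B"] = out["B"].mask(out["B"] > out["A"], out["A"])
    return out

all_nan = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, np.nan, np.nan]})
mixed = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, 4, np.nan]})

print(cap_b_at_a(all_nan))
print(cap_b_at_a(mixed))
```

Because the whole column is rebuilt from a float Series, B stays float and the NaN values never pass through an integer cast.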
