Pandas - is inplace = True considered harmful or not?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original address. StackOverflow original: http://stackoverflow.com/questions/45570984/

Time: 2020-09-14 04:12:29  Source: igfitidea

Tags: python, pandas

Asked by OmerB

This has been discussed before, but with conflicting answers:

What I'm wondering is:

  • Why is inplace = False the default behavior?
  • When is it good to change it? (Well, I'm allowed to change it, so I guess there's a reason.)
  • Is this a safety issue? That is, can an operation fail or misbehave due to inplace = True?
  • Can I know in advance whether a certain inplace = True operation will "really" be carried out in place?

My take so far:

  • Many Pandas operations have an inplace parameter, always defaulting to False, meaning the original DataFrame is untouched and the operation returns a new DF.
  • When setting inplace = True, the operation might work on the original DF, but it might still work on a copy behind the scenes and just reassign the reference when done.
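
The two bullets above can be illustrated with a small sketch. The frame contents are made up for illustration, and sort_values is used only as one example of a method that accepts inplace; whether a copy is made internally is not visible from this code.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Default (inplace=False): a new DataFrame comes back, df keeps its order.
sorted_df = df.sort_values("a")
original_order = df["a"].tolist()    # [3, 2, 1]
sorted_order = sorted_df["a"].tolist()  # [1, 2, 3]

# inplace=True: the object bound to df is updated. The call is documented
# to return None in this mode, so the return value is not used here.
df.sort_values("a", inplace=True)
order_after_inplace = df["a"].tolist()  # [1, 2, 3]
```

Note that the second call changes what every other name referring to the same DataFrame sees, which is the concern raised in the answers.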

pros of inplace = False:

  • Allows chained/functional syntax: df.dropna().rename().sum()... which is nice, and offers a chance for lazy evaluation or a more efficient re-ordering (though I don't think Pandas is doing this).
  • When using inplace = True on an object which is potentially a slice/view of an underlying DF, Pandas has to do a SettingWithCopy check, which is expensive. inplace = False avoids this.
  • Consistent and predictable behavior behind the scenes.
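
As a rough sketch of the chained style mentioned in the first bullet, assuming a small made-up frame and a hypothetical column mapping:

```python
import pandas as pd

# Illustrative data with one missing value.
df = pd.DataFrame({"a": [3.0, None, 1.0], "b": [1.0, 2.0, 3.0]})

# Each step returns a new object, so the calls can be chained.
totals = (
    df.dropna()                              # drops the row with the missing value
      .rename(columns={"a": "A", "b": "B"})  # hypothetical new column names
      .sum()                                 # column-wise totals
)

# The original frame is unchanged: same shape, same column names.
original_shape = df.shape
original_columns = list(df.columns)
```

Here totals holds 4.0 for both "A" and "B", while df still has three rows and its original column names.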

pros of inplace = True:

  • Can be both faster and less memory hogging (the first link shows reset_index() runs twice as fast and uses half the peak memory!).

So, putting the copy-vs-view issue aside, it seems more performant to always use inplace = True, unless specifically writing a chained statement. But that's not the default Pandas opted for, so what am I missing?

Answered by Jon Clements

If inplace=True were the default, then the DataFrame would be mutated for all names that currently reference it.

A simple example, say I have a df:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

Now it's very important that DataFrame retains that row order - let's say it's from a data source where insertion order is key for instance.

However, I now need to do some operations which require a different sort order:

def f(frame):
    df = frame.sort_values('a')
    # if we did frame.sort_values('a', inplace=True) here without
    # making it explicit - our caller is going to wonder what happened
    # do something
    return df

That's fine - my original df remains the same. However, if inplace=True were the default, then my original df would now be sorted as a side effect of f(), and I'd have to trust whoever wrote f() to remember not to do something in place that I'm not expecting, instead of them deliberately choosing to do something in place... So it's better that anything that can mutate an object in place does so explicitly, to at least make it more obvious what's happened and why.
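
A minimal sketch of the difference, using the same df as above. The names f_copy and f_inplace are hypothetical and exist only to contrast the two behaviors:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

def f_copy(frame):
    # Returns a sorted copy; the caller's frame keeps its row order.
    return frame.sort_values("a")

def f_inplace(frame):
    # Hypothetical variant: sorts the caller's object as a side effect.
    frame.sort_values("a", inplace=True)
    return frame

f_copy(df)
order_after_copy = df["a"].tolist()     # still [3, 2, 1]

f_inplace(df)
order_after_inplace = df["a"].tolist()  # now [1, 2, 3]
```

The caller passed the same object in both cases; only the second function changed it behind the caller's back.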

Even with basic Python builtin mutables, you can observe this:

data = [3, 2, 1]

def f(lst):
    lst.sort()
    # I meant lst = sorted(lst)
    for item in lst:
        print(item)

f(data)

for item in data:
    print(item)

# huh!? What happened to my data - why's it not 3, 2, 1?     
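
For comparison, here is a sketch of the version hinted at by the "I meant lst = sorted(lst)" comment. It assumes the function should work on its own sorted copy and return it; the return value is an addition for illustration.

```python
data = [3, 2, 1]

def f(lst):
    # sorted() builds a new list; the caller's list is left alone.
    ordered = sorted(lst)
    return ordered

ordered = f(data)
# data is still [3, 2, 1]; ordered is [1, 2, 3]
```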

Answered by cs95

Don't use inplace=True!

This GitHub issue is proposing that the inplace argument be deprecated API-wide sometime in the near future. In a nutshell, here's everything wrong with the inplace argument:

  • inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
  • inplace does not work with method chaining
  • inplace is a common pitfall for beginners, so removing this option will simplify the API

Performance
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In general, there are no performance benefits to using inplace=True. Most in-place and out-of-place versions of a method create a copy of the data anyway, with the in-place version automatically assigning the copy back. The copy cannot be avoided.
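
The sketch below only shows that the two spellings produce the same result for reset_index; it does not measure time or memory, which can vary by pandas version and data. The frame and its index labels are made up for illustration.

```python
import pandas as pd

# Illustrative frame with a non-default index.
df = pd.DataFrame({"a": [3, 2, 1]}, index=["r1", "r2", "r3"])

# Out-of-place: the result is bound to a new name.
out_of_place = df.reset_index()

# In-place: applied to a separate copy so df itself stays intact here.
in_place = df.copy()
in_place.reset_index(inplace=True)

# Both routes end with the same data and the same columns.
same = out_of_place.equals(in_place)
columns = list(out_of_place.columns)
```

If performance matters for a specific workload, timing both forms on real data (for example with the standard timeit module) is more reliable than assuming either one is faster.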

Method Chaining
inplace=True also hinders method chaining. Contrast the working of

result = df.some_function1().reset_index().some_function2()

As opposed to

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()

Unintended Pitfalls
One final caveat to keep in mind is that calling a method with inplace=True can trigger the SettingWithCopyWarning:

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame

Which can cause unexpected behavior.
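
One way to sidestep the ambiguity, sketched here under the assumption that df2 is meant to be an independent frame, is to take an explicit copy and assign the result back rather than passing inplace=True:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Take an explicit copy so df2 is clearly independent of df.
df2 = df[df["a"] > 1].copy()

# Assign the replaced column back instead of using inplace=True on it.
df2["b"] = df2["b"].replace({"x": "abc"})

df2_b = df2["b"].tolist()  # ['abc', 'y']
df_b = df["b"].tolist()    # ['x', 'y', 'z'], original untouched
```

Plain assignment makes it explicit which object is being changed, so there is no question of whether the edit landed on a temporary copy.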
