Why use pandas.assign rather than simply initialize new column?

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/48177914/

Time: 2020-09-14 05:02:28  Source: igfitidea

Why use pandas.assign rather than simply initialize new column?

python pandas

Asked by sacuL

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by simply initializing a new column 'on the fly'. Is there a reason why assign is better?

For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])

but the pandas.DataFrame.assign documentation recommends doing this:

df.assign(ln_A = lambda x: np.log(x.A))
# or 
newcol = np.log(df['A'])
df.assign(ln_A=newcol)

Both methods return the same dataframe. In fact, the first method (my 'on the fly' method) is significantly faster (0.20225788200332318 seconds for 1000 iterations) than the .assign method (0.3526602769998135 seconds for 1000 iterations).
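
A comparison like this can be reproduced roughly as follows. This is only a sketch that assumes the standard-library timeit module was used; the helper names add_in_place and add_with_assign are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})


def add_in_place():
    # Item assignment: writes the ln_A column into df itself.
    df['ln_A'] = np.log(df['A'])


def add_with_assign():
    # assign: builds and returns a new frame on every call.
    return df.assign(ln_A=lambda x: np.log(x.A))


# Total time in seconds for 1000 calls of each variant.
t_in_place = timeit.timeit(add_in_place, number=1000)
t_assign = timeit.timeit(add_with_assign, number=1000)

print(f"item assignment: {t_in_place:.4f} s for 1000 iterations")
print(f"assign:          {t_assign:.4f} s for 1000 iterations")
```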

So is there a reason I should stop using my old method in favour of df.assign?

Answered by donkopotamus

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.

In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.

In your particular case:

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign

>>> new_df = df.assign(A=1)

If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data while [...] does not.
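
The contrast can be checked with a short script. It is a minimal sketch that reuses the frame from the question; the variable names original_A and same_object are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
original_A = df['A'].tolist()

# assign hands back a new frame; df keeps its original column A.
new_df = df.assign(A=1)
same_object = new_df is df
print(same_object)
print(df['A'].tolist() == original_A)
print(new_df['A'].tolist())

# Item assignment overwrites column A inside df itself.
df['A'] = 1
print(df['A'].tolist())
```

After the last statement the original values of A are gone from df, which is exactly the case where [...] is the more natural choice.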

Answered by prosti

The premise of assign is that it returns:

A new DataFrame with the new columns in addition to all the existing columns.

And also you cannot do anything in-place to change the original dataframe.

The callable must not change input DataFrame (though pandas doesn't check it).

On the other hand, df['ln_A'] = np.log(df['A']) will do things in place.
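
One way to see the difference is to look at the column lists before and after each call. This is an illustrative sketch only; the names with_log, cols_after_assign, cols_of_result and cols_after_setitem are not from the pandas documentation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

# The callable receives the frame and only reads from it.
with_log = df.assign(ln_A=lambda x: np.log(x['A']))
cols_after_assign = list(df.columns)   # df itself gained no column
cols_of_result = list(with_log.columns)

# Item assignment adds the column to df itself.
df['ln_A'] = np.log(df['A'])
cols_after_setitem = list(df.columns)

print(cols_after_assign)
print(cols_of_result)
print(cols_after_setitem)
```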

So is there a reason I should stop using my old method in favour of df.assign?

I think you can try df.assign, but if you do memory-intensive stuff, it is better to stick with what you did before or to use operations with inplace=True.
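
As an example of an operation with inplace=True, here is a minimal sketch using DataFrame.rename. The column name 'noise' is made up for illustration, and whether inplace=True actually saves memory depends on the operation and the pandas version, so treat this as a demonstration of the calling convention only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

# With inplace=True the method changes df and returns None.
result = df.rename(columns={'B': 'noise'}, inplace=True)
print(result)
print(list(df.columns))

# Without inplace, a new frame comes back and df is left alone.
renamed = df.rename(columns={'noise': 'B'})
print(list(renamed.columns))
print(list(df.columns))
```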
