Why use pandas.assign rather than simply initialize new column?
Original question: http://stackoverflow.com/questions/48177914/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors on Stack Overflow.
Asked by sacuL
I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by just initializing a new column 'on the fly'. Is there a reason why assign is better?
For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
import numpy as np
from pandas import DataFrame
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
but the pandas.DataFrame.assign documentation recommends doing this:
df.assign(ln_A = lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
Both methods return the same dataframe. In fact, the first method (my 'on the fly' method) is significantly faster (0.20225788200332318 seconds for 1000 iterations) than the .assign method (0.3526602769998135 seconds for 1000 iterations).
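A timing comparison like the one described could be sketched as follows. This is a hypothetical reconstruction, not the asker's actual benchmark: the helper names (add_in_place, add_with_assign) and the use of timeit are assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})
df_in_place = base.copy()  # this frame is modified repeatedly

def add_in_place():
    # Writes (or overwrites) the column on df_in_place itself.
    df_in_place["ln_A"] = np.log(df_in_place["A"])

def add_with_assign():
    # Leaves base untouched and returns a new frame on every call.
    return base.assign(ln_A=lambda x: np.log(x.A))

t_in_place = timeit.timeit(add_in_place, number=1000)
t_assign = timeit.timeit(add_with_assign, number=1000)
print(f"in place: {t_in_place:.4f} s, assign: {t_assign:.4f} s")
```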
So is there a reason I should stop using my old method in favour of df.assign?
Answered by donkopotamus
The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:
>>> new_df = df.assign(A=1)
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data while [...] does not.
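The distinction can be checked directly. The following is a minimal sketch; the variable original_A is introduced here only to snapshot the column before the in-place write.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

new_df = df.assign(A=1)         # new frame; df keeps its original A
original_A = df["A"].tolist()   # snapshot taken after the assign call
print(original_A)               # the values 1 through 10 are still there
print(new_df["A"].tolist())     # every entry is 1

df["A"] = 1                     # overwrites A in df itself
print(df["A"].tolist())         # df now holds all ones as well
```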
Answered by prosti
The premise on assign is that it returns:
A new DataFrame with the new columns in addition to all the existing columns.
And also you cannot do anything in-place to change the original dataframe.
The callable must not change input DataFrame (though pandas doesn't check it).
On the other hand, df['ln_A'] = np.log(df['A']) will do things in place.
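As a small illustration of the points above, the sketch below passes assign a callable that only reads its argument and returns a new Series, compares it with the precomputed form, and then contrasts both with the in-place statement. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# Callable form: pandas calls the lambda with a DataFrame and uses the result.
via_callable = df.assign(ln_A=lambda x: np.log(x["A"]))

# Precomputed form: the Series is built first, then attached.
via_series = df.assign(ln_A=np.log(df["A"]))

same_result = via_callable.equals(via_series)
columns_after_assign = list(df.columns)  # df itself has not gained a column
print(same_result, columns_after_assign)

# The in-place statement, by contrast, adds the column to df itself.
df["ln_A"] = np.log(df["A"])
print(list(df.columns))
```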
So is there a reason I should stop using my old method in favour of df.assign?
I think you can try df.assign, but if you do memory-intensive stuff, it is better to work the way you did before, or to use operations with inplace=True.
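For reference, the in-place style mentioned here might look like the sketch below. The choice of drop and rename is an assumption made for illustration; whether inplace=True actually reduces memory use depends on the operation and the pandas version, so this only shows the style, not a measured saving.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

df["ln_A"] = np.log(df["A"])           # the asker's original style
df.drop(columns=["B"], inplace=True)   # inplace=True changes df itself
df.rename(columns={"ln_A": "log_A"}, inplace=True)
print(list(df.columns))
```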