Why use pandas.assign rather than simply initialize a new column?

Question

Asked by sacuL

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by just initializing a new column 'on the fly'. Is there a reason why assign is better?

For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:

import numpy as np
from pandas import DataFrame
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])

but the pandas.DataFrame.assign documentation recommends doing this:

df.assign(ln_A = lambda x: np.log(x.A))
# or 
newcol = np.log(df['A'])
df.assign(ln_A=newcol)

Both methods return the same dataframe. In fact, the first method (my 'on the fly' method) is significantly faster (0.20225788200332318 seconds for 1000 iterations) than the .assign method (0.3526602769998135 seconds for 1000 iterations).
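
The question does not show how these timings were taken, so the following is only a sketch of one way to reproduce the comparison with the standard timeit module. The helper names add_in_place and add_with_assign and the scratch frame are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# Hypothetical scratch frame so the in-place timing does not alter df.
scratch = df.copy()


def add_in_place():
    # Item assignment writes the column onto the existing frame.
    scratch["ln_A"] = np.log(scratch["A"])


def add_with_assign():
    # assign builds and returns a new frame on every call.
    return df.assign(ln_A=lambda x: np.log(x.A))


in_place_seconds = timeit.timeit(add_in_place, number=1000)
assign_seconds = timeit.timeit(add_with_assign, number=1000)
print(f"in place: {in_place_seconds:.4f} s, assign: {assign_seconds:.4f} s")
```

Only the ratio between the two figures is meaningful here; the script makes no claim about which one wins on a given setup.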

So is there a reason I should stop using my old method in favour of df.assign?

Answer 1

Answered by donkopotamus

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.

In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.

In your particular case:

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:

>>> new_df = df.assign(A=1)

If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data while [...] does not.
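
A minimal sketch of that distinction, using the same frame as above. The variable a_after_assign is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# assign returns a new frame; df keeps its original A values.
new_df = df.assign(A=1)
a_after_assign = df["A"].tolist()
print(new_df is df)          # False
print(a_after_assign)        # still 1 through 10
print(new_df["A"].tolist())  # ten 1s

# Item assignment overwrites the column on df itself.
df["A"] = 1
print(df["A"].tolist())      # ten 1s, the original values are gone
```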

Answer 2

Answered by prosti

The premise of assign is that it returns:

A new DataFrame with the new columns in addition to all the existing columns.

It also cannot be used to change the original dataframe in place.

The callable must not change input DataFrame (though pandas doesn't check it).
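
As a sketch of a callable that respects this rule, the lambda below only reads from the frame it is given and returns a new Series. The names via_callable and via_series are illustrative; the two forms are the ones shown in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# Callable form: pandas calls the lambda with the frame and uses
# the returned Series as the new column. The lambda only reads x.
via_callable = df.assign(ln_A=lambda x: np.log(x["A"]))

# Precomputed form: the Series is built first and passed in directly.
via_series = df.assign(ln_A=np.log(df["A"]))

print(via_callable.equals(via_series))  # True
print(list(df.columns))                 # ['A', 'B']
```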

On the other hand, df['ln_A'] = np.log(df['A']) modifies df in place.

So is there a reason I should stop using my old method in favour of df.assign?

I think you can try df.assign, but if you do memory-intensive work, it is better to stick with what you did before, or to use operations with inplace=True.
