Python: Why should I make a copy of a DataFrame in pandas?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/27673231/
Why should I make a copy of a data frame in pandas?
Asked by Elizabeth Susan Joseph
When selecting a sub dataframe from a parent dataframe, I noticed that some programmers make a copy of the data frame using the .copy() method.
Why are they making a copy of the data frame? What will happen if I don't make a copy?
Accepted answer by cgold
This expands on Paul's answer. In pandas, some indexing operations (slicing, for example) can return a view on the initial DataFrame rather than an independent object. Changing the subset can therefore change the initial DataFrame, so use a copy if you want to be sure the initial DataFrame stays unchanged. Consider the following code:
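import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
df_sub = df[0:1]  # slicing may return a view on df
df_sub.x = -1     # so this assignment can write through to df
print(df)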
On pandas versions where this slice is a view on df, you'll get:
   x
0 -1
1  2
In contrast, the following leaves df unchanged:
df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1
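As a quick check of the point above, here is a minimal, self-contained sketch of the .copy() variant. The variable names original_values and copied_values are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

# Take the slice and immediately detach it from df with .copy()
df_sub_copy = df[0:1].copy()
df_sub_copy["x"] = -1  # only the copy is modified

# Illustrative names: collect the values so they are easy to inspect
original_values = df["x"].tolist()
copied_values = df_sub_copy["x"].tolist()

print(original_values)  # [1, 2]
print(copied_values)    # [-1]
```

Because the copy owns its own data, the result does not depend on whether the slice itself would have been a view.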
Answer by sparrow
Because if you don't make a copy, the data can still be modified elsewhere: assigning the DataFrame to a different name does not create a new object.
For example:
df2 = df
func1(df2)
func2(df)
func1 can modify df by modifying df2, so to avoid that:
df2 = df.copy()
func1(df2)
func2(df)
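The snippets above leave func1 undefined. A runnable sketch, assuming a hypothetical func1 that mutates its argument in place, might look like this:

```python
import pandas as pd


def func1(frame):
    # Hypothetical function (not from the original answer):
    # it modifies the frame it receives in place.
    frame.loc[0, "x"] = 100


# Case 1: plain assignment only creates a second name for the same object
df = pd.DataFrame({"x": [1, 2]})
df2 = df
same_object = df2 is df
func1(df2)
value_after_alias = int(df.loc[0, "x"])  # df was changed through df2

# Case 2: .copy() gives func1 an independent object to work on
df = pd.DataFrame({"x": [1, 2]})
df2 = df.copy()
func1(df2)
value_after_copy = int(df.loc[0, "x"])  # df is untouched

print(same_object, value_after_alias, value_after_copy)
```

In the first case df and df2 are literally the same object, so any in-place change is visible under both names.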
Answer by Gusev Slava
It is worth mentioning that whether a copy or a view is returned depends on the kind of indexing.
The pandas documentation said (the .ix indexer used in the quote has since been removed from pandas; .loc and .iloc replace it):
Returning a view versus a copy
The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
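One way to check the "copy" half of the quoted rule is to ask NumPy whether two results share memory. This is only a sketch: the names mask_subset, label_subset, shares_mask and shares_labels are illustrative, and it deliberately checks only boolean-mask and label-list indexing, which return copies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Indexing with a boolean vector
mask_subset = df[df["x"] > 1]
# Indexing with an array (list) of labels
label_subset = df.loc[[0, 2]]

# np.shares_memory reports whether two arrays overlap in memory
shares_mask = np.shares_memory(df["x"].to_numpy(), mask_subset["x"].to_numpy())
shares_labels = np.shares_memory(df["x"].to_numpy(), label_subset["x"].to_numpy())

print(shares_mask, shares_labels)  # False False
```

Neither result shares memory with df, which is consistent with both being copies.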
Answer by bojax
In general it is safer to work on copies than on original data frames, except when you know that you won't need the original anymore and want to proceed with the manipulated version. Normally, you would still have some use for the original data frame, for example to compare it with the manipulated version. Therefore, most people work on copies and merge at the end.
Answer by Cosyn
The primary purpose is to avoid chained indexing and eliminate the SettingWithCopyWarning.
Here, chained indexing is something like dfc['A'][0] = 111.
The pandas documentation says, in "Returning a view versus a copy", that chained indexing should be avoided. Here is a slightly modified example from that document:
In [1]: import pandas as pd
In [2]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})
In [3]: dfc
Out[3]:
     A  B
0  aaa  1
1  bbb  2
2  ccc  3
In [4]: aColumn = dfc['A']
In [5]: aColumn[0] = 111
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
In [6]: dfc
Out[6]:
     A  B
0  111  1
1  bbb  2
2  ccc  3
Here aColumn is a view and not a copy of the original DataFrame, so modifying aColumn causes the original dfc to be modified too. Next, if we index the row first:
In [7]: zero_row = dfc.loc[0]
In [8]: zero_row['A'] = 222
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
In [9]: dfc
Out[9]:
     A  B
0  111  1
1  bbb  2
2  ccc  3
This time zero_row is a copy, so the original dfc is not modified.
From the two examples above, we see that it is ambiguous whether the original DataFrame will be changed. This is especially dangerous if you write something like the following:
In [10]: dfc.loc[0]['A'] = 333
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
In [11]: dfc
Out[11]:
     A  B
0  111  1
1  bbb  2
2  ccc  3
This time it didn't work at all. We wanted to change dfc, but we actually modified an intermediate value, dfc.loc[0], which is a copy and is discarded immediately. It is very hard to predict whether an intermediate value like dfc.loc[0] or dfc['A'] is a view or a copy, so there is no guarantee that the original DataFrame will be updated. That is why chained indexing should be avoided, and why pandas generates the SettingWithCopyWarning for this kind of chained indexing update.
This is where .copy() comes in. To eliminate the warning, make a copy to express your intention explicitly:
In [12]: zero_row_copy = dfc.loc[0].copy()
In [13]: zero_row_copy['A'] = 444 # This time no warning
Since you are modifying a copy, you know the original dfc will never change, and you are not expecting it to change. Your expectation matches the behavior, so the SettingWithCopyWarning disappears.
Note: if you do want to modify the original DataFrame, the documentation suggests you use loc:
In [14]: dfc.loc[0,'A'] = 555
In [15]: dfc
Out[15]:
     A  B
0  555  1
1  bbb  2
2  ccc  3
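The two explicit options from this answer can be put side by side in one short sketch. String values are used throughout here (an assumption made for this illustration, unlike the integers in the session above) so that the column keeps a single type, and names such as after_copy_edit are only illustrative.

```python
import pandas as pd

dfc = pd.DataFrame({"A": ["aaa", "bbb", "ccc"], "B": [1, 2, 3]})

# Option 1: you want an independent row to play with -> copy explicitly
zero_row_copy = dfc.loc[0].copy()
zero_row_copy["A"] = "zzz"
after_copy_edit = dfc.loc[0, "A"]  # the original is still "aaa"

# Option 2: you want to change the original -> one .loc call, no chaining
dfc.loc[0, "A"] = "yyy"
after_loc_edit = dfc.loc[0, "A"]   # the original is now "yyy"

print(after_copy_edit, after_loc_edit)
```

In both cases the code states which object is meant to change, so there is no intermediate value whose view-or-copy status matters.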