Python: Why should I make a copy of a data frame in pandas

Question

Asked by Elizabeth Susan Joseph

When selecting a sub-DataFrame from a parent DataFrame, I noticed that some programmers make a copy of the data frame using the .copy() method.

Why are they making a copy of the data frame? What will happen if I don't make a copy?

Answer 1

Accepted answer by cgold

This expands on Paul's answer. In pandas, indexing a DataFrame can return a view on the initial DataFrame rather than an independent object. In that case, changing the subset also changes the initial DataFrame, so use a copy when you want to be sure the initial DataFrame does not change. (This describes pandas without Copy-on-Write enabled; with Copy-on-Write, the subset behaves like a copy.) Consider the following code:

import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
df_sub = df[0:1]   # slice containing only the first row
df_sub.x = -1      # write through the slice
print(df)

You'll get:

   x
0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1
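
As a quick check of the claim above, here is a minimal, self-contained sketch that modifies the copy and then inspects both objects. It reuses the names from the answer and writes with bracket syntax, which is an arbitrary choice for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

# Take an explicit, independent copy of the first row.
df_sub_copy = df[0:1].copy()
df_sub_copy["x"] = -1

print(df_sub_copy["x"].tolist())  # values in the copy: [-1]
print(df["x"].tolist())           # values in the original: [1, 2]
```

Because .copy() gives the subset its own data, the write cannot reach df, whichever pandas version is in use.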

Answer 2

Answer by sparrow

Because if you don't make a copy, the same data (including its index) can still be modified elsewhere, even if you assign the DataFrame to a different name. Assignment only creates a second name for the same object.

For example:

df2 = df
func1(df2)
func2(df)

func1 can modify df by modifying df2, so to avoid that:

df2 = df.copy()
func1(df2)
func2(df)
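
The snippets above leave func1 undefined. The following sketch fills it in with a hypothetical function that mutates its argument in place (an assumption made only for illustration), and uses the names alias and independent in place of df2 to contrast a second name with a real copy.

```python
import pandas as pd


def func1(frame):
    # Hypothetical helper: modifies its argument in place.
    frame.loc[0, "x"] = 99


df = pd.DataFrame({"x": [1, 2]})

alias = df               # a second name for the same object
independent = df.copy()  # a separate object with its own data

func1(independent)
after_copy_call = df["x"].tolist()   # [1, 2]: df is untouched

func1(alias)
after_alias_call = df["x"].tolist()  # [99, 2]: df was modified through the alias

print(alias is df, independent is df)  # True False
```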

Answer 3

Answer by Gusev Slava

It is worth mentioning that whether a copy or a view is returned depends on the kind of indexing.

The pandas documentation says:

Returning a view versus a copy
The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
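
One way to probe the rule quoted above is to ask NumPy whether two results share memory. The sketch below is only an illustration: the quoted passage uses the old .ix indexer, which was removed in pandas 1.0, so plain [] indexing is used instead, and the outcome for the slice may depend on the pandas version and the DataFrame's dtypes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

by_mask = df[df["A"] > 2]  # boolean vector involved: the result is a copy
by_slice = df[1:3]         # plain slice: may be a view on df's data

# Compare the underlying NumPy buffers of column A.
mask_shares = np.shares_memory(df["A"].to_numpy(), by_mask["A"].to_numpy())
slice_shares = np.shares_memory(df["A"].to_numpy(), by_slice["A"].to_numpy())

print("boolean mask shares memory with df:", mask_shares)
print("slice shares memory with df:", slice_shares)
```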

Answer 4

Answer by bojax

In general it is safer to work on copies than on original data frames, except when you know that you won't be needing the original anymore and want to proceed with the manipulated version. Normally, you would still have some use for the original data frame to compare with the manipulated version, etc. Therefore, most people work on copies and merge at the end.

Answer 5

Answer by Cosyn

The primary purpose is to avoid chained indexing and eliminate the SettingWithCopyWarning.

Here, chained indexing means something like dfc['A'][0] = 111.

The documentation says, in the section "Returning a view versus a copy", that chained indexing should be avoided. Here is a slightly modified example from that section:

In [1]: import pandas as pd

In [2]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [3]: dfc
Out[3]:
    A   B
0   aaa 1
1   bbb 2
2   ccc 3

In [4]: aColumn = dfc['A']

In [5]: aColumn[0] = 111
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [6]: dfc
Out[6]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

Here aColumn is a view and not a copy of the original DataFrame, so modifying aColumn causes the original dfc to be modified too. Next, if we index the row first:

In [7]: zero_row = dfc.loc[0]

In [8]: zero_row['A'] = 222
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [9]: dfc
Out[9]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time zero_row is a copy, so the original dfc is not modified.

From these two examples above, we see it's ambiguous whether or not you want to change the original DataFrame. This is especially dangerous if you write something like the following:

In [10]: dfc.loc[0]['A'] = 333
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [11]: dfc
Out[11]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time it didn't work at all. Here we wanted to change dfc, but we actually modified an intermediate value, dfc.loc[0], which is a copy and is discarded immediately. It is very hard to predict whether an intermediate value like dfc.loc[0] or dfc['A'] is a view or a copy, so there is no guarantee that the original DataFrame will be updated. That's why chained indexing should be avoided, and why pandas generates the SettingWithCopyWarning for this kind of chained indexing update.

This is where .copy() comes in. To eliminate the warning, make a copy to express your intention explicitly:

In [12]: zero_row_copy = dfc.loc[0].copy()

In [13]: zero_row_copy['A'] = 444 # This time no warning

Since you are modifying a copy, you know the original dfc will never change, and you are not expecting it to change. Your expectation matches the behavior, so the SettingWithCopyWarning disappears.
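
The same idea applies to a filtered subset, which is probably where .copy() shows up most often in practice. The following is a minimal sketch with made-up column names (x and y are assumptions for illustration): the subset is copied right after filtering, and a new column is then added to the copy only.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Filter, then copy: subset is explicitly independent of df.
subset = df[df["x"] > 1].copy()

# Adding a column to the copy does not touch the original.
subset["y"] = subset["x"] * 10

print(subset["y"].tolist())  # [20, 30]
print(list(df.columns))      # ['x']
```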

Note: if you do want to modify the original DataFrame, the documentation suggests using loc:

In [14]: dfc.loc[0,'A'] = 555

In [15]: dfc
Out[15]:
    A   B
0   555 1
1   bbb 2
2   ccc 3
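
The single-step .loc form also works with a row condition instead of a single label. Here is a small sketch along the same lines, reusing dfc from the example above; the condition and the replacement value are arbitrary choices for illustration.

```python
import pandas as pd

dfc = pd.DataFrame({"A": ["aaa", "bbb", "ccc"], "B": [1, 2, 3]})

# One .loc call selects rows and column together, so the assignment
# targets dfc itself rather than an intermediate object.
dfc.loc[dfc["B"] > 1, "B"] = 0

print(dfc["B"].tolist())  # [1, 0, 0]
```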
