pandas DataFrame view vs. copy: how do I tell?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/27367442/

Time: 2020-09-13 22:44:54  Source: igfitidea

Tags: python, pandas

Asked by user3659451

What's the difference between:

df.loc[:,('col_a','col_b')]

and

df.loc[:,['col_a','col_b']]

The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Thanks

Answered by unutbu

If your DataFrame has a simple column index, then there is no difference. For example,

In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))

In [9]: df.loc[:, ['A','B']]
Out[9]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

In [10]: df.loc[:, ('A','B')]
Out[10]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

But if the DataFrame has a MultiIndex, there can be a big difference:

df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                                   list('CDCDC')]))

#       foo    bar   
#         A  B   A  B
# baz C   7  9   9  9
#     D   7  5   5  4
# qux C   5  0   5  1
#     D   1  7   7  4
#     C   6  4   3  5

In [27]: df.loc[:, ('foo','B')]
Out[27]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

The KeyError is saying that the MultiIndex has to be lexsorted (the sortlevel method used below was deprecated in pandas 0.20 and removed in pandas 1.0; sort_index(axis=1) is the current equivalent). If we do that, then we still get a different result:

In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]: 
      foo   
        A  B
baz C   7  9
    D   7  5
qux C   5  0
    D   1  7
    C   6  4

Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.

In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.

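The tuple-versus-list distinction can be reproduced on a small frame. The following is a minimal sketch, assuming a recent pandas in which sort_index(axis=1) stands in for sortlevel; the data values and the names sorted_df, single and several are made up for illustration. The list holds only foo, because newer pandas versions may raise a KeyError for list labels (such as B) that are absent from the first level.

```python
import numpy as np
import pandas as pd

# Two-level column index with the same layout as above; values are illustrative.
columns = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
                                     ["A", "B", "A", "B"]])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns)
sorted_df = df.sort_index(axis=1)  # lexsort the column MultiIndex

# A tuple is read as one key spanning both levels: a single column (a Series).
single = sorted_df.loc[:, ("foo", "B")]

# A list is read as several labels of the first level: every "foo" column.
several = sorted_df.loc[:, ["foo"]]

print(type(single).__name__, single.name)
print(list(several.columns))
```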

I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect. However, if you make an assignment of the form

df.loc[...] = value

then you can trust Pandas to alter df itself.

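As a small illustration of that guarantee, here is a sketch with made-up data (the frame and the boolean condition are assumptions chosen only for the demo). The whole selection sits in one .loc call on the left-hand side, so the write lands in df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# A single .loc call on the left-hand side of an assignment writes into df.
df.loc[df["A"] > 3, "B"] = -1

print(df["B"].tolist())
```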

The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chain assignments of the form

df.loc[...][...] = value

Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then

df.loc[...][...] = value

is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well, since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes; (at least in CPython) it is therefore soon to be garbage collected.

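The pitfall can be seen in a short sketch. It assumes a boolean mask as the first indexer, a case where df.loc[...] hands back a copy; the data and the names mask and unchanged are illustrative. pandas normally warns about this pattern, and the warning is silenced here only to keep the demo quiet.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
mask = df["A"] > 3

with warnings.catch_warnings():
    # Silence the chained-assignment warning for the purpose of this demo.
    warnings.simplefilter("ignore")
    # Chained form: df.loc[mask] builds a temporary copy, and ["B"] = ...
    # writes into that temporary object, which is then discarded.
    df.loc[mask]["B"] = -1

# Column B of df still holds its original values.
unchanged = df["B"].tolist() == [1, 4, 7, 10]
print(unchanged)
```

Writing the same update as df.loc[mask, "B"] = -1 keeps everything in one indexing call and does modify df.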



I do not know of a practical, fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.

However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):

  • If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
  • If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
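
The first rule of thumb mirrors how NumPy itself behaves. The sketch below shows the NumPy side only, on a made-up array; it does not show what pandas does on top of it, and pandas may still copy in cases where NumPy alone would return a view.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

basic = arr[1:3, :]      # basic slicing: NumPy returns a view
fancy = arr[[0, 2], :]   # integer-list ("fancy") indexing: NumPy returns a copy

print(np.shares_memory(arr, basic))
print(np.shares_memory(arr, fancy))
```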

However, there is an easy way to determine if x = df.loc[..] is a view a posteriori: simply see if changing a value in x affects df. If it does, it is a view; if not, x is a copy.

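A rough sketch of that after-the-fact check follows. The names before and is_view are illustrative, and the sentinel value -999 is an arbitrary choice. The outcome depends on the selection and on the pandas version, so no expected output is shown; in particular, with Copy-on-Write enabled (the default from pandas 3.0), a write to x is not expected to reach df.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
x = df.loc[:, ["A", "B"]]

before = df.copy()  # snapshot of df to compare against afterwards

with warnings.catch_warnings():
    # Older pandas versions may warn when writing into x; ignore that here.
    warnings.simplefilter("ignore")
    x.iloc[0, 0] = -999  # change one value in x

# If df differs from its snapshot, the write went through, so x was a view.
is_view = not df.equals(before)
print(is_view)
```

Run the check on a throwaway frame (or restore df from before afterwards), since a view would leave the sentinel value in df.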