Checking whether a data frame is a copy or a view in Pandas

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/26879073/

Time: 2020-09-13 22:39:35  Source: igfitidea

Checking whether data frame is copy or view in Pandas

python pandas chained-assignment

Asked by nick_eu

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.

For example, I thought "id(df.values)" would be stable across views, but it doesn't seem to be:

import pandas as pd

# Make two data frames that are views of same data.
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
       columns = ['a','b','c','d'])
df2 = df.iloc[0:2,:]

# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99

# Now try and compare the id on values attribute
# Different despite being views! 

id(df.values)
Out[71]: 4753564496

id(df2.values)
Out[72]: 4753603728

# And we can of course compare df and df2
df is df2
Out[73]: False

Other answers I've looked up try to give rules, but they don't seem consistent, and they also don't answer this question of how to test:

And of course: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy

UPDATE: Comments below seem to answer the question: looking at the df.values.base attribute rather than the df.values attribute does it, as does a reference to the df._is_copy attribute (though the latter is probably very bad form since it's an internal).

Accepted answer by nick_eu

Answers from HYRY and Marius in comments!

One can check either by:

  • testing equivalence of the values.base attribute rather than the values attribute, as in:

    df.values.base is df2.values.base instead of df.values is df2.values.

  • or using the (admittedly internal) _is_view attribute (df2._is_view is True).
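
As a complement to the two checks above, here is a minimal sketch that compares the underlying buffers with NumPy's np.shares_memory instead of comparing .base objects. This is not from the original comments: the helper name shares_data is made up for illustration, and the sketch assumes single-dtype frames, so that to_numpy() can hand back the stored array without copying.

```python
import numpy as np
import pandas as pd


def shares_data(frame_a, frame_b):
    """Hypothetical helper: True if the two frames' value arrays overlap in memory.

    Assumes each frame holds a single dtype; mixed-dtype frames are
    converted into a fresh array by to_numpy(), so they never report sharing.
    """
    return np.shares_memory(frame_a.to_numpy(), frame_b.to_numpy())


df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=['row1', 'row2'], columns=['a', 'b', 'c', 'd'])
df2 = df.iloc[0:2, :]          # row slice of a single-dtype frame
df3 = df.loc[df['a'] == 1, :]  # boolean mask selection

print(shares_data(df, df2))  # slice: expected to share the buffer
print(shares_data(df, df3))  # mask selection: expected not to share it
```

Like the .base comparison, this only inspects memory and does not modify either frame, which is what the question asked for.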

Thanks everyone!

Answer by ascripter

I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy, which can be None or a reference to the original DataFrame:

import pandas as pd

df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
        columns = ['a','b','c','d'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]

# df is neither copy nor view
df._is_view, df._is_copy
Out[1]: (False, None)

# df2 is a view AND a copy
df2._is_view, df2._is_copy
Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)

# df3 is not a view, but a copy
df3._is_view, df3._is_copy
Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)

So checking these two attributes should tell you not only if you're dealing with a view or not, but also if you have a copy or an "original" DataFrame.
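
Since both attributes are private, they may behave differently or be missing in other pandas versions. A minimal sketch, assuming only that they are read defensively with getattr; the helper name copy_view_flags and its dictionary keys are hypothetical, not part of pandas:

```python
import pandas as pd


def copy_view_flags(frame):
    """Hypothetical helper: read pandas' private flags without assuming they exist."""
    return {
        # _is_view is private; fall back to a marker string if it is missing.
        "is_view": getattr(frame, "_is_view", "unavailable"),
        # _is_copy is None or a weak reference to the parent frame (when present).
        "has_parent_ref": getattr(frame, "_is_copy", None) is not None,
    }


df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=['row1', 'row2'], columns=['a', 'b', 'c', 'd'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]

for name, frame in [('df', df), ('df2', df2), ('df3', df3)]:
    print(name, copy_view_flags(frame))
```

The printed values depend on the installed pandas version, so treat them as a diagnostic rather than a stable contract.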

See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.

Answer by Thomas Kimber

You might trace the memory your pandas/python environment is consuming, and, on the assumption that a copy will utilise more memory than a view, be able to decide one way or another.

I believe there are libraries out there that will present the memory usage within the python environment itself - e.g. Heapy/Guppy.

There ought to be a metric you can apply that takes a baseline picture of memory usage prior to creating the object under inspection, then another picture afterwards. Comparison of the two memory maps (assuming nothing else has been created, so that the change can be attributed to the new object) should provide an idea of whether a view or copy has been produced.

We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.
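
One way to try the before/after idea is Python's standard tracemalloc module rather than Heapy/Guppy. The sketch below is an assumption-laden illustration, not a tested recipe from this answer: the helper name traced_growth is made up, and it relies on NumPy reporting its buffer allocations to tracemalloc and on nothing else allocating much while the object is built.

```python
import tracemalloc

import numpy as np
import pandas as pd


def traced_growth(make):
    """Hypothetical helper: build an object and report traced memory growth in bytes."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    obj = make()  # keep the object alive until the second measurement
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return obj, after - before


big = pd.DataFrame(np.zeros((1000, 1000)))  # roughly 8 MB of float64 data

_sliced, slice_growth = traced_growth(lambda: big.iloc[:500])
_copied, copy_growth = traced_growth(lambda: big.iloc[:500].copy())

print("row slice grew memory by", slice_growth, "bytes")
print("explicit copy grew memory by", copy_growth, "bytes")
```

A growth figure close to the size of the selected data suggests a copy, while a few kilobytes of wrapper objects suggests a view. The approach needs a frame large enough for the difference to stand out from noise.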
