Pandas: Knowing when an operation affects the original dataframe
Original source: http://stackoverflow.com/questions/48173980/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by ejolly
I love pandas and have been using it for years, and I feel pretty confident that I have a good handle on how to subset dataframes and deal with views vs. copies appropriately (though I use a lot of assertions to be sure). I also know that there have been tons of questions about SettingWithCopyWarning, e.g. "How to deal with SettingWithCopyWarning in Pandas?", and some great recent guides on wrapping your head around when it happens, e.g. "Understanding SettingWithCopyWarning in pandas".
But I also know that specific things, like the quote from this answer, are no longer in the most recent docs (0.22.0), that many things have been deprecated over the years (leading to some inappropriate old SO answers), and that things are continuing to change.
Recently, after teaching pandas to complete newcomers with very basic general Python knowledge, covering things like avoiding chained indexing (and using .iloc/.loc), I've still struggled to provide general rules of thumb for knowing when it's important to pay attention to the SettingWithCopyWarning (and when it's safe to ignore it).
I've personally found that the specific pattern of subsetting a dataframe according to some rule (e.g. slicing or a boolean operation) and then modifying that subset, independent of the original dataframe, is a much more common operation than the docs suggest. In this situation we want to modify the copy, not the original, and the warning is confusing/scary to newcomers.
I know it's not trivial to know ahead of time when a view vs a copy is returned, e.g.
What rules does Pandas use to generate a view vs a copy?
Checking whether data frame is copy or view in Pandas
So instead I'm looking for the answer to a more general (beginner-friendly) question: when does performing an operation on a subsetted dataframe affect the original dataframe from which it was created, and when are they independent?
I've created some cases below that I think seem reasonable, but I'm not sure if there's a "gotcha" I'm missing or if there's an easier way to think about or check this. I was hoping someone could confirm that my intuitions about the following use cases are correct as they pertain to my question above.
import pandas as pd
df1 = pd.DataFrame({'A':[2,4,6,8,10],'B':[1,3,5,7,9],'C':[10,20,30,40,50]})
1) Warning: No
Original changed: No
# df1 will be unaffected because we use .copy() method explicitly
df2 = df1.copy()
#
# Reference: docs
df2.iloc[0,1] = 100
2) Warning: Yes (I don't really understand why)
Original changed: No
# df1 will be unaffected because .query() always returns a copy
#
# Reference:
# https://stackoverflow.com/a/23296545/8022335
df2 = df1.query('A < 10')
df2.iloc[0,1] = 100
3) Warning: Yes
Original changed: No
# df1 will be unaffected because boolean indexing with .loc
# always returns a copy
#
# Reference:
# https://stackoverflow.com/a/17961468/8022335
df2 = df1.loc[df1['A'] < 10,:]
df2.iloc[0,1] = 100
4) Warning: No
Original changed: No
# df1 will be unaffected because list indexing with .loc (or .iloc)
# always returns a copy
#
# Reference:
# Same as 3)
df2 = df1.loc[[0,3,4],:]
df2.iloc[0,1] = 100
5) Warning: No
Original changed: Yes (confusing to newcomers but makes sense)
# df1 will be affected because scalar/slice indexing with .iloc/.loc
# always references the original dataframe, but may sometimes
# provide a view and sometimes provide a copy
#
# Reference: docs
df2 = df1.loc[:10,:]
df2.iloc[0,1] = 100
tl;dr: When creating a new dataframe from the original, changing the new dataframe:
Will change the original when scalar/slice indexing with .loc/.iloc is used to create the new dataframe.
Will not change the original when boolean indexing with .loc, .query(), or .copy() is used to create the new dataframe.
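One way to check cases like these empirically, rather than by memorizing rules, is to write to a subset and then compare the parent frame against a snapshot. The sketch below uses a hypothetical helper, affects_original, which is not part of pandas. Its answer for slice-based subsets may differ between pandas versions, for instance when Copy-on-Write is enabled, so the printed results describe only the installation they run on.

```python
import warnings

import pandas as pd


def affects_original(df, make_subset):
    """Hypothetical helper: True if writing to the subset changes its parent."""
    parent = df.copy()        # private parent, so the caller's frame is never touched
    snapshot = parent.copy()  # independent reference values for the comparison
    subset = make_subset(parent)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide SettingWithCopyWarning, if raised
        subset.iloc[0, 1] = 100
    return not parent.equals(snapshot)


df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})

# The five ways of building df2 used in the cases above
cases = {
    'copy': lambda d: d.copy(),
    'query': lambda d: d.query('A < 10'),
    'boolean .loc': lambda d: d.loc[d['A'] < 10, :],
    'list .loc': lambda d: d.loc[[0, 3, 4], :],
    'slice .loc': lambda d: d.loc[:10, :],
}
results = {name: affects_original(df1, make) for name, make in cases.items()}
for name, changed in results.items():
    print(f'{name}: original changed = {changed}')
```

Because the helper works on its own copy of df1, running it leaves df1 itself untouched.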
Answered by JohnE
This is a somewhat confusing and even frustrating part of pandas, but for the most part you shouldn't really have to worry about this if you follow some simple workflow rules. In particular, note that there are only two general cases here when you have two dataframes, with one being a subset of the other.
This is a case where the Zen of Python rule "explicit is better than implicit" is a great guideline to follow.
Case A: Changes to df2 should NOT affect df1
This is trivial, of course. You want two completely independent dataframes so you just explicitly make a copy:
df2 = df1.copy()
After this, anything you do to df2 affects only df2 and not df1, and vice versa.
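As a quick check of that independence, here is a minimal sketch that reuses the df1 from the question and edits each frame once; the specific cells and values are chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})
df2 = df1.copy()       # deep copy by default: the data is duplicated

df2.iloc[0, 1] = 100   # edit the copy
df1.iloc[0, 2] = -1    # edit the original

print(df1.iloc[0, 1])  # 1: the edit to df2 did not reach df1
print(df2.iloc[0, 2])  # 10: the edit to df1 did not reach df2
```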
Case B: Changes to df2 should ALSO affect df1
In this case I don't think there is one general way to solve the problem because it depends on exactly what you're trying to do. However, there are a couple of standard approaches that are pretty straightforward and should not have any ambiguity about how they are working.
Method 1: Copy df1 to df2, then use df2 to update df1
In this case, you can basically do a one to one conversion of the examples above. Here's example #2:
df2 = df1.query('A < 10').copy()  # explicit copy, so df2 is independent of df1
df2.iloc[0,1] = 100
# DataFrame.append was removed in pandas 2.0; pd.concat performs the same re-merge
df1 = pd.concat([df2, df1]).reset_index().drop_duplicates(subset='index').drop(columns='index')
Unfortunately, the re-merging via concat (append in older pandas versions) is a bit verbose there. You can do it more cleanly with the following, although it has the side effect of converting integers to floats.
df1.update(df2) # note that this is an inplace operation
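If the float conversion is a concern, one possible workaround is to record the dtypes before calling update and cast back afterwards. This is a sketch under the assumption that the updated values still fit the original dtypes; the name original_dtypes is illustrative, and whether update changes the dtypes at all may depend on the pandas version.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})
original_dtypes = df1.dtypes.to_dict()  # remember the column dtypes

df2 = df1.query('A < 10').copy()        # independent subset
df2.iloc[0, 1] = 100

df1.update(df2)                         # in place; aligns on index and columns
df1 = df1.astype(original_dtypes)       # restore dtypes in case update changed them
print(df1)
print(df1.dtypes)
```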
Method 2: Use a mask (don't create df2 at all)
I think the best general approach here is not to create df2 at all, but rather to have it be a masked version of df1. Somewhat unfortunately, you can't do a direct translation of the above code because it mixes loc and iloc, which is fine for this example though probably unrealistic for actual use.
The advantage is that you can write very simple and readable code. Here's an alternative version of example #2 above where df2 is actually just a masked version of df1. But instead of changing via iloc, I'll change the value where column "C" == 10.
df2_mask = df1['A'] < 10
df1.loc[ df2_mask & (df1['C'] == 10), 'B'] = 100
Now if you print df1 or df1[df2_mask] you will see that column "B" = 100 for the first row of each dataframe. Obviously this is not very surprising here, but that's the inherent advantage of following "explicit is better than implicit".
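If a positional edit such as the question's .iloc[0, 1] is really what is wanted, one possible sketch is to translate the positions into labels of df1 first and then make a single .loc assignment on df1 itself. The names row_label and col_label are illustrative and not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})
df2_mask = df1['A'] < 10

# Translate "row 0, column 1 of the subset" into labels of df1
row_label = df1.index[df2_mask.to_numpy()][0]  # first row that passes the mask
col_label = df1.columns[1]                     # second column, here 'B'

df1.loc[row_label, col_label] = 100            # one .loc write on df1 itself
print(df1[df2_mask])
```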
Answered by romulomadu
I had the same doubt and searched for an answer in the past without success. So now I just make sure that the original is not changing, and put this piece of code at the beginning of the program to remove the warnings:
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
Answered by alububu
You only need to replace .iloc[0,1] with .iat[0,1].
More generally, if you want to modify only one element you should use the .iat or .at accessor. When you are modifying several elements at once, you should use .loc or .iloc instead.
Done this way, pandas shouldn't throw any warning.
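A minimal sketch of those accessors follows. It pairs them with an explicit .copy(), on the assumption that whether .iat alone avoids the warning may depend on the pandas version; the cells and values written here are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})
df2 = df1.query('A < 10').copy()  # explicit copy, independent of df1

df2.iat[0, 1] = 100               # one element by position: first row, second column
df2.at[1, 'C'] = 0                # one element by label: index 1, column 'C'
df2.loc[df2['A'] > 2, 'A'] = -1   # several elements at once
print(df2)
```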