Original question: http://stackoverflow.com/questions/41812564/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Looping through a list of pandas dataframes
Asked by Iwan Thomas
Two quick pandas questions for you.
I have a list of dataframes I would like to apply a filter to.
    countries = [us, uk, france]
    for df in countries:
        df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
When I run this, the df's don't change afterwards. Why is that? If I loop through the dataframes to create a new column, as below, this works fine, and changes each df in the list.
    for df in countries:
        df["Continent"] = "Europe"
As a follow up question, I noticed something strange when I created a list of dataframes for different countries. I defined the list then applied transformations to each df in the list. After I transformed these different dfs, I called the list again. I was surprised to see that the list still pointed to the unchanged dataframes, and I had to redefine the list to update the results. Could anybody shed any light on why that is?
Answered by miradulo
Taking a look at this answer, you can see that "for df in countries:" is equivalent to something like
    for idx in range(len(countries)):
        df = countries[idx]
        # do something with df
which obviously won't actually modify anything in your list. It is generally bad practice to modify a list while iterating over it in a loop like this.
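To make that concrete, here is a minimal sketch. The two small frames are made-up stand-ins for the question's us and uk (the dates are assumptions for illustration), and pd.Timestamp bounds are used in place of the date strings. Rebinding the loop variable leaves the list alone, while assigning back through the index does replace the element.

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames (made-up dates).
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-05", "2016-12-01"])})
countries = [us, uk]

start, end = pd.Timestamp("2016-11-01"), pd.Timestamp("2016-11-30")

# Rebinding the loop variable: the list keeps pointing at the original frames.
for df in countries:
    df = df[(df["Send Date"] > start) & (df["Send Date"] < end)]
rows_after_rebinding = [len(d) for d in countries]

# Assigning through the index stores the filtered frame in the list itself.
for idx in range(len(countries)):
    df = countries[idx]
    countries[idx] = df[(df["Send Date"] > start) & (df["Send Date"] < end)]
rows_after_index_assignment = [len(d) for d in countries]

print(rows_after_rebinding, rows_after_index_assignment)
```

Replacing an element by index does not change the length of the list, so it is safe in this particular loop. Note that the name us still refers to the unfiltered frame afterwards.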
A better approach would be a list comprehension; you can try something like
    countries = [us, uk, france]
    countries = [df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
                 for df in countries]
Notice that with a list comprehension like this, we aren't actually modifying the original list - instead we are creating a new list, and assigning it to the variable which held our original list.
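This is probably also what is behind the follow-up question: the list and the names us, uk, france are separate references, so building a new list does not change what those names point to (and vice versa). A small sketch, again with made-up frames and pd.Timestamp bounds as assumptions:

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames (made-up dates).
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-05", "2016-12-01"])})

start, end = pd.Timestamp("2016-11-01"), pd.Timestamp("2016-11-30")

countries = [us, uk]
original_list = countries  # a second name for the same list object

# The comprehension builds a brand-new list and rebinds the name `countries`.
countries = [df[(df["Send Date"] > start) & (df["Send Date"] < end)]
             for df in countries]

is_same_list = countries is original_list           # a different list object
unfiltered_rows = [len(d) for d in original_list]   # old list is untouched
filtered_rows = [len(d) for d in countries]
us_rows_before_unpacking = len(us)                  # `us` is still unfiltered

# To update the individual names as well, unpack the new list back into them.
us, uk = countries
us_rows_after_unpacking = len(us)
```

The unpacking on the last lines is one way to keep the individual names in sync; redefining the list after transforming each frame, as described in the question, achieves the same thing from the other direction.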
Also, you might consider placing all of your data in a single DataFrame with an additional country column or something along those lines - Python-level loops are generally slower and a list of DataFrames is often much less convenient to work with than a single DataFrame, which can fully leverage the vectorized pandas methods.
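A rough sketch of the single-DataFrame idea, assuming a made-up "Country" column name and made-up data: each frame is tagged with its country, everything is stacked with pd.concat, and the date filter is applied once.

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames (made-up dates).
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-05", "2016-12-01"])})

# Tag each frame with an illustrative "Country" column, then stack them.
combined = pd.concat(
    [us.assign(Country="US"), uk.assign(Country="UK")],
    ignore_index=True,
)

start, end = pd.Timestamp("2016-11-01"), pd.Timestamp("2016-11-30")

# One vectorized filter over all countries instead of a Python-level loop.
in_november = combined[(combined["Send Date"] > start) & (combined["Send Date"] < end)]

# Per-country results are still available through groupby.
rows_per_country = in_november.groupby("Country").size()
print(rows_per_country)
```

A single country can be pulled back out with a boolean mask such as combined["Country"] == "UK" whenever a separate frame is needed.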
Answered by Janet Lu
For why
    for df in countries:
        df["Continent"] = "Europe"
modifies countries, while
    for df in countries:
        df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
does not, see "why should I make a copy of a data frame in pandas". df is a reference to the actual DataFrame in countries, not a copy of it, so modifications made through that reference affect the original DataFrame as well. Adding a new column is such a modification. Taking a subset, however, is not: the indexing expression builds a new DataFrame, and the assignment only rebinds the name df to that new object, leaving the original DataFrame and the list untouched.
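The difference can be checked directly with identity tests. The following is only a sketch with a single made-up frame standing in for uk:

```python
import pandas as pd

# Hypothetical stand-in for one of the question's frames (made-up dates).
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-05", "2016-12-01"])})
countries = [uk]

df = countries[0]
same_object = df is countries[0]      # both names refer to one DataFrame

# Mutation: adding a column changes the shared object, so the list sees it.
df["Continent"] = "Europe"
column_visible_via_list = "Continent" in countries[0].columns

# Subsetting builds a new DataFrame; the original is not modified.
subset = df[df["Send Date"] < pd.Timestamp("2016-11-30")]
subset_is_new_object = subset is not countries[0]

# Rebinding: `df` now names the subset, while the list still holds the original.
df = subset
rows_in_list = len(countries[0])
rows_in_df = len(df)
```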