Applying a for loop to multiple DataFrames in Pandas

Question

Asked by sparrow

I have multiple DataFrames that I want to do the same thing to.

First I create a list of the DataFrames. All of them have the same column called 'result'.

df_list = [df1,df2,df3]

I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:

for df in df_list:
    df = df[df['result'] == 'passed']

...this does not work, the values are not filtered out of each DataFrame.

If I filter each one separately then it does work.

df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']

Answer 1

Answered by juanpa.arrivillaga

This is because every time you take a subset like df[<whatever>], you get back a new DataFrame and assign it to the loop variable df, which is overwritten each time you go to the next iteration (although you do keep the last one). The original DataFrames are never touched. This is similar to slicing lists:

>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
...   lyst = lyst[1:-1]
... 
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
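
The same rebinding effect can be reproduced with DataFrames. The following is a minimal sketch, not part of the original answer: the two small frames and the names filtered, lengths_after_loop and filtered_lengths are made up for illustration. It also shows one possible workaround that avoids in-place modification entirely, namely building a new list with a list comprehension and rebinding the original names from it.

```python
import pandas as pd

# Hypothetical sample data; only the 'result' column comes from the question.
df1 = pd.DataFrame({"result": ["passed", "failed", "passed"]})
df2 = pd.DataFrame({"result": ["failed", "failed", "passed"]})
df_list = [df1, df2]

for df in df_list:
    # This only rebinds the local name df to a new, filtered DataFrame.
    # The objects stored in df_list are left unchanged.
    df = df[df["result"] == "passed"]

lengths_after_loop = [len(d) for d in df_list]

# One workaround: build a new list of filtered frames ...
filtered = [d[d["result"] == "passed"] for d in df_list]
filtered_lengths = [len(d) for d in filtered]

# ... and, if the original names are needed, rebind them explicitly.
df1, df2 = filtered

print(lengths_after_loop)  # both frames still have all their rows
print(filtered_lengths)
```

Whether rebinding or in-place modification is preferable depends on whether other code still holds references to the original frames.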

Usually, you need to use a mutator method if you want to actually modify the lists in place. Equivalently, with a DataFrame, you could assign through an indexer such as .loc or .iloc (.ix also worked when this was written, but it has since been removed from pandas) in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three DataFrames and I want to keep only the rows where my second column (label 1) is positive:

Warning: This way is not ideal; see the edit below for a better way.

In [11]: df1
Out[11]: 
          0         1         2         3
0  0.957288 -0.170286  0.406841 -3.058443
1  1.762343 -1.837631 -0.867520  1.666193
2  0.618665  0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4  0.638419 -0.271300 -0.918894  1.524009
5  0.957006  1.181246  0.513298  0.370174
6  0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700  0.099904

In [12]: df2
Out[12]: 
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
2  0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273  0.719489
4 -0.887862  1.073306  0.351835  0.313953
5 -0.517824 -0.096929 -0.300282  0.716020
6 -1.121527  0.183219  0.938509  0.842882
7  0.003498 -2.241854 -1.146984 -0.751192

In [13]: df3
Out[13]: 
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
2 -0.493466 -0.717872  1.090417 -0.591872
3  1.021246 -0.060453 -0.013952  0.304933
4 -0.859882 -0.947950  0.562609  1.313632
5  0.917199  1.186865  0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7  1.088068 -0.893466 -0.188419 -0.451623

In [14]: for df in df1, df2, df3:
   ....:     df.loc[:,:] = df.loc[df[1] > 0,:]
   ....:     df.dropna(inplace = True,axis =0)
   ....:     

In [15]: df1
Out[15]: 
          0         1         2         3
2  0.618665  0.660312 -1.319740 -0.024854
5  0.957006  1.181246  0.513298  0.370174

In [16]: df2
Out[16]: 
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
4 -0.887862  1.073306  0.351835  0.313953
6 -1.121527  0.183219  0.938509  0.842882

In [17]: df3
Out[17]: 
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
5  0.917199  1.186865  0.354839 -1.771787

Edited to Add:

I think I found a better way, using just the .drop method.

In [21]: df1
Out[21]: 
          0         1         2         3
0 -0.804913 -0.481498  0.076843  1.136567
1 -0.457197 -0.903681 -0.474828  1.289443
2 -0.820710  1.610072  0.175455  0.712052
3  0.715610 -0.178728 -0.664992  1.261465
4 -0.297114 -0.591935  0.487698  0.760450
5  1.035231 -0.108825 -1.058996  0.056320
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999

In [22]: df2
Out[22]: 
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443

In [23]: df3
Out[23]: 
          0         1         2         3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856  1.468124 -0.961131
2 -0.048568  0.373192 -0.666330  0.867719
3  0.533597 -1.222963  0.119789 -0.037949
4  1.203075 -0.773511  0.475809  1.352943
5 -0.984069 -0.352267 -0.313516  0.138259
6  0.114596  0.354404  2.119963 -0.452462
7 -1.033029 -0.787237  0.479321 -0.818260


In [25]: for df in df1,df2,df3:
   ....:     df.drop(df.index[df[1] < 0],axis=0,inplace=True)
   ....:     

In [26]: df1
Out[26]: 
          0         1         2         3
2 -0.820710  1.610072  0.175455  0.712052
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999

In [27]: df2
Out[27]: 
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443

In [28]: df3
Out[28]: 
          0         1         2         3
2 -0.048568  0.373192 -0.666330  0.867719
6  0.114596  0.354404  2.119963 -0.452462

It is certainly faster, at least in this small benchmark:

In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036

In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
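
Applied to the original question, the .drop approach might look like the sketch below. The sample frames and the extra id column are invented for illustration; only the 'result' column and the 'passed' value come from the question. The sketch assumes each frame has unique index labels, because .drop removes rows by label.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df1, df2, df3.
df1 = pd.DataFrame({"id": [1, 2, 3], "result": ["passed", "failed", "passed"]})
df2 = pd.DataFrame({"id": [4, 5, 6], "result": ["failed", "failed", "passed"]})
df3 = pd.DataFrame({"id": [7, 8, 9], "result": ["passed", "passed", "passed"]})

for df in (df1, df2, df3):
    # Select the index labels of the rows to discard, then drop them in place.
    # inplace=True mutates the existing object, so df1, df2 and df3 change.
    df.drop(df.index[df["result"] != "passed"], inplace=True)

print(df1)
print(df2)
print(df3)
```

Note that the surviving rows keep their original index labels; calling reset_index(drop=True, inplace=True) inside the loop would renumber them if that matters.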
