Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38297292/

Time: 2020-09-14 01:33:22  Source: igfitidea

Apply a for loop to multiple DataFrames in Pandas

python pandas dataframe

Asked by sparrow

I have multiple DataFrames that I want to do the same thing to.

First I create a list of the DataFrames. All of them have the same column called 'result'.

df_list = [df1, df2, df3]

I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:

for df in df_list:
    df = df[df['result'] == 'passed']

...this does not work, the values are not filtered out of each DataFrame.

If I filter each one separately then it does work.

df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']

Answered by juanpa.arrivillaga

This is because every time you take a subset like df[<whatever>], you get back a new DataFrame and assign it to the loop variable df, which is overwritten each time you move on to the next iteration (although you do keep the last one). This is similar to slicing lists:

>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
...   lyst = lyst[1:-1]
... 
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
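
To make the contrast concrete, here is a minimal sketch (not part of the original answer) that runs the same rebinding loop and then a second loop using slice assignment, which mutates each list object instead of rebinding the loop variable. The name `after_rebinding` is introduced only for illustration.

```python
list1 = [1, 2, 3, 4]
list2 = [11, 12, 13, 14]

# Rebinding: `lyst` is pointed at a brand-new list on each pass,
# so the objects named list1 and list2 are left untouched.
for lyst in (list1, list2):
    lyst = lyst[1:-1]

# Snapshot of both lists after the rebinding loop (illustrative name).
after_rebinding = (list(list1), list(list2))

# Mutation: slice assignment replaces the contents of the existing
# list object, so the change is visible through list1 and list2.
for lyst in (list1, list2):
    lyst[:] = lyst[1:-1]

print(after_rebinding)
print(list1, list2)
```

The first loop leaves both lists unchanged; only the second loop trims them.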

Usually, you need to use a mutator method if you want to actually modify the lists in place. Equivalently, with a DataFrame, you could use assignment on an indexer, e.g. .loc/.ix/.iloc/etc. (note that .ix has since been removed from pandas), in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three DataFrames and I want to keep only the rows where my second column (labelled 1) is positive:

Warning: This way is not ideal; see the edit further down for a better way.

In [11]: df1
Out[11]: 
          0         1         2         3
0  0.957288 -0.170286  0.406841 -3.058443
1  1.762343 -1.837631 -0.867520  1.666193
2  0.618665  0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4  0.638419 -0.271300 -0.918894  1.524009
5  0.957006  1.181246  0.513298  0.370174
6  0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700  0.099904

In [12]: df2
Out[12]: 
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
2  0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273  0.719489
4 -0.887862  1.073306  0.351835  0.313953
5 -0.517824 -0.096929 -0.300282  0.716020
6 -1.121527  0.183219  0.938509  0.842882
7  0.003498 -2.241854 -1.146984 -0.751192

In [13]: df3
Out[13]: 
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
2 -0.493466 -0.717872  1.090417 -0.591872
3  1.021246 -0.060453 -0.013952  0.304933
4 -0.859882 -0.947950  0.562609  1.313632
5  0.917199  1.186865  0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7  1.088068 -0.893466 -0.188419 -0.451623

In [14]: for df in df1, df2, df3:
   ....:     df.loc[:,:] = df.loc[df[1] > 0,:]
   ....:     df.dropna(inplace = True,axis =0)
   ....:     

In [15]: df1
Out[15]: 
          0         1         2         3
2  0.618665  0.660312 -1.319740 -0.024854
5  0.957006  1.181246  0.513298  0.370174

In [16]: df2
Out[16]: 
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
4 -0.887862  1.073306  0.351835  0.313953
6 -1.121527  0.183219  0.938509  0.842882

In [17]: df3
Out[17]: 
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
5  0.917199  1.186865  0.354839 -1.771787

Edited to Add:

I think I found a better way, using just the .drop method.

In [21]: df1
Out[21]: 
          0         1         2         3
0 -0.804913 -0.481498  0.076843  1.136567
1 -0.457197 -0.903681 -0.474828  1.289443
2 -0.820710  1.610072  0.175455  0.712052
3  0.715610 -0.178728 -0.664992  1.261465
4 -0.297114 -0.591935  0.487698  0.760450
5  1.035231 -0.108825 -1.058996  0.056320
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999

In [22]: df2
Out[22]: 
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443

In [23]: df3
Out[23]: 
          0         1         2         3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856  1.468124 -0.961131
2 -0.048568  0.373192 -0.666330  0.867719
3  0.533597 -1.222963  0.119789 -0.037949
4  1.203075 -0.773511  0.475809  1.352943
5 -0.984069 -0.352267 -0.313516  0.138259
6  0.114596  0.354404  2.119963 -0.452462
7 -1.033029 -0.787237  0.479321 -0.818260


In [25]: for df in df1,df2,df3:
   ....:     df.drop(df.index[df[1] < 0],axis=0,inplace=True)
   ....:     

In [26]: df1
Out[26]: 
          0         1         2         3
2 -0.820710  1.610072  0.175455  0.712052
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999

In [27]: df2
Out[27]: 
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443

In [28]: df3
Out[28]: 
          0         1         2         3
2 -0.048568  0.373192 -0.666330  0.867719
6  0.114596  0.354404  2.119963 -0.452462
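
As a rough sketch of how the same .drop pattern might be applied to the question's 'result' column, the snippet below builds two small DataFrames and drops every row whose result is not 'passed'. The data and the 'score' column are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample data; only the 'result' column comes from the question.
df1 = pd.DataFrame({"result": ["passed", "failed", "passed"], "score": [1, 2, 3]})
df2 = pd.DataFrame({"result": ["failed", "failed", "passed"], "score": [4, 5, 6]})

for df in (df1, df2):
    # Collect the index labels of the rows to discard, then drop them
    # from the existing DataFrame object rather than rebinding `df`.
    df.drop(df.index[df["result"] != "passed"], inplace=True)

print(df1)
print(df2)
```

Because the drop happens in place, the change is visible through the original names df1 and df2 after the loop.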

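It is certainly faster in this small benchmark, roughly half the time over 10,000 runs. Note that np.random.random only yields values in [0, 1), so neither statement actually removes any rows here: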

In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036

In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
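
If in-place modification is not a requirement, one alternative (not taken from the answer above) is to rebuild the list with a list comprehension and then rebind the names. This is a minimal sketch with made-up data; only the 'result' column and the df_list name come from the question.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df1 = pd.DataFrame({"result": ["passed", "failed"], "value": [1, 2]})
df2 = pd.DataFrame({"result": ["failed", "passed"], "value": [3, 4]})
df_list = [df1, df2]

# Build a new list of filtered DataFrames instead of rebinding a loop variable.
df_list = [df[df["result"] == "passed"] for df in df_list]

# The names df1 and df2 still point at the unfiltered frames,
# so rebind them explicitly if they are used later.
df1, df2 = df_list

print(df1)
print(df2)
```

The trade-off is that any other reference to the original DataFrames still sees the unfiltered data, whereas the .drop approach changes the objects themselves.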