Original URL: http://stackoverflow.com/questions/38297292/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Apply a for loop to multiple DataFrames in Pandas
Asked by sparrow
I have multiple DataFrames that I want to do the same thing to.
First I create a list of the DataFrames. All of them have the same column called 'result'.
df_list = [df1,df2,df3]
I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:
for df in df_list:
    df = df[df['result'] == 'passed']
...this does not work, the values are not filtered out of each DataFrame.
If I filter each one separately then it does work.
df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']
Answered by juanpa.arrivillaga
This is because every time you take a subset like df[<whatever>] you get back a new dataframe, and you assign it to the loop variable df. That variable is rebound on every iteration, so the original dataframes are never touched (although df does keep the last result after the loop ends). This is similar to slicing lists:
>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1, list2:
...     lyst = lyst[1:-1]
...
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
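To make the contrast concrete, here is a minimal sketch with plain lists (the names a, b, and rebound are made up for illustration). Plain assignment only rebinds the loop variable, while slice assignment changes the list object that the original name still refers to:

```python
a = [1, 2, 3, 4]
b = [11, 12, 13, 14]

# Rebinding: 'lyst' is pointed at a brand-new list on each iteration,
# so the objects named 'a' and 'b' are left untouched.
for lyst in (a, b):
    lyst = lyst[1:-1]
rebound = (list(a), list(b))  # snapshot taken after the rebinding loop

# Slice assignment: the contents of the existing list object are replaced,
# so the change is visible through the original names.
for lyst in (a, b):
    lyst[:] = lyst[1:-1]

print(rebound)
print(a, b)
```

The second loop is the list counterpart of the in-place DataFrame operations discussed in this answer.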
Usually, you need to use a mutator method if you want to actually modify the lists in place. Equivalently, with a dataframe, you could use assignment on an indexer such as .loc, .iloc, or .ix (.ix has since been removed from pandas), in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three dataframes and I want to keep only the rows where my second column is positive:
Warning: this approach is not ideal; see the edit further down for a better way.
In [11]: df1
Out[11]:
0 1 2 3
0 0.957288 -0.170286 0.406841 -3.058443
1 1.762343 -1.837631 -0.867520 1.666193
2 0.618665 0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4 0.638419 -0.271300 -0.918894 1.524009
5 0.957006 1.181246 0.513298 0.370174
6 0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700 0.099904
In [12]: df2
Out[12]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
2 0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273 0.719489
4 -0.887862 1.073306 0.351835 0.313953
5 -0.517824 -0.096929 -0.300282 0.716020
6 -1.121527 0.183219 0.938509 0.842882
7 0.003498 -2.241854 -1.146984 -0.751192
In [13]: df3
Out[13]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
2 -0.493466 -0.717872 1.090417 -0.591872
3 1.021246 -0.060453 -0.013952 0.304933
4 -0.859882 -0.947950 0.562609 1.313632
5 0.917199 1.186865 0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7 1.088068 -0.893466 -0.188419 -0.451623
In [14]: for df in df1, df2, df3:
   ....:     df.loc[:, :] = df.loc[df[1] > 0, :]
   ....:     df.dropna(inplace=True, axis=0)
   ....:
In [15]: df1
Out[15]:
0 1 2 3
2 0.618665 0.660312 -1.319740 -0.024854
5 0.957006 1.181246 0.513298 0.370174
In [16]: df2
Out[16]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
4 -0.887862 1.073306 0.351835 0.313953
6 -1.121527 0.183219 0.938509 0.842882
In [17]: df3
Out[17]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
5 0.917199 1.186865 0.354839 -1.771787
Edited to Add:
I think I found a better way just using the .drop method.
In [21]: df1
Out[21]:
0 1 2 3
0 -0.804913 -0.481498 0.076843 1.136567
1 -0.457197 -0.903681 -0.474828 1.289443
2 -0.820710 1.610072 0.175455 0.712052
3 0.715610 -0.178728 -0.664992 1.261465
4 -0.297114 -0.591935 0.487698 0.760450
5 1.035231 -0.108825 -1.058996 0.056320
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [22]: df2
Out[22]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [23]: df3
Out[23]:
0 1 2 3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856 1.468124 -0.961131
2 -0.048568 0.373192 -0.666330 0.867719
3 0.533597 -1.222963 0.119789 -0.037949
4 1.203075 -0.773511 0.475809 1.352943
5 -0.984069 -0.352267 -0.313516 0.138259
6 0.114596 0.354404 2.119963 -0.452462
7 -1.033029 -0.787237 0.479321 -0.818260
In [25]: for df in df1, df2, df3:
   ....:     df.drop(df.index[df[1] < 0], axis=0, inplace=True)
   ....:
In [26]: df1
Out[26]:
0 1 2 3
2 -0.820710 1.610072 0.175455 0.712052
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [27]: df2
Out[27]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [28]: df3
Out[28]:
0 1 2 3
2 -0.048568 0.373192 -0.666330 0.867719
6 0.114596 0.354404 2.119963 -0.452462
Certainly faster:
In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036
In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
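Applied back to the original question, the .drop approach might look like the sketch below. The sample data, the score column, and the names filtered and lengths_before are made up for illustration; only the 'result' column and the 'passed' value come from the question. It also assumes each DataFrame has a unique index, since .drop removes rows by label. A non-mutating alternative, not part of the answer above, is to build a new list of filtered DataFrames and use that list afterwards.

```python
import pandas as pd

# Made-up sample data; only the 'result' column name comes from the question.
df1 = pd.DataFrame({"result": ["passed", "failed", "passed"], "score": [1, 2, 3]})
df2 = pd.DataFrame({"result": ["failed", "failed", "passed"], "score": [4, 5, 6]})
df_list = [df1, df2]

# Alternative: leave the originals alone and collect new, filtered DataFrames.
filtered = [df[df["result"] == "passed"] for df in df_list]
lengths_before = [len(df) for df in df_list]  # originals are still unfiltered here

# In-place approach from the answer: drop the unwanted rows by index label,
# so the objects named df1 and df2 themselves are modified.
for df in df_list:
    df.drop(df.index[df["result"] != "passed"], inplace=True)

print(df1)
print(df2)
```

The in-place loop suits code that keeps referring to df1, df2, and so on by name; the list comprehension is simpler when only the list is used from then on.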