Pandas: filter the columns of a DataFrame with a boolean mask
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/37391539/
Asked by mohitos
Given a DataFrame (df) with multiple rows and columns:
   A  B  C  D
0  1  4  2  6
1  2  5  7  4
2  3  6  5  6
and another DataFrame (dfBool) of dtype bool:
0     True
1    False
2    False
3     True
What is the easiest way to split df by columns into two DataFrames, using the transposed dfBool as the mask, so that I get the desired output:
   A  D
0  1  6
1  2  4
2  3  6

   B  C
0  4  2
1  5  7
2  6  5
With my limited experience, I cannot understand why dfTrue = df[dfBool.transpose() == True] does not work.
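One likely explanation, sketched under the assumption that dfBool is a one-column DataFrame (the column name 'a' is made up for illustration): indexing a DataFrame with a boolean DataFrame applies an element-wise mask aligned on row and column labels rather than selecting columns. The transposed dfBool shares no labels with df, so every cell ends up masked out:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
# Assumption: dfBool is a one-column DataFrame; the name 'a' is illustrative.
dfBool = pd.DataFrame({'a': [True, False, False, True]})

# A boolean DataFrame key is applied cell by cell after aligning labels.
# The transposed mask has columns 0..3 and index 'a', so nothing lines up
# with df (index 0..2, columns A..D) and each cell is treated as False.
masked = df[dfBool.transpose() == True]
print(masked.shape)               # same shape as df
print(masked.isna().all().all())  # every value has been replaced by NaN

# A 1-D mask with one entry per column selects columns instead.
dfTrue = df.loc[:, dfBool['a'].to_numpy()]
print(list(dfTrue.columns))
```

The key point is that column selection needs a one-dimensional mask whose length equals the number of columns.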
Answered by jezrael
I would like to modify EdChum's comment, because if dfBool is a DataFrame, you first have to select a column:
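import pandas as pd

# Keys are listed in A, B, C, D order so that the column order matches the
# output below on pandas versions that preserve dict insertion order.
df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 3},
                   'B': {0: 4, 1: 5, 2: 6},
                   'C': {0: 2, 1: 7, 2: 5},
                   'D': {0: 6, 1: 4, 2: 6}})
print(df)
   A  B  C  D
0  1  4  2  6
1  2  5  7  4
2  3  6  5  6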
dfBool = pd.DataFrame({'a': [True, False, False, True]})
print(dfBool)
       a
0   True
1  False
2  False
3   True

# select the first column of dfBool
df2 = dfBool.iloc[:, 0]
# or select column a of dfBool
# df2 = dfBool.a
print(df2)
0     True
1    False
2    False
3     True
Name: a, dtype: bool
print(df[df.columns[df2]])
   A  D
0  1  6
1  2  4
2  3  6

print(df[df.columns[~df2]])
   B  C
0  4  2
1  5  7
2  6  5
Another very nice solution, from ayhan (thank you):
print(df.loc[:, dfBool.a.values])
   A  D
0  1  6
1  2  4
2  3  6

print(df.loc[:, ~dfBool.a.values])
   B  C
0  4  2
1  5  7
2  6  5
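Either approach can be wrapped in a small helper that returns both halves at once. The following is only a sketch: `split_columns` is a hypothetical name, not a pandas function, and the length check is an added assumption about what a caller would want.

```python
import numpy as np
import pandas as pd


def split_columns(frame, mask):
    """Return (selected, rest) column subsets of `frame` for a boolean mask.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Flatten to a plain 1-D boolean array so index labels play no role.
    flags = np.asarray(mask, dtype=bool).ravel()
    if flags.size != frame.shape[1]:
        raise ValueError("mask needs exactly one entry per column")
    return frame.loc[:, flags], frame.loc[:, ~flags]


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.DataFrame({'a': [True, False, False, True]})

dfTrue, dfFalse = split_columns(df, dfBool)
print(list(dfTrue.columns), list(dfFalse.columns))
```

Because the mask is flattened first, the same call accepts a one-column DataFrame, a Series, or a plain list.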
But if dfBool is a Series, the solution works without selecting a column first:
dfBool = pd.Series([True, False, False, True])
print(dfBool)
0     True
1    False
2    False
3     True
dtype: bool

print(df[df.columns[dfBool]])
   A  D
0  1  6
1  2  4
2  3  6

print(df[df.columns[~dfBool]])
   B  C
0  4  2
1  5  7
2  6  5
And the .loc variant for a Series:
print(df.loc[:, dfBool.values])
   A  D
0  1  6
1  2  4
2  3  6

print(df.loc[:, ~dfBool.values])
   B  C
0  4  2
1  5  7
2  6  5
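The `.values` in the .loc calls matters: when .loc receives a boolean Series it tries to align the Series index with the axis labels, and a default 0..3 index does not match the column labels A..D. A minimal sketch of the two ways around this (the variable names `by_position`, `labelled` and `by_label` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
dfBool = pd.Series([True, False, False, True])

# Passing the Series itself makes .loc try to align its index (0..3)
# with the column labels (A..D); pandas rejects the unalignable mask.
try:
    df.loc[:, dfBool]
    aligned_ok = True
except Exception as exc:
    aligned_ok = False
    print(type(exc).__name__)

# Option 1: drop the index and select by position.
by_position = df.loc[:, dfBool.to_numpy()]

# Option 2: label the mask with the column names and let pandas align it.
labelled = pd.Series(dfBool.to_numpy(), index=df.columns)
by_label = df.loc[:, labelled]

print(list(by_position.columns), list(by_label.columns))
```

Option 2 is the safer choice when the mask was computed per column name, since a reordered frame still picks the right columns.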
Timings:
In [277]: %timeit (df[df.columns[dfBool.a]])
1000 loops, best of 3: 769 μs per loop
In [278]: %timeit (df.loc[:, dfBool1.a.values])
The slowest run took 9.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 380 μs per loop
In [279]: %timeit (df.transpose()[dfBool1.a.values].transpose())
The slowest run took 5.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 550 μs per loop
Code for timings:
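import pandas as pd

# Keys are listed in A, B, C, D order so the mask lines up with the
# columns on pandas versions that preserve dict insertion order.
df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 3},
                   'B': {0: 4, 1: 5, 2: 6},
                   'C': {0: 2, 1: 7, 2: 5},
                   'D': {0: 6, 1: 4, 2: 6}})
print(df)
# repeat the four columns 1000 times (4000 columns in total)
df = pd.concat([df]*1000, axis=1).reset_index(drop=True)
dfBool = pd.DataFrame({'a': [True, False, False, True]})
# repeat the mask 1000 times so it has one entry per column
dfBool1 = pd.concat([dfBool]*1000).reset_index(drop=True)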
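Outside IPython, where %timeit is not available, a comparable measurement can be sketched with the standard-library timeit module. The names `mask` and `candidates` and the loop counts are arbitrary choices for illustration, and the numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [2, 7, 5], 'D': [6, 4, 6]})
df = pd.concat([df] * 1000, axis=1)  # 4000 columns
# One boolean per column, as a plain NumPy array.
mask = pd.concat([pd.Series([True, False, False, True])] * 1000).to_numpy()

candidates = {
    'loc': lambda: df.loc[:, mask],
    'transpose': lambda: df.transpose()[mask].transpose(),
}

for name, func in candidates.items():
    # Best of 3 runs, 20 calls per run, reported per call.
    seconds = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```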
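The output is slightly different: the first expression selects by label, so all A columns come before all D columns, while the other two select by position and keep the original A, D interleaving: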
print(df[df.columns[dfBool.a]])
   A  A  A  A  A  A  A  A  A  A ...  D  D  D  D  D  D  D  D  D  D
0  1  1  1  1  1  1  1  1  1  1 ...  6  6  6  6  6  6  6  6  6  6
1  2  2  2  2  2  2  2  2  2  2 ...  4  4  4  4  4  4  4  4  4  4
2  3  3  3  3  3  3  3  3  3  3 ...  6  6  6  6  6  6  6  6  6  6
[3 rows x 2000 columns]

print(df.loc[:, dfBool1.a.values])
   A  D  A  D  A  D  A  D  A  D ...  A  D  A  D  A  D  A  D  A  D
0  1  6  1  6  1  6  1  6  1  6 ...  1  6  1  6  1  6  1  6  1  6
1  2  4  2  4  2  4  2  4  2  4 ...  2  4  2  4  2  4  2  4  2  4
2  3  6  3  6  3  6  3  6  3  6 ...  3  6  3  6  3  6  3  6  3  6
[3 rows x 2000 columns]

print(df.transpose()[dfBool1.a.values].transpose())
   A  D  A  D  A  D  A  D  A  D ...  A  D  A  D  A  D  A  D  A  D
0  1  6  1  6  1  6  1  6  1  6 ...  1  6  1  6  1  6  1  6  1  6
1  2  4  2  4  2  4  2  4  2  4 ...  2  4  2  4  2  4  2  4  2  4
2  3  6  3  6  3  6  3  6  3  6 ...  3  6  3  6  3  6  3  6  3  6
[3 rows x 2000 columns]
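The difference between selecting by label and selecting by position only shows up when column names repeat. A small sketch with a made-up four-column frame (`base` and `wide` are illustrative names):

```python
import pandas as pd

base = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
wide = pd.concat([base, base], axis=1)  # columns: A, B, A, B

# Selecting by label returns every column that carries the name A.
by_label = wide[['A']]

# Selecting with a boolean mask returns only the flagged positions.
by_position = wide.loc[:, [True, False, False, False]]

print(by_label.shape, by_position.shape)
```

With unique column names both forms return the same columns, which is why the earlier four-column examples agree.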
Answered by Greg Lever
Maybe something like the following?
import pandas as pd

totalDF = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [2, 7, 5], 'D': [6, 4, 8]})
dfBool = pd.DataFrame(data=[True, False, False, True])

totalDF.transpose()[dfBool.values].transpose()
   A  D
0  1  6
1  2  4
2  3  8

totalDF.transpose()[~dfBool.values].transpose()
   B  C
0  4  2
1  5  7
2  6  5
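One caveat with the double transpose, shown here on an assumed frame with mixed column types (the example above is all integers, so it is not affected): transposing forces the values into a common dtype, and the round trip does not restore the original types, whereas .loc with a flattened mask leaves them alone. The names `mixed`, `via_transpose` and `via_loc` are illustrative.

```python
import pandas as pd

# Assumption: a frame with mixed column types, unlike the all-integer example.
mixed = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z'],
                      'C': [2.5, 7.0, 5.5], 'D': [6, 4, 8]})
# Flatten the (4, 1) array of the one-column DataFrame into a 1-D mask.
mask = pd.DataFrame(data=[True, False, False, True]).values.ravel()

via_transpose = mixed.transpose()[mask].transpose()
via_loc = mixed.loc[:, mask]

print(via_transpose.dtypes.tolist())  # object columns after the round trip
print(via_loc.dtypes.tolist())        # the integer dtypes of A and D are kept
```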