原文地址: http://stackoverflow.com/questions/46653647/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas: filter on multiple columns
Asked by M. K. Hunter
I am working in Pandas, and I want to apply multiple filters to a data frame across multiple fields.
My real data frame is more complex, but I am simplifying the context for this question. Here is the setup for a sample data frame:
import numpy as np
import pandas as pd
dates = pd.date_range('20170101', periods=16)
rand_df = pd.DataFrame(np.random.randn(16, 4), index=dates, columns=list('ABCD'))
Applying one filter to this data frame is well documented and simple:
rand_df.loc[lambda df: df['A'] < 0]
Since the lambda looks like a simple boolean expression, it is tempting to do the following. This does not work, because Python's and needs a single True/False value from each side, while each comparison here returns a whole boolean Series, so the conditions cannot be combined the way plain boolean expressions would:
rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-dfa05ab293f9> in <module>()
----> 1 rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
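A minimal sketch of the failure and of the element-wise alternative, assuming a frame shaped like the one in the question (the fixed seed and the names df, and_failed and both_negative are illustrative only). The and operator asks the left-hand Series for one truth value, which pandas refuses to give, while & combines the two masks row by row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
dates = pd.date_range('20170101', periods=16)
df = pd.DataFrame(rng.standard_normal((16, 4)), index=dates, columns=list('ABCD'))

# `and` calls bool() on the left-hand Series, which raises ValueError.
try:
    df.loc[lambda d: d['A'] < 0 and d['B'] < 0]
    and_failed = False
except ValueError:
    and_failed = True

# `&` works element-wise; the parentheses are required because `&`
# binds more tightly than `<`.
both_negative = df.loc[lambda d: (d['A'] < 0) & (d['B'] < 0)]
```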
I have found two ways to successfully implement this. I will add them to the potential answers, so you can comment directly on them as solutions. However, I would like to solicit other approaches, since I am not really sure that either of these is a very standard approach for filtering a Pandas data frame.
Answer by MaxU
In [3]: rand_df.query("A < 0 and B < 0")
Out[3]:
                   A         B         C         D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199  1.202082  0.136657
2017-01-08 -1.445050 -0.367112 -2.617743  0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
or:
In [6]: rand_df[rand_df[['A', 'B']].lt(0).all(axis=1)]
Out[6]:
                   A         B         C         D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199  1.202082  0.136657
2017-01-08 -1.445050 -0.367112 -2.617743  0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
PS You will find a lot of examples in the Pandas docs
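As a rough sketch of how the two forms in this answer relate, the snippet below checks that they select the same rows. The seeded frame and the threshold variable are assumptions made for illustration; the @ prefix is how query refers to a local Python variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((16, 4)),
                  index=pd.date_range('20170101', periods=16),
                  columns=list('ABCD'))

threshold = 0  # hypothetical cutoff, referenced inside query() as @threshold

# String expression: `and` is allowed here because query() parses it itself.
by_query = df.query("A < @threshold and B < @threshold")

# Same test as a mask: compare both columns, then require every column
# in the row to satisfy the comparison.
by_mask = df[df[['A', 'B']].lt(threshold).all(axis=1)]
```

The mask form scales to more columns by extending the column list, which may be handy when the same comparison applies to all of them.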
Answer by DJK
rand_df[(rand_df.A < 0) & (rand_df.B < 0)]
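The parentheses in this form matter: & binds more tightly than <, so without them Python would group the expression as rand_df.A < (0 & rand_df.B) < 0, which is not the intended test. A small sketch of the companion operators | (or) and ~ (not), using an illustrative seeded frame rather than the asker's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((16, 4)),
                  index=pd.date_range('20170101', periods=16),
                  columns=list('ABCD'))

# Rows where at least one of A, B is negative.
either_negative = df[(df.A < 0) | (df.B < 0)]

# Rows where neither is negative: negate the whole mask with ~.
neither_negative = df[~((df.A < 0) | (df.B < 0))]
```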
Answer by piRSquared
To use the lambda, combine the conditions with & (each wrapped in parentheses) instead of and:
rand_df.loc[lambda x: (x.A < 0) & (x.B < 0)]
# Or
# rand_df[lambda x: (x.A < 0) & (x.B < 0)]
                   A         B         C         D
2017-01-12 -0.460918 -1.001184 -0.796981  0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
You can speed up the evaluation by using boolean numpy arrays
c1 = rand_df.A.values < 0
c2 = rand_df.B.values < 0
rand_df[c1 & c2]
                   A         B         C         D
2017-01-12 -0.460918 -1.001184 -0.796981  0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
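A hedged sketch of the NumPy variant, checking only that it selects the same rows as the Series-based mask. It uses to_numpy(), the accessor the pandas documentation currently recommends over .values, on an illustrative seeded frame. Whether it is actually faster depends on the size of the data, so it is worth measuring with timeit on your own frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((16, 4)),
                  index=pd.date_range('20170101', periods=16),
                  columns=list('ABCD'))

# Plain boolean ndarrays: no index alignment happens when they are combined.
c1 = df['A'].to_numpy() < 0
c2 = df['B'].to_numpy() < 0
by_numpy = df[c1 & c2]

# The same filter built from boolean Series, for comparison.
by_series = df[(df['A'] < 0) & (df['B'] < 0)]
```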
Answer by M. K. Hunter
Here is an approach that "chains" use of the 'loc' operation:
rand_df.loc[lambda df: df['A'] < 0].loc[lambda df: df['B'] < 0]
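Chaining applies the filters one after another, which amounts to and-ing them together. When the number of conditions varies, one possible alternative (a sketch, not part of the original answer) is to collect the masks in a list and fold them with functools.reduce. The seeded frame and the third condition on column C are assumptions added for illustration.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((16, 4)),
                  index=pd.date_range('20170101', periods=16),
                  columns=list('ABCD'))

# Any number of boolean Series can go in this list.
conditions = [df['A'] < 0, df['B'] < 0, df['C'] < 0]

# Fold the list with `&`: cond0 & cond1 & cond2 ...
mask = reduce(operator.and_, conditions)
combined = df[mask]

# The chained-loc form from the answer above, extended to column C.
chained = (df.loc[lambda d: d['A'] < 0]
             .loc[lambda d: d['B'] < 0]
             .loc[lambda d: d['C'] < 0])
```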
Answer by M. K. Hunter
Here is an approach that wraps the filtering in a function. I am sure that some filters will be complex enough that a function is the best way to go (this case is not). Also, when I am using Pandas and I write a "for" loop, I feel like I am doing it wrong.
def lt_zero_ab(df):
    result = []
    for index, row in df.iterrows():
        if row['A'] < 0 and row['B'] < 0:
            result.append(index)
    return result

rand_df.loc[lt_zero_ab]
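The loop can probably be avoided while keeping the filter in a named function: loc accepts a callable that returns a boolean mask just as well as one that returns a list of labels. A sketch with a small hand-written frame (the values and the name lt_zero_ab_mask are illustrative), compared against the loop version above:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [-1.0, 0.5, -0.2, 0.3], 'B': [-0.5, -1.0, 0.4, 0.2]},
    index=pd.date_range('20170101', periods=4),
)

def lt_zero_ab_mask(frame):
    """Return a boolean Series: True where both A and B are negative."""
    return (frame['A'] < 0) & (frame['B'] < 0)

def lt_zero_ab_loop(frame):
    """Loop version from the answer: returns the matching index labels."""
    result = []
    for index, row in frame.iterrows():
        if row['A'] < 0 and row['B'] < 0:
            result.append(index)
    return result

vectorized = df.loc[lt_zero_ab_mask]
looped = df.loc[lt_zero_ab_loop]
```

Inside the loop, and is fine because row['A'] is a single number there; the vectorized function needs & because it compares whole columns at once.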