Python 数据框上的多条件过滤器
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/45978108/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Multiple condition filter on dataframe
提问by femibyte
Can anyone explain to me why I am getting different results for these 2 expressions ? I am trying to filter between 2 dates:
谁能向我解释为什么这两个表达式得到不同的结果?我正在尝试在 2 个日期之间进行过滤:
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
Result : 37M
结果:37M
vs
对比
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
Result: 25M
结果:25M
How are they different ? It seems to me like they should produce the same result
它们有何不同?在我看来他们应该产生相同的结果
回答by zero323
TL;DRTo pass multiple conditions to filter
or where
use Column
objects and logical operators (&
, |
, ~
). See Pyspark: multiple conditions in when clause.
TL;DR将多个条件传递给filter
或where
使用Column
对象和逻辑运算符 ( &
, |
, ~
)。请参阅Pyspark:when 子句中的多个条件。
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
You can also use a singleSQL string:
您还可以使用单个SQL 字符串:
df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")
In practice it makes more sense to use between:
在实践中,在以下之间使用更有意义:
df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
The first approach is not even remote valid. In Python, and
returns:
第一种方法甚至不是远程有效的。在 Python 中,and
返回:
- The last element if all expressions are "truthy".
- The first "falsey" element otherwise.
- 如果所有表达式都是“真实的”,则为最后一个元素。
- 否则第一个“falsey”元素。
As a result
其结果
"act_date <='2017-04-01'" and "act_date >='2016-10-01'"
is evaluated to (any non-empty string is truthy):
被评估为(任何非空字符串都是真的):
"act_date >='2016-10-01'"
回答by Ash Man
In first case
在第一种情况下
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
the result is values more than 2016-10-01 that means all the values above 2017-04-01 also.
结果是大于 2016-10-01 的值,这意味着也大于 2017-04-01 的所有值。
Whereas in second case
而在第二种情况下
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
the result is the values between 2016-10-01 to 2017-04-01.
结果是 2016-10-01 到 2017-04-01 之间的值。