
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45978108/

Date: 2020-08-19 17:23:52  Source: igfitidea

Multiple condition filter on dataframe

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by femibyte

Can anyone explain to me why I am getting different results for these 2 expressions ? I am trying to filter between 2 dates:


df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result : 37M


vs


df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result: 25M


How are they different? It seems to me that they should produce the same result.


Answered by zero323

TL;DR: To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.


from pyspark.sql.functions import col

df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

You can also use a single SQL string:


df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")

In practice it makes more sense to use between:


df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
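As an aside, these string bounds only work because 'yyyy-MM-dd' date strings sort lexicographically in the same order as chronologically. A quick plain-Python check (no Spark needed):

```python
# ISO-8601 'yyyy-MM-dd' strings compare lexicographically in the same
# order as the dates they represent, which is why string comparisons
# behave correctly in the filters above.
dates = ["2017-04-01", "2016-10-01", "2016-12-31"]
print(sorted(dates))  # ['2016-10-01', '2016-12-31', '2017-04-01']
print("2016-10-01" <= "2016-12-31" <= "2017-04-01")  # True
```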

The first approach is not even remotely valid. In Python, and returns:


  • The last element if all expressions are "truthy".
  • The first "falsey" element otherwise.

As a result


"act_date <='2017-04-01'" and "act_date >='2016-10-01'"

is evaluated to (any non-empty string is truthy):


"act_date >='2016-10-01'"
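The collapse is easy to reproduce in a plain Python shell, no Spark required:

```python
# `and` between two non-empty strings returns the second string,
# so Spark only ever receives one condition.
cond = "act_date <='2017-04-01'" and "act_date >='2016-10-01'"
print(cond)  # act_date >='2016-10-01'

# With a falsey first operand (an empty string), `and` returns it instead:
print(repr("" and "act_date >='2016-10-01'"))  # ''
```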

Answered by Ash Man

In the first case


df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is all rows with act_date >= 2016-10-01, which also includes rows after 2017-04-01.


Whereas in the second case


df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is only the rows between 2016-10-01 and 2017-04-01.

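To make the difference concrete, here is a small plain-Python sketch over hypothetical sample dates (lexicographic string comparison stands in for Spark's filter):

```python
# Hypothetical sample dates, purely for illustration.
dates = ["2016-05-01", "2016-12-15", "2017-03-31", "2017-09-01"]

# First case: `and` collapses the two strings into just the lower-bound
# condition, so only act_date >= '2016-10-01' is applied.
broken = [d for d in dates if d >= "2016-10-01"]

# Second case: chained filters apply both bounds.
chained = [d for d in dates if d <= "2017-04-01" and d >= "2016-10-01"]

print(broken)   # ['2016-12-15', '2017-03-31', '2017-09-01']
print(chained)  # ['2016-12-15', '2017-03-31']
```

The out-of-range date 2017-09-01 survives the broken filter but not the chained one, which is exactly why the first count (37M) is larger than the second (25M).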