
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45978108/

Date: 2020-08-19 17:23:52  Source: igfitidea

Multiple condition filter on dataframe

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by femibyte

Can anyone explain to me why I am getting different results for these 2 expressions ? I am trying to filter between 2 dates:


df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result : 37M


vs


df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result: 25M


How are they different? It seems to me that they should produce the same result.


Answered by zero323

TL;DR: To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.


from pyspark.sql.functions import col

df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

You can also use a single SQL string:


df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")

In practice it makes more sense to use between:


df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
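As an aside, these string bounds only work because 'yyyy-MM-dd' date strings sort lexicographically in the same order as chronologically. A quick plain-Python check (no Spark needed):

```python
# ISO-8601 'yyyy-MM-dd' strings compare lexicographically in the same
# order as the dates they represent, which is why string comparisons
# behave correctly in the filters above.
dates = ["2017-04-01", "2016-10-01", "2016-12-31"]
print(sorted(dates))  # ['2016-10-01', '2016-12-31', '2017-04-01']
print("2016-10-01" <= "2016-12-31" <= "2017-04-01")  # True
```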

The first approach is not even remotely valid. In Python, and returns:


  • The last element if all expressions are "truthy".
  • The first "falsey" element otherwise.

As a result


"act_date <='2017-04-01'" and "act_date >='2016-10-01'"

is evaluated to (any non-empty string is truthy):


"act_date >='2016-10-01'"
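The collapse is easy to reproduce in a plain Python shell, no Spark required:

```python
# `and` between two non-empty strings returns the second string,
# so Spark only ever receives one condition.
cond = "act_date <='2017-04-01'" and "act_date >='2016-10-01'"
print(cond)  # act_date >='2016-10-01'

# With a falsey first operand (an empty string), `and` returns it instead:
print(repr("" and "act_date >='2016-10-01'"))  # ''
```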

Answered by Ash Man

In the first case


df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is all rows with act_date >= 2016-10-01, which also includes rows after 2017-04-01.


Whereas in the second case


df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is only the rows between 2016-10-01 and 2017-04-01.

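To make the difference concrete, here is a small plain-Python sketch over hypothetical sample dates (lexicographic string comparison stands in for Spark's filter):

```python
# Hypothetical sample dates, purely for illustration.
dates = ["2016-05-01", "2016-12-15", "2017-03-31", "2017-09-01"]

# First case: `and` collapses the two strings into just the lower-bound
# condition, so only act_date >= '2016-10-01' is applied.
broken = [d for d in dates if d >= "2016-10-01"]

# Second case: chained filters apply both bounds.
chained = [d for d in dates if d <= "2017-04-01" and d >= "2016-10-01"]

print(broken)   # ['2016-12-15', '2017-03-31', '2017-09-01']
print(chained)  # ['2016-12-15', '2017-03-31']
```

The out-of-range date 2017-09-01 survives the broken filter but not the chained one, which is exactly why the first count (37M) is larger than the second (25M).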