SQL: Comparison operator in PySpark (not equal / !=)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/39120934/


Comparison operator in PySpark (not equal/ !=)

Tags: sql, apache-spark, pyspark, null, apache-spark-sql

Asked by Hendrik F

I am trying to obtain all rows in a dataframe where two flags are set to '1', and then all rows where only one of the two flags is set to '1' and the other is NOT EQUAL to '1'.


With the following schema (three columns),


df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
                            schema=('id', 'foo', 'bar')
                            )

I obtain the following dataframe:


+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but not the other (foo=1 AND NOT bar=1)


foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

yields:


+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the filter that does not behave as expected:


foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why is it not filtering? How can I get the rows where only foo is equal to '1'?


Answered by zero323

Why is it not filtering


Because it is SQL, and NULL indicates a missing value. Because of that, any comparison to NULL other than IS NULL and IS NOT NULL is undefined. You need either:


col("bar").isNull() | (col("bar") != 1)

or


coalesce(col("bar") != 1, lit(True))

or (PySpark >= 2.3):


col("bar").eqNullSafe(1)

if you want null safe comparisons in PySpark.

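To see the difference, here is a minimal sketch (my addition, not part of the original answer; it assumes PySpark >= 2.3 and the corrected df built with None further below). The plain != comparison returns NULL for NULL inputs, so where()/filter() drops those rows, while the negated null-safe equality evaluates to True and keeps them:

from pyspark.sql.functions import col

# NULL != 1 evaluates to NULL (treated as false by where/filter),
# while the null-safe NULL <=> 1 evaluates to False, so its negation is True.
df.select(
    "id", "bar",
    (col("bar") != 1).alias("plain_neq"),
    (~col("bar").eqNullSafe(1)).alias("null_safe_neq")
).show()

# The desired filter expressed with the null-safe operator:
df.where((col("foo") == 1) & ~col("bar").eqNullSafe(1)).show()

## Should keep only the rows with foo = 1 and bar NULL (ids a and c).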

Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.


from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a',1, None), ('b', 1, 1),
    ('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

Answered by johnaphun

To filter null values try:


foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )


https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull

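As a usage sketch (my addition, assuming the corrected DataFrame from the previous answer, i.e. one built with None rather than the string 'null'):

# Rows where foo is 1 and bar is missing.
foo_df = df.filter((df.foo == 1) & (df.bar.isNull()))
foo_df.show()

## Should match the output shown above (ids a and c, with bar null).

# The complementary check uses isNotNull():
df.filter((df.foo == 1) & (df.bar.isNotNull())).show()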