SQL: Comparison operator in PySpark (not equal / !=)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/39120934/


Comparison operator in PySpark (not equal/ !=)

Tags: sql, apache-spark, pyspark, null, apache-spark-sql

Asked by Hendrik F

I am trying to obtain all rows in a dataframe where two flags are set to '1', and then all rows where only one of the two flags is set to '1' and the other is NOT EQUAL to '1'.


With the following schema (three columns),


df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
                            schema=('id', 'foo', 'bar')
                            )

I obtain the following dataframe:


+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but not the other (foo=1 AND NOT bar=1)


foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

yields:


+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the filter that does not behave as expected:


foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why is it not filtering? How can I get the rows where only foo is equal to '1'?


Answered by zero323

Why is it not filtering


Because it is SQL, and NULL indicates a missing value. Because of that, any comparison to NULL other than IS NULL and IS NOT NULL is undefined. You need either:


col("bar").isNull() | (col("bar") != 1)

or


coalesce(col("bar") != 1, lit(True))

or (PySpark >= 2.3):


col("bar").eqNullSafe(1)

if you want null safe comparisons in PySpark.

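To see the difference, here is a minimal sketch (my addition, not part of the original answer; it assumes PySpark >= 2.3 and the corrected df built with None further below). The plain != comparison returns NULL for NULL inputs, so where()/filter() drops those rows, while the negated null-safe equality evaluates to True and keeps them:

from pyspark.sql.functions import col

# NULL != 1 evaluates to NULL (treated as false by where/filter),
# while the null-safe NULL <=> 1 evaluates to False, so its negation is True.
df.select(
    "id", "bar",
    (col("bar") != 1).alias("plain_neq"),
    (~col("bar").eqNullSafe(1)).alias("null_safe_neq")
).show()

# The desired filter expressed with the null-safe operator:
df.where((col("foo") == 1) & ~col("bar").eqNullSafe(1)).show()

## Should keep only the rows with foo = 1 and bar NULL (ids a and c).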

Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.


from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a',1, None), ('b', 1, 1),
    ('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

Answered by johnaphun

To filter null values try:


foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )


https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull

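As a usage sketch (my addition, assuming the corrected DataFrame from the previous answer, i.e. one built with None rather than the string 'null'):

# Rows where foo is 1 and bar is missing.
foo_df = df.filter((df.foo == 1) & (df.bar.isNull()))
foo_df.show()

## Should match the output shown above (ids a and c, with bar null).

# The complementary check uses isNotNull():
df.filter((df.foo == 1) & (df.bar.isNotNull())).show()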