Comparison operator in PySpark (not equal / !=)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39120934/
Comparison operator in PySpark (not equal/ !=)
Asked by Hendrik F
I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all those where only one of the two is set to '1' and the other is NOT EQUAL to '1'.
With the following schema (three columns),
df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
schema=('id', 'foo', 'bar')
)
I obtain the following dataframe:
+---+----+----+
| id| foo| bar|
+---+----+----+
| a| 1|null|
| b| 1| 1|
| c| 1|null|
| d|null| 1|
| e| 1| 1|
+---+----+----+
When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but not the other (foo=1 AND NOT bar=1)
foobar_df = df.filter( (df.foo==1) & (df.bar==1) )
yields:
+---+---+---+
| id|foo|bar|
+---+---+---+
| b| 1| 1|
| e| 1| 1|
+---+---+---+
Here is the non-behaving filter:
foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+
Why is it not filtering? How can I get the columns where only foo is equal to '1'?
Answered by zero323
Why is it not filtering
Because it is SQL, and NULL indicates missing values. Because of that, any comparison to NULL, other than IS NULL and IS NOT NULL, is undefined. You need either:
col("bar").isNull() | (col("bar") != 1)
or
coalesce(col("bar") != 1, lit(True))
or (PySpark >= 2.3):
col("bar").eqNullSafe(1)
if you want null safe comparisons in PySpark.
Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.
from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')
df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
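A minimal sketch of the eqNullSafe option above (PySpark >= 2.3), applied to the same df: negating the null-safe equality keeps rows where bar is NULL or not equal to 1, so it should produce the same result as the two filters shown.
# ~eqNullSafe(1) never evaluates to NULL, so no rows are silently dropped
df.where((col("foo") == 1) & ~col("bar").eqNullSafe(1)).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+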
Answered by johnaphun
To filter null values try:
foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
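This works for the sample data because bar is only ever 1 or NULL. If bar could also hold non-null values other than 1, a minimal sketch combining isNull() with the inequality (as in the answer above) would be:
foo_df = df.filter((df.foo == 1) & (df.bar.isNull() | (df.bar != 1)))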