
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/37707305/

Date: 2020-08-19 19:49:54 | Source: igfitidea

PySpark: multiple conditions in when clause

Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by sjishan

I would like to modify the cell values of a dataframe column (Age) where it is currently blank, and only do so if another column (Survived) has the value 0 in the corresponding row. If Survived is 1 but Age is blank, I will keep it as null.

I tried to use the && operator but it didn't work. Here is my code:

tdata.withColumn("Age",  when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()

Any suggestions on how to handle that? Thanks.

Error Message:


  File "<ipython-input-33-3e691784411c>", line 1
    tdata.withColumn("Age",  when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
                                                    ^
SyntaxError: invalid syntax

Answered by zero323

You get a SyntaxError exception because Python has no && operator. It has and and &, where the latter is the correct choice to create boolean expressions on Column (| for logical disjunction and ~ for logical negation).

The condition you created is also invalid because it doesn't consider operator precedence. & in Python has a higher precedence than ==, so each comparison has to be parenthesized.

(col("Age") == "") & (col("Survived") == "0")
## Column<b'((Age = ) AND (Survived = 0))'>
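The precedence pitfall is plain Python behavior, not something Spark-specific. A quick sketch (no Spark needed) of how & binds more tightly than ==, which is why each comparison in the Column expression needs its own parentheses:

```python
# & binds more tightly than ==, so  a == b & c == d  is parsed as the
# chained comparison  a == (b & c) == d, not (a == b) & (c == d).
without_parens = 1 & 2 == 2    # parsed as (1 & 2) == 2  ->  0 == 2  ->  False
with_parens = 1 & (2 == 2)     # 1 & True  ->  1, the intended "and"-style result

print(without_parens, with_parens)
```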

On a side note, the when function is equivalent to a CASE expression, not a WHEN clause. Still, the same rules apply. Conjunction:

df.where((col("foo") > 0) & (col("bar") < 0))

Disjunction:


df.where((col("foo") > 0) | (col("bar") < 0))

You can of course define conditions separately to avoid brackets:


cond1 = col("Age") == "" 
cond2 = col("Survived") == "0"

cond1 & cond2

Answered by vj sreenivasan

In PySpark, multiple conditions in when can be built using & (for and) and | (for or).

Note: in PySpark it is important to enclose every expression in parentheses () that combine to form the condition.

%pyspark
from pyspark.sql.functions import col, when
dataDF = spark.createDataFrame([(66, "a", "4"), 
                                (67, "a", "0"), 
                                (70, "b", "4"), 
                                (71, "d", "4")],
                                ("id", "code", "amt"))
dataDF.withColumn("new_column",
       when((col("code") == "a") | (col("code") == "d"), "A")
      .when((col("code") == "b") & (col("amt") == "4"), "B")
      .otherwise("A1")).show()

In Spark Scala code, (&&) or (||) conditions can be used within the when function:

//scala
val dataDF = Seq(
      (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
    ).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
       when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")
      .otherwise("A1")).show()


Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+

This code snippet is copied from sparkbyexamples.com


Answered by Jose Alberto Gonzalez

It should work, at least in PySpark 2.4:

tdata = tdata.withColumn("Age",  when((tdata.Age == "") & (tdata.Survived == "0") , "NewValue").otherwise(tdata.Age))

Answered by mahatmawx

It should be:


when(((tdata.Age == "") & (tdata.Survived == "0")), mean_age_0)