Where do you need to use lit() in Pyspark SQL?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/37715060/
Asked by flybonzai
I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation.
Take for example this udf, which returns the element at a given index of a SQL array column:
def find_index(column, index):
return column[index]
If I were to pass an integer into this I would get an error. I would need to pass a lit(n) value into the udf to get the correct element of the array.
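For context, here is a minimal sketch of the failure and the fix (the SparkSession setup, DataFrame, and schema are made up for illustration; find_index is the function defined above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["a", "b", "c"],)], ["letters"])

find_index_udf = udf(find_index, StringType())  # wrap the plain Python function as a UDF

# df.select(find_index_udf(col("letters"), 1))  # TypeError: not a string or column
df.select(find_index_udf(col("letters"), lit(1))).show()  # second element, "b"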
Is there a place I can better learn the hard and fast rules of when to use lit, and possibly col as well?
Answered by zero323
To keep it simple, you need a Column (which can be one created using lit, but it is not the only option) when the JVM counterpart expects a column and there is no internal conversion in the Python wrapper, or when you want to call a Column-specific method.
In the first case the only strict rule is the one that applies to UDFs: a UDF (Python or JVM) can be called only with arguments which are of Column type. This also typically applies to functions from pyspark.sql.functions. In other cases it is always best to check the documentation and docstrings first and, if that is not sufficient, the docs of the corresponding Scala counterpart.
In the second case the rules are simple. If you, for example, want to compare a column to a value, then the value has to be on the RHS:
col("foo") > 0 # OK
or the value has to be wrapped with a literal:
lit(0) < col("foo") # OK
In Python many operators (<, ==, <=, &, |, +, -, *, /) can use a non-column object on the LHS:
0 < col("foo")
but such applications are not supported in Scala.
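Putting those operator rules together (a minimal runnable sketch; the toy DataFrame is an assumption for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (-1,)], ["foo"])

df.filter(col("foo") > 0)       # value on the RHS: OK
df.filter(lit(0) < col("foo"))  # constant wrapped in lit on the LHS: OK
df.filter(0 < col("foo"))       # Python reflects this to col("foo") > 0, so it also works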
It goes without saying that you have to use lit if you want to access any of the pyspark.sql.Column methods while treating a standard Python scalar as a constant column. For example you'll need
c = lit(1)
not
c = 1
to
c.between(0, 3) # type: pyspark.sql.Column
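A quick way to see the difference (a sketch; the exact Column repr varies across Spark versions):

from pyspark.sql.functions import lit

c = lit(1)
c.between(0, 3)  # a Column expression, roughly Column<'((1 >= 0) AND (1 <= 3))'>

c = 1
# c.between(0, 3)  # AttributeError: 'int' object has no attribute 'between'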
Answered by Megha Jaiswal
A simple example could be:
df.withColumn("columnName", lit(column_value))
For example:
df = df.withColumn("Today's Date", lit(datetime.now()))
But first import the library:
from pyspark.sql.functions import lit
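As a complete runnable version of that answer (a sketch; the toy DataFrame and the todays_date column name are assumptions):

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["id"])

# lit() turns one Python datetime value into a constant column applied to every row
df = df.withColumn("todays_date", lit(datetime.now()))
df.show(truncate=False)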