Scala: Pass an array as a UDF parameter in Spark SQL
Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, include the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31036567/
Pass an array as a UDF parameter in Spark SQL
Asked by J Calbreath
I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories: Array[String], input: String): String = {
  categories(input.toInt)
}
val myArray = Array("a", "b", "c")
val myCategories = udf(getCategory _)
val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't accept arrays, so this script errors out. I tried defining a new partially applied function and then the UDF after that:
val newFunc = getCategory(myArray, _: String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either: I get a NullPointerException, and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function operating on a dataframe?
On a separate note, is there any explanation for why doing something as simple as using a function on a dataframe is so complicated (define the function, redefine it as a UDF, etc.)?
Answered by zero323
Most likely not the prettiest solution, but you can try something like this:
def getCategory(categories: Array[String]) = {
  udf((input: String) => categories(input.toInt))
}
df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
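For illustration, here is a minimal end-to-end sketch of that pattern, using hypothetical toy data (only the myInput column name comes from the question; the rest is made up):

// Minimal sketch with hypothetical toy data; assumes an active sqlContext.
import sqlContext.implicits._
import org.apache.spark.sql.functions.col

val myArray = Array("a", "b", "c")
val toy = Seq(Tuple1("0"), Tuple1("2")).toDF("myInput")

// "0" -> index 0 -> "a"; "2" -> index 2 -> "c"
toy.withColumn("newCategory", getCategory(myArray)(col("myInput"))).show()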
You could also try an array of literals:
// Note: Spark hands an ArrayType column to a Scala UDF as a Seq, not an Array,
// so the UDF parameter is declared as Seq[String] here.
val getCategory = udf(
  (input: String, categories: Seq[String]) => categories(input.toInt))

// array(myArray.map(lit(_)): _*) builds a single ArrayType column from
// per-element literals, i.e. array(lit("a"), lit("b"), lit("c")) for myArray.
df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note, using Map instead of Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
  udf((input: String) => categories.getOrElse(input, default))
}

val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")

df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
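A quick sketch (again with hypothetical toy data) of what the default buys you: keys missing from myMap fall back to "foo" instead of failing:

// Hypothetical toy data exercising the default fallback.
import sqlContext.implicits._
import org.apache.spark.sql.functions.col

val toy = Seq(Tuple1("1"), Tuple1("42")).toDF("myInput")

// "1" -> "a" (present in myMap); "42" -> "foo" (default)
toy.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput"))).show()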
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array

val colArray = array(myArray.map(lit _): _*)

// lit() is unnecessary here; colArray is already a Column.
myCategories(colArray, col("myInput"))
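For that call to succeed, the UDF must expect the array column as a Seq, which is how Spark hands ArrayType values to Scala UDFs. A compatible declaration would look roughly like this (myCategoriesSeq is a made-up name):

// Sketch: a UDF compatible with the array() column built above.
// ArrayType values arrive in Scala UDFs as Seq, not Array.
val myCategoriesSeq = udf(
  (categories: Seq[String], input: String) => categories(input.toInt))

val df1 = df.withColumn("newCategory", myCategoriesSeq(colArray, col("myInput")))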
See also Spark UDF with varargs

