Scala: applying a function to a Spark DataFrame column
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/35227568/
Applying function to Spark Dataframe Column
Asked by Michael Discenza
Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this function that I've written in Scala
def round_tenths_place( un_rounded:Double ) : Double = {
val rounded = BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
return rounded
}
And apply it to one column of a dataframe - kind of what I hoped this would do:
bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")) )
I haven't found any easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the dataframe to an RDD, selecting the right field from the RDD of rows, and mapping the function across all of the values, yeah? And also something more succinct than creating a SQL table and then doing this with a SparkSQL UDF?
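For reference, here is a rough sketch of the two more verbose routes mentioned in the question. The bid_results DataFrame and its bid_price column come from the question; the sqlContext handle and the temp-table name are placeholders, and a Spark 1.x-style API is assumed:

// Route 1: drop to the RDD of Rows, pull the field out by name and map the function
val rounded_rdd = bid_results.rdd
  .map(row => round_tenths_place(row.getAs[Double]("bid_price")))

// Route 2: register a temp table and a SQL UDF, then call it from SQL
// (sqlContext is assumed to be the active SQLContext)
bid_results.registerTempTable("bid_results")
sqlContext.udf.register("round_tenths_place", round_tenths_place _)
val bucketed = sqlContext.sql(
  "SELECT *, round_tenths_place(bid_price) AS bid_price_bucket FROM bid_results")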
Answered by zero323
You can define a UDF as follows:
import org.apache.spark.sql.functions.udf

val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn(
  "bid_price_bucket", round_tenths_place_udf($"bid_price"))
That said, the built-in round expression uses exactly the same logic as your function and should be more than enough, not to mention much more efficient:
import org.apache.spark.sql.functions.round
bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
See also:

