How to apply a function to a column of a Spark DataFrame?

Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/34614239/


Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by ranlot

Let's assume that we have a Spark DataFrame


df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame

with the following schema


df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
|    |-- element: string (containsNull = true)

Given that each row of the tk column is an array of strings, how can we write a Scala function that returns the number of elements in each row?


Answered by zero323

You don't have to write a custom function, because there is a built-in one:


import org.apache.spark.sql.functions.size

df.select(size($"tk"))
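For context, here is a minimal runnable sketch of the same call (assuming Spark 2.x+ with a SparkSession named spark in scope; the sample rows are hypothetical and merely mirror the schema above):

import org.apache.spark.sql.functions.size
import spark.implicits._

// Hypothetical data matching the schema: rawFV (string), tk (array of strings)
val df = Seq(
  ("row1", Seq("a", "b", "c")),
  ("row2", Seq("d", "e"))
).toDF("rawFV", "tk")

// size() returns the number of elements in each array of the tk column
df.select(size($"tk").as("tk_size")).show()
// +-------+
// |tk_size|
// +-------+
// |      3|
// |      2|
// +-------+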

If you really want to, you can write a UDF:


import org.apache.spark.sql.functions.udf

val size_ = udf((xs: Seq[String]) => xs.size)
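The UDF can then be applied the same way as the built-in function (a usage sketch, not from the original answer):

// Applying the UDF to the tk column; the result is equivalent to size($"tk")
df.select(size_($"tk"))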

or even create a custom expression, but there is really no point in that.


Answered by Srini

One way is to access the array elements using SQL, as shown below.


df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")

df2.show()
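Note that registerTempTable was deprecated in Spark 2.0; on newer versions the equivalent (my addition, not part of the original answer) is createOrReplaceTempView, assuming a SparkSession named spark:

// Spark 2.x+ equivalent of the snippet above
df.createOrReplaceTempView("tab1")
val df2 = spark.sql("select tk[0], tk[1] from tab1")
df2.show()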

To get the size of the array column:


val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()

If your Spark version is older, you can use HiveContext instead of Spark's SQLContext.

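For illustration, a sketch of how that would look on old Spark 1.x versions, where size() may only be available through HiveContext (HiveContext takes a SparkContext; a SparkContext named sc is assumed to be in scope):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// assuming df was created from hiveContext, so the temp table is visible to it
df.registerTempTable("tab1")
val df3 = hiveContext.sql("select size(tk) from tab1")
df3.show()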

I would also try something that traverses the array.

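One hedged reading of "something that traverses": explode each array into one row per element and count the rows per record. The explode/groupBy approach below is my interpretation, not the original answer's, and it assumes rawFV uniquely identifies each row:

import org.apache.spark.sql.functions.explode

// explode() produces one output row per array element,
// so counting rows per rawFV yields the array length.
// Caveat: rows whose tk array is empty are dropped by explode().
df.select($"rawFV", explode($"tk").as("token"))
  .groupBy("rawFV")
  .count()
  .show()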