
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30381359/

Date: 2020-10-22 07:10:51  Source: igfitidea

How to use Spark SQL DataFrame with flatMap?

scala, apache-spark, apache-spark-sql

Asked by Yuri Brovman

I am using the Spark Scala API. I have a Spark SQL DataFrame (read from an Avro file) with the following schema:


root
|-- ids: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: integer
|    |    |-- value: string (valueContainsNull = true)
|-- match: array (nullable = true)
|    |-- element: integer (containsNull = true)

Essentially 2 columns [ ids: List[Map[Int, String]], match: List[Int] ]. Sample data that looks like:


[List(Map(1 -> a), Map(2 -> b), Map(3 -> c), Map(4 -> d)),List(0, 0, 1, 0)]
[List(Map(5 -> c), Map(6 -> a), Map(7 -> e), Map(8 -> d)),List(1, 0, 1, 0)]
...

What I would like to do is flatMap() each row to produce 3 columns [id, property, match]. Using the above 2 rows as the input data we would get:


[1,a,0]
[2,b,0]
[3,c,1]
[4,d,0]
[5,c,1]
[6,a,0]
[7,e,1]
[8,d,0]
...

and then groupBy the String property (e.g. a, b, ...) to produce count("property") and sum("match"):


 a    2    0
 b    1    0
 c    2    2
 d    2    0
 e    1    1

I would want to do something like:


val result = myDataFrame.select("ids","match").flatMap( 
    (row: Row) => row.getList[Map[Int,String]](1).toArray() )
result.groupBy("property").agg(Map(
    "property" -> "count",
    "match" -> "sum" ) )

The problem is that the flatMap converts the DataFrame to an RDD. Is there a good way to do a flatMap-type operation followed by a groupBy using DataFrames?


Answered by David Griffin

What does flatMap do that you want? It converts each input row into 0 or more rows. It can filter them out, or it can add new ones. In SQL, to get the same functionality you use join. Can you do what you want to do with a join?


Alternatively, you could also look at Dataframe.explode, which is just a specific kind of join (you can easily craft your own explode by joining a DataFrame to a UDF). explode takes a single column as input and lets you split it or convert it into multiple values, and then joins the original row back onto the new rows. So:


user      groups
griffin   mkt,it,admin

Could become:


user      group
griffin   mkt
griffin   it
griffin   admin

So I would say take a look at DataFrame.explode, and if that doesn't get you there easily, try joins with UDFs.

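The row-multiplying behaviour of explode described above can be sketched with plain Scala collections (a minimal sketch, not Spark code; the user/groups sample mirrors the table above): each input row yields one output row per comma-separated group.

```scala
// One input row per user, with a comma-joined groups string.
val rows = Seq(("griffin", "mkt,it,admin"))

// flatMap emits one (user, group) pair per group: 1 row in, 3 rows out.
val exploded = rows.flatMap { case (user, groups) =>
  groups.split(",").map(group => (user, group))
}
// exploded: Seq(("griffin","mkt"), ("griffin","it"), ("griffin","admin"))
```

In Spark the same effect comes from exploding the split column, which keeps everything inside the DataFrame API rather than dropping to an RDD.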

Answered by Holden

My SQL is a bit rusty, but one option is, in your flatMap, to produce a list of Row objects, and then you can convert the resulting RDD back into a DataFrame.

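The logic of that flatMap-then-aggregate pipeline can be sketched on plain Scala collections first (Spark omitted; the sample rows below mirror the question's data): zip each Map(id -> property) with its match flag, flatten to (id, property, match) tuples, then group and aggregate. In Spark you would emit the same tuples (or Row objects) from flatMap and convert the resulting RDD back to a DataFrame before the groupBy.

```scala
// Sample rows shaped like the question's data: (ids, match) per row.
val data = Seq(
  (List(Map(1 -> "a"), Map(2 -> "b"), Map(3 -> "c"), Map(4 -> "d")), List(0, 0, 1, 0)),
  (List(Map(5 -> "c"), Map(6 -> "a"), Map(7 -> "e"), Map(8 -> "d")), List(1, 0, 1, 0))
)

// Flatten each row into (id, property, match) tuples.
val flat = data.flatMap { case (ids, matches) =>
  ids.zip(matches).flatMap { case (m, flag) =>
    m.map { case (id, prop) => (id, prop, flag) }
  }
}

// Group by property, producing (property, count, sum(match)).
val agg = flat.groupBy(_._2).toSeq
  .map { case (prop, rows) => (prop, rows.size, rows.map(_._3).sum) }
  .sortBy(_._1)
// agg: Seq(("a",2,0), ("b",1,0), ("c",2,2), ("d",2,0), ("e",1,1))
```

This reproduces the count/sum table from the question; the same grouping keys and aggregates carry over to the DataFrame groupBy/agg once the flattened rows are back in a DataFrame.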