
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30381359/

Date: 2020-10-22 07:10:51  Source: igfitidea

How to use Spark SQL DataFrame with flatMap?

scala, apache-spark, apache-spark-sql

Asked by Yuri Brovman

I am using the Spark Scala API. I have a Spark SQL DataFrame (read from an Avro file) with the following schema:


root
|-- ids: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: integer
|    |    |-- value: string (valueContainsNull = true)
|-- match: array (nullable = true)
|    |-- element: integer (containsNull = true)

Essentially 2 columns [ ids: List[Map[Int, String]], match: List[Int] ]. Sample data that looks like:


[List(Map(1 -> a), Map(2 -> b), Map(3 -> c), Map(4 -> d)),List(0, 0, 1, 0)]
[List(Map(5 -> c), Map(6 -> a), Map(7 -> e), Map(8 -> d)),List(1, 0, 1, 0)]
...

What I would like to do is flatMap() each row to produce 3 columns [id, property, match]. Using the above 2 rows as the input data we would get:


[1,a,0]
[2,b,0]
[3,c,1]
[4,d,0]
[5,c,1]
[6,a,0]
[7,e,1]
[8,d,0]
...

and then groupBy the String property (e.g. a, b, ...) to produce count("property") and sum("match"):


 a    2    0
 b    1    0
 c    2    2
 d    2    0
 e    1    1

I would want to do something like:


val result = myDataFrame.select("ids","match").flatMap( 
    (row: Row) => row.getList[Map[Int,String]](1).toArray() )
result.groupBy("property").agg(Map(
    "property" -> "count",
    "match" -> "sum" ) )

The problem is that the flatMap converts the DataFrame to an RDD. Is there a good way to do a flatMap-type operation followed by a groupBy using DataFrames?


Answered by David Griffin

What does flatMap do that you want? It converts each input row into 0 or more rows. It can filter them out, or it can add new ones. In SQL, to get the same functionality you use join. Can you do what you want to do with a join?


Alternatively, you could also look at Dataframe.explode, which is just a specific kind of join (you can easily craft your own explode by joining a DataFrame to a UDF). explode takes a single column as input and lets you split it or convert it into multiple values, and then joins the original row back onto the new rows. So:


user      groups
griffin   mkt,it,admin

Could become:


user      group
griffin   mkt
griffin   it
griffin   admin

So I would say take a look at DataFrame.explode, and if that doesn't get you there easily, try joins with UDFs.

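The row-multiplying behaviour of explode described above can be sketched with plain Scala collections (a minimal sketch, not Spark code; the user/groups sample mirrors the table above): each input row yields one output row per comma-separated group.

```scala
// One input row per user, with a comma-joined groups string.
val rows = Seq(("griffin", "mkt,it,admin"))

// flatMap emits one (user, group) pair per group: 1 row in, 3 rows out.
val exploded = rows.flatMap { case (user, groups) =>
  groups.split(",").map(group => (user, group))
}
// exploded: Seq(("griffin","mkt"), ("griffin","it"), ("griffin","admin"))
```

In Spark the same effect comes from exploding the split column, which keeps everything inside the DataFrame API rather than dropping to an RDD.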

Answered by Holden

My SQL is a bit rusty, but one option is, in your flatMap, to produce a list of Row objects, and then you can convert the resulting RDD back into a DataFrame.

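The logic of that flatMap-then-aggregate pipeline can be sketched on plain Scala collections first (Spark omitted; the sample rows below mirror the question's data): zip each Map(id -> property) with its match flag, flatten to (id, property, match) tuples, then group and aggregate. In Spark you would emit the same tuples (or Row objects) from flatMap and convert the resulting RDD back to a DataFrame before the groupBy.

```scala
// Sample rows shaped like the question's data: (ids, match) per row.
val data = Seq(
  (List(Map(1 -> "a"), Map(2 -> "b"), Map(3 -> "c"), Map(4 -> "d")), List(0, 0, 1, 0)),
  (List(Map(5 -> "c"), Map(6 -> "a"), Map(7 -> "e"), Map(8 -> "d")), List(1, 0, 1, 0))
)

// Flatten each row into (id, property, match) tuples.
val flat = data.flatMap { case (ids, matches) =>
  ids.zip(matches).flatMap { case (m, flag) =>
    m.map { case (id, prop) => (id, prop, flag) }
  }
}

// Group by property, producing (property, count, sum(match)).
val agg = flat.groupBy(_._2).toSeq
  .map { case (prop, rows) => (prop, rows.size, rows.map(_._3).sum) }
  .sortBy(_._1)
// agg: Seq(("a",2,0), ("b",1,0), ("c",2,2), ("d",2,0), ("e",1,1))
```

This reproduces the count/sum table from the question; the same grouping keys and aggregates carry over to the DataFrame groupBy/agg once the flattened rows are back in a DataFrame.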