
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/40602606/

Date: 2020-10-22 08:50:47  Source: igfitidea

How to get keys and values from MapType column in SparkSQL DataFrame

scala apache-spark dataframe apache-spark-sql apache-spark-dataset

Asked by lloydh

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>.


It is read into a data frame in sparkSQL and the schema looks like this:


scala> alphaDF.printSchema()
root
 |-- object_id: string (nullable = true)
 |-- ALPHA: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)

I am using Spark 2.0 and I am trying to create a new data frame in which the columns need to be object_id plus the keys of the ALPHA map, as in object_id, key1, key2, key3, ...


I was first trying to see if I could at least access the map like this:


scala> alphaDF.map(a => a(0)).collect()
<console>:32: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other
types will be added in future releases.
   alphaDF.map(a => a(0)).collect()

but unfortunately I can't seem to be able to figure out how to access the keys of the map.


Can someone please show me a way to get the object_id plus the map keys as column names and the map values as the respective values in a new dataframe?

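The reshaping being asked for can be sketched first on plain Scala collections (hypothetical sample data, outside Spark), which makes the target shape concrete before dealing with encoders:

```scala
// Hypothetical sample: each record has an id and a map of key -> value.
val records = Seq(
  (1, Map("key1" -> "a", "key2" -> "b")),
  (2, Map("key1" -> "c"))
)

// Union of all keys across records, sorted for a stable column order.
val allKeys = records.flatMap(_._2.keys).distinct.sorted

// Each output row: the id followed by the value for every key (None if absent).
val rows = records.map { case (id, m) => id +: allKeys.map(m.get) }
// rows: Seq(Seq(1, Some("a"), Some("b")), Seq(2, Some("c"), None))
```

The accepted answer below performs exactly this: collect the distinct keys, then project one column per key.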

Answered by zero323

Spark >= 2.3


You can simplify the process using the map_keys function:


import org.apache.spark.sql.functions.map_keys

There is also a map_values function, but it won't be directly useful here.


Spark < 2.3


The general method can be expressed in a few steps. First, the required imports:


import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row

and example data:


// toDF requires import spark.implicits._ (predefined in spark-shell)
val ds = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
).toDF("id", "alpha")

To extract keys we can use UDF (Spark < 2.3)


val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)
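The function wrapped by the UDF is ordinary Scala; applied to a plain map it simply returns the key collection as a sequence (illustration only, outside Spark):

```scala
// Same logic as the UDF body, on an ordinary Scala map.
val keysOf: Map[String, Any] => Seq[String] = _.keys.toSeq

keysOf(Map("foo" -> 1, "bar" -> 2))   // the two keys, in unspecified order
```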

or built-in functions


import org.apache.spark.sql.functions.map_keys

val keysDF = ds.select(map_keys($"alpha"))

Find distinct ones:


val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
  .collect.sorted

You can also generalize key extraction with explode:


import org.apache.spark.sql.functions.explode

val distinctKeys = ds
  // Flatten the column into key, value columns
  .select(explode($"alpha"))
  .select($"key")
  .as[String].distinct
  .collect.sorted
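As a sanity check, the same distinct-key logic on plain Scala maps mirroring the example data above (flatMap over the key sets plays the role of explode; illustration only):

```scala
// Plain-Scala analogue of the explode-based key extraction.
val alpha = Seq(
  Map("foo" -> (1, "a"), "bar" -> (2, "b")),
  Map("foo" -> (3, "c")),
  Map("bar" -> (4, "d"))
)

val distinctKeys = alpha.flatMap(_.keys).distinct.sorted
// distinctKeys: Seq("bar", "foo")
```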

And select:


ds.select($"id" +: distinctKeys.map(x => $"alpha".getItem(x).alias(x)): _*)
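On the example data this yields one column per distinct key ("bar", "foo"), holding the struct value or null where a key is absent. The same pivot can be mirrored on plain Scala collections (illustration only; Option stands in for the nullable columns that getItem produces):

```scala
// Hypothetical mirror of ds: an id plus a map, pivoted over the sorted keys.
val data = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
)
val keys = Seq("bar", "foo")

// getItem(k) returns null for a missing key; Map#get gives Option here instead.
val pivoted = data.map { case (id, m) => (id, keys.map(m.get)) }
// pivoted(1): (2, Seq(None, Some((3, "c"))))
```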

Answered by Hailin FU

And if you are in PySpark, I just found an easy implementation:


from pyspark.sql.functions import map_keys

alphaDF.select(map_keys("ALPHA").alias("keys")).show()

You can check the details here
