
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/40602606/

Date: 2020-10-22 08:50:47  Source: igfitidea

How to get keys and values from MapType column in SparkSQL DataFrame

scala apache-spark dataframe apache-spark-sql apache-spark-dataset

Asked by lloydh

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>.


It is read into a data frame in sparkSQL and the schema looks like this:


scala> alphaDF.printSchema()
root
 |-- object_id: string (nullable = true)
 |-- ALPHA: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)

I am using Spark 2.0 and I am trying to create a new data frame in which the columns need to be object_id plus the keys of the ALPHA map, as in object_id, key1, key2, key3, ...


I was first trying to see if I could at least access the map like this:


scala> alphaDF.map(a => a(0)).collect()
<console>:32: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other
types will be added in future releases.
   alphaDF.map(a => a(0)).collect()

but unfortunately I can't seem to be able to figure out how to access the keys of the map.


Can someone please show me a way to get the object_id plus the map keys as column names and the map values as the respective values in a new dataframe?

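The reshaping being asked for can be sketched first on plain Scala collections (hypothetical sample data, outside Spark), which makes the target shape concrete before dealing with encoders:

```scala
// Hypothetical sample: each record has an id and a map of key -> value.
val records = Seq(
  (1, Map("key1" -> "a", "key2" -> "b")),
  (2, Map("key1" -> "c"))
)

// Union of all keys across records, sorted for a stable column order.
val allKeys = records.flatMap(_._2.keys).distinct.sorted

// Each output row: the id followed by the value for every key (None if absent).
val rows = records.map { case (id, m) => id +: allKeys.map(m.get) }
// rows: Seq(Seq(1, Some("a"), Some("b")), Seq(2, Some("c"), None))
```

The accepted answer below performs exactly this: collect the distinct keys, then project one column per key.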

Answered by zero323

Spark >= 2.3


You can simplify the process using the map_keys function:


import org.apache.spark.sql.functions.map_keys

There is also a map_values function, but it won't be directly useful here.


Spark < 2.3


The general method can be expressed in a few steps. First, the required imports:


import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row

and example data:


// toDF requires import spark.implicits._ (predefined in spark-shell)
val ds = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
).toDF("id", "alpha")

To extract keys we can use UDF (Spark < 2.3)


val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)
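The function wrapped by the UDF is ordinary Scala; applied to a plain map it simply returns the key collection as a sequence (illustration only, outside Spark):

```scala
// Same logic as the UDF body, on an ordinary Scala map.
val keysOf: Map[String, Any] => Seq[String] = _.keys.toSeq

keysOf(Map("foo" -> 1, "bar" -> 2))   // the two keys, in unspecified order
```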

or built-in functions


import org.apache.spark.sql.functions.map_keys

val keysDF = ds.select(map_keys($"alpha"))

Find distinct ones:


val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
  .collect.sorted

You can also generalize key extraction with explode:


import org.apache.spark.sql.functions.explode

val distinctKeys = ds
  // Flatten the column into key, value columns
  .select(explode($"alpha"))
  .select($"key")
  .as[String].distinct
  .collect.sorted
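As a sanity check, the same distinct-key logic on plain Scala maps mirroring the example data above (flatMap over the key sets plays the role of explode; illustration only):

```scala
// Plain-Scala analogue of the explode-based key extraction.
val alpha = Seq(
  Map("foo" -> (1, "a"), "bar" -> (2, "b")),
  Map("foo" -> (3, "c")),
  Map("bar" -> (4, "d"))
)

val distinctKeys = alpha.flatMap(_.keys).distinct.sorted
// distinctKeys: Seq("bar", "foo")
```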

And select:


ds.select($"id" +: distinctKeys.map(x => $"alpha".getItem(x).alias(x)): _*)
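On the example data this yields one column per distinct key ("bar", "foo"), holding the struct value or null where a key is absent. The same pivot can be mirrored on plain Scala collections (illustration only; Option stands in for the nullable columns that getItem produces):

```scala
// Hypothetical mirror of ds: an id plus a map, pivoted over the sorted keys.
val data = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
)
val keys = Seq("bar", "foo")

// getItem(k) returns null for a missing key; Map#get gives Option here instead.
val pivoted = data.map { case (id, m) => (id, keys.map(m.get)) }
// pivoted(1): (2, Seq(None, Some((3, "c"))))
```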

Answered by Hailin FU

And if you are in PySpark, I just found an easy implementation:


from pyspark.sql.functions import map_keys

alphaDF.select(map_keys("ALPHA").alias("keys")).show()

You can check the details here
