Convert Row to map in spark scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46155500/
Asked by Sorin Bolos
I have a row from a data frame and I want to convert it to a Map[String, Any] that maps column names to the values in the row for that column.
Is there an easy way to do it?
I did it for string values like
import org.apache.spark.sql.Row

def rowToMap(row: Row): Map[String, String] = {
  row.schema.fieldNames.map(field => field -> row.getAs[String](field)).toMap
}
val myRowMap = rowToMap(myRow)
If the row contains values of other types, not specific ones like String, then the code gets messier because the row does not have a method .get(field)
Any ideas?
Answered by Psidom
You can use getValuesMap:
val df = Seq((1, 2.0, "a")).toDF("A", "B", "C")
val row = df.first
To get Map[String, Any]:
row.getValuesMap[Any](row.schema.fieldNames)
// res19: Map[String,Any] = Map(A -> 1, B -> 2.0, C -> a)
Or you can get Map[String, AnyVal] for this simple case, since the values are not complex objects
row.getValuesMap[AnyVal](row.schema.fieldNames)
// res20: Map[String,AnyVal] = Map(A -> 1, B -> 2.0, C -> a)
Note: the returned value type of getValuesMap can be labelled as any type, so you cannot rely on it to figure out what data types you have; you need to keep in mind what the columns contain from the beginning instead.
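This is ordinary JVM type erasure rather than anything Spark-specific. A minimal plain-Scala sketch (the getAs helper below is a hypothetical stand-in for Row.getAs, not Spark code) shows why a wrong type label only fails once a value is actually used:

```scala
// Hypothetical stand-in for Row.getAs: the cast to T is erased at runtime.
def getAs[T](values: Map[String, Any], key: String): T =
  values(key).asInstanceOf[T]

val values = Map("A" -> 1, "B" -> 2.0, "C" -> "a")

// Labelling every value as String succeeds, because inside a generic
// method the cast is erased: the Int under "A" is not converted or checked.
val labeled: Map[String, String] =
  values.keys.map(k => k -> getAs[String](values, k)).toMap

labeled("C").toUpperCase   // fine: the value really is a String
// labeled("A").toUpperCase   // would throw ClassCastException at the use site
```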
Answered by Ramesh Maharjan
You can convert your dataframe to an rdd, use a simple map function that builds the Map from the header names inside it, and finally use collect
val fn = df.schema.fieldNames
val maps = df.rdd.map(row => fn.map(field => field -> row.getAs[Any](field)).toMap).collect()
Answered by Naman Agarwal
Let's say you have a data frame with these columns:
[time(TimeStampType), col1(DoubleType), col2(DoubleType)]
You can do something like this:
import java.sql.Timestamp

val modifiedDf = df.map { row =>
  val doubleObject = row.getValuesMap[Double](Seq("col1", "col2"))
  val timeObject = Map("time" -> row.getAs[Timestamp]("time"))
  doubleObject ++ timeObject
}
Answered by Schmitzi
Let's say you have a row without structure information and the column header as an array.
import org.apache.spark.sql.Row

val rdd = sc.parallelize(Seq(Row("test1", "val1"), Row("test2", "val2"), Row("test3", "val3"), Row("test4", "val4")))
rdd.collect.foreach(println)

val sparkFieldNames = Array("col1", "col2")
val mapRDD = rdd.map(r => sparkFieldNames.zip(r.toSeq).toMap)
mapRDD.collect.foreach(println)
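The zipping step itself does not need Spark at all; a minimal plain-Scala sketch of the same idea, where the inner Seqs are made-up stand-ins for what Row.toSeq would return:

```scala
// Zip the column headers with each row's values to build one Map per row.
val sparkFieldNames = Array("col1", "col2")
val rows: Seq[Seq[Any]] = Seq(Seq("test1", "val1"), Seq("test2", "val2"))

val maps = rows.map(r => sparkFieldNames.zip(r).toMap)
maps.foreach(println)
// Map(col1 -> test1, col2 -> val1)
// Map(col1 -> test2, col2 -> val2)
```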

