scala - Spark map dataframe using the dataframe's schema

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow. Original address: http://stackoverflow.com/questions/37485536/

Date: 2020-10-22 08:19:54  Source: igfitidea

Spark map dataframe using the dataframe's schema

scala, apache-spark, apache-spark-sql

Asked by Havnar

I have a dataframe, created from a JSON object. I can query this dataframe and write it to parquet.


Since I infer the schema, I don't necessarily know what's in the dataframe.


Is there a way to get the column names out, or to map the dataframe using its own schema?


// The results of SQL queries are DataFrames and support all the normal  RDD operations.
// The columns of a row in the result can be accessed by field index:
df.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
df.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
df.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

I would want to do something like


df.map(_.getValuesMap[Any](ListAll())).collect().foreach(println)
// Map ("name" -> "Justin", "age" -> 19, "color" -> "red")

without knowing the actual amount or names of the columns.

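For reference, the inferred schema itself is inspectable at runtime, which is what the answers below build on; a minimal sketch, assuming a local SparkSession and made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and sample data, only to make the sketch self-contained.
val spark = SparkSession.builder().master("local[1]").appName("schema-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Column names and types are available even when the schema was inferred:
val names: Array[String] = df.columns
df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
```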

Answered by zero323

Well, you can, but the result is rather useless:


import org.apache.spark.sql.Row
import spark.implicits._  // assumes an active SparkSession in scope as `spark`

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Pull the named columns out of a Row into a Map, keyed by column name
def getValues(row: Row, names: Seq[String]) = names.map(
  name => name -> row.getAs[Any](name)
).toMap

val names = df.columns
df.rdd.map(getValues(_, names)).first

// scala.collection.immutable.Map[String,Any] = 
//   Map(name -> Justin, age -> 19, color -> red)

To get something actually useful one would need a proper mapping between SQL types and Scala types. It is not hard in simple cases, but it is hard in general. For example, there is no built-in Scala type which can be used to represent an arbitrary struct. This can be done using a little bit of meta-programming, but arguably it is not worth all the fuss.
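To make the point concrete, here is a deliberately partial sketch of what such a mapping could look like; `scalaTypeName` is a made-up helper for illustration, and nested structs are exactly where it breaks down:

```scala
import org.apache.spark.sql.types._

// A partial SQL-to-Scala type mapping (hypothetical helper, not a Spark API).
def scalaTypeName(dt: DataType): String = dt match {
  case StringType       => "String"
  case IntegerType      => "Int"
  case LongType         => "Long"
  case DoubleType       => "Double"
  case BooleanType      => "Boolean"
  case ArrayType(e, _)  => s"Seq[${scalaTypeName(e)}]"
  case MapType(k, v, _) => s"Map[${scalaTypeName(k)}, ${scalaTypeName(v)}]"
  case _: StructType    => "Row" // no dedicated Scala type for an arbitrary struct
  case _                => "Any"
}
```

Atomic types map cleanly; an arbitrary StructType has no natural Scala counterpart short of generating a case class, which is the meta-programming the answer alludes to.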


Answered by Nir Hedvat

You could use an implicit Encoder and perform the map on the DataFrame itself:


import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

implicit class DataFrameEnhancer(df: DataFrame) extends Serializable {
    // Encoder matching the two-column rows produced below (original schema order kept)
    implicit val encoder: Encoder[Row] =
      RowEncoder(StructType(df.schema.filter(f => Seq("name", "age").contains(f.name))))

    def mapNameAndAge(): DataFrame = {
       df.map(row => Row(row.getAs[String]("name"), row.getAs[Int]("age")))
    }
}

And invoke it on your dataframe as such:


val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")
df.mapNameAndAge().first

That way, you don't have to convert your DataFrame into an RDD (in some cases you don't want to load the entire DF from disk, just some columns, but the RDD conversion forces you into doing that anyway). Plus, you're using an Encoder instead of Kryo (or another Java SerDes), which is much faster.

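To illustrate the trade-off being described (the API calls here are standard Spark, but the tiny local session and data are made up):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local setup, only to contrast the two code paths.
val spark = SparkSession.builder().master("local[1]").appName("encoder-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Dataset path: stays inside the SQL engine, values go through Encoders.
val dsNames = df.map(_.getAs[String]("name"))       // Dataset[String]

// RDD path: leaves the optimized plan; rows use generic Java/Kryo serialization.
val rddNames = df.rdd.map(_.getAs[String]("name"))  // RDD[String]
```

Both produce the same values; the Dataset version keeps Catalyst in the loop, which is what the answer means by the Encoder path being faster.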

Hope it helps :-)
