scala - Spark map dataframe using the dataframe's schema

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow. Original address: http://stackoverflow.com/questions/37485536/

Date: 2020-10-22 08:19:54  Source: igfitidea

Spark map dataframe using the dataframe's schema

scala, apache-spark, apache-spark-sql

Asked by Havnar

I have a dataframe, created from a JSON object. I can query this dataframe and write it to parquet.


Since I infer the schema, I don't necessarily know what's in the dataframe.


Is there a way to get the column names out, or to map the dataframe using its own schema?


// The results of SQL queries are DataFrames and support all the normal  RDD operations.
// The columns of a row in the result can be accessed by field index:
df.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
df.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
df.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

I would want to do something like


df.map(_.getValuesMap[Any](ListAll())).collect().foreach(println)
// Map ("name" -> "Justin", "age" -> 19, "color" -> "red")

without knowing the actual amount or names of the columns.

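For reference, the inferred schema itself is inspectable at runtime, which is what the answers below build on; a minimal sketch, assuming a local SparkSession and made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and sample data, only to make the sketch self-contained.
val spark = SparkSession.builder().master("local[1]").appName("schema-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Column names and types are available even when the schema was inferred:
val names: Array[String] = df.columns
df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
```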

Answered by zero323

Well, you can, but the result is rather useless:


import org.apache.spark.sql.Row
import spark.implicits._  // assumes an active SparkSession in scope as `spark`

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Pull the named columns out of a Row into a Map, keyed by column name
def getValues(row: Row, names: Seq[String]) = names.map(
  name => name -> row.getAs[Any](name)
).toMap

val names = df.columns
df.rdd.map(getValues(_, names)).first

// scala.collection.immutable.Map[String,Any] = 
//   Map(name -> Justin, age -> 19, color -> red)

To get something actually useful one would need a proper mapping between SQL types and Scala types. It is not hard in simple cases, but it is hard in general. For example, there is no built-in Scala type which can be used to represent an arbitrary struct. This can be done using a little bit of meta-programming, but arguably it is not worth all the fuss.
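To make the point concrete, here is a deliberately partial sketch of what such a mapping could look like; `scalaTypeName` is a made-up helper for illustration, and nested structs are exactly where it breaks down:

```scala
import org.apache.spark.sql.types._

// A partial SQL-to-Scala type mapping (hypothetical helper, not a Spark API).
def scalaTypeName(dt: DataType): String = dt match {
  case StringType       => "String"
  case IntegerType      => "Int"
  case LongType         => "Long"
  case DoubleType       => "Double"
  case BooleanType      => "Boolean"
  case ArrayType(e, _)  => s"Seq[${scalaTypeName(e)}]"
  case MapType(k, v, _) => s"Map[${scalaTypeName(k)}, ${scalaTypeName(v)}]"
  case _: StructType    => "Row" // no dedicated Scala type for an arbitrary struct
  case _                => "Any"
}
```

Atomic types map cleanly; an arbitrary StructType has no natural Scala counterpart short of generating a case class, which is the meta-programming the answer alludes to.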


Answered by Nir Hedvat

You could use an implicit Encoder and perform the map on the DataFrame itself:


import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

implicit class DataFrameEnhancer(df: DataFrame) extends Serializable {
    // Encoder matching the two-column rows produced below (original schema order kept)
    implicit val encoder: Encoder[Row] =
      RowEncoder(StructType(df.schema.filter(f => Seq("name", "age").contains(f.name))))

    def mapNameAndAge(): DataFrame = {
       df.map(row => Row(row.getAs[String]("name"), row.getAs[Int]("age")))
    }
}

And invoke it on your dataframe as such:


val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")
df.mapNameAndAge().first

That way, you don't have to convert your DataFrame into an RDD (in some cases you don't want to load the entire DF from disk, just some columns, but the RDD conversion forces you into doing that anyway). Plus, you're using an Encoder instead of Kryo (or another Java SerDes), which is much faster.

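To illustrate the trade-off being described (the API calls here are standard Spark, but the tiny local session and data are made up):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local setup, only to contrast the two code paths.
val spark = SparkSession.builder().master("local[1]").appName("encoder-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Dataset path: stays inside the SQL engine, values go through Encoders.
val dsNames = df.map(_.getAs[String]("name"))       // Dataset[String]

// RDD path: leaves the optimized plan; rows use generic Java/Kryo serialization.
val rddNames = df.rdd.map(_.getAs[String]("name"))  // RDD[String]
```

Both produce the same values; the Dataset version keeps Catalyst in the loop, which is what the answer means by the Encoder path being faster.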

Hope it helps :-)
