scala: How to extract column names and data types from a nested struct type in Spark

Question

Asked by mahipal

How can I extract the column names and data types from a nested struct type in Spark?

The schema looks like this:

(events,StructType(
   StructField(beaconType,StringType,true),     
   StructField(beaconVersion,StringType,true), 
   StructField(client,StringType,true), 
   StructField(data,StructType(
      StructField(ad,StructType(
         StructField(adId,StringType,true)
      )
   )
)

I want to convert it into the following format:

Array[(String, String)] = Array(
  (client,StringType),
  (beaconType,StringType),
  (beaconVersion,StringType),
  (phase,StringType)
)

Could you please help with this?

Answer 1

Answered by Tzach Zohar

The question is somewhat unclear, but if you're looking for a way to "flatten" a DataFrame schema (i.e. get an array of all non-struct fields), here's one:

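import org.apache.spark.sql.types._

// Recursively collect all non-struct (leaf) fields of a schema
def flatten(schema: StructType): Array[StructField] = schema.fields.flatMap { f =>
  f.dataType match {
    case struct: StructType => flatten(struct) // descend into nested structs
    case _ => Array(f)                         // keep leaf fields as they are
  }
}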

For example:

val schema = StructType(Seq(StructField("events", 
  StructType(Seq(
    StructField("beaconVersion", IntegerType, true),
    StructField("client", StringType, true),
    StructField("data", StructType(Seq(
      StructField("ad", StructType(Seq(
        StructField("adId", StringType, true)
      )))
    )))
  )))
))

println(flatten(schema).toList)
// List(StructField(beaconVersion,IntegerType,true), StructField(client,StringType,true), StructField(adId,StringType,true))
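
To get closer to the Array[(String, String)] format requested in the question, the same recursion can return (name, type) pairs instead of StructField objects. The following is a minimal sketch and not part of the original answer: the helper name flattenWithPath and the dotted-path naming are assumptions made for illustration. The parent path is kept because leaf names alone can collide when two structs contain a field with the same name.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not from the original answer): walks the schema
// recursively and returns (dotted column path, data type name) pairs
// for every non-struct field.
def flattenWithPath(schema: StructType, prefix: String = ""): Array[(String, String)] =
  schema.fields.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case struct: StructType => flattenWithPath(struct, path)
      case other              => Array((path, other.toString))
    }
  }

// Small illustrative schema, modeled on the one in the question
val nested = StructType(Seq(
  StructField("events", StructType(Seq(
    StructField("client", StringType, true),
    StructField("data", StructType(Seq(
      StructField("adId", StringType, true)
    )))
  )))
))

flattenWithPath(nested).foreach(println)
```

If only the leaf names are wanted, as in the question, use f.name instead of path. Replacing other.toString with other.typeName should give the short lowercase names (for example "string") instead.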

Answer 2

Answered by Thomas Luechtefeld

If you have a DataFrame with a StructType column, e.g.:

df.printSchema() 
// root
// |-- data: struct (nullable = true)
// |    |-- embedded_data: string (nullable = true)

You can extract the subfield embedded_data of the StructType column data as follows:

df.select("data.embedded_data").printSchema()
// root
// |-- embedded_data: string (nullable = true)
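
The same dotted syntax can also expand every subfield at once, and the struct's own schema can be read to list the child names and types. The following is a rough sketch under stated assumptions: the local SparkSession, the toy DataFrame, and the extra count column are all invented here for illustration and do not come from the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.types.StructType

// Assumption: a local session and toy data, used only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("nested-demo").getOrCreate()
import spark.implicits._

// Build a DataFrame with a single struct column named "data"
val df = Seq(("a", 1)).toDF("embedded_data", "count")
  .select(struct($"embedded_data", $"count").as("data"))

// "data.*" promotes all direct children of the struct to top-level columns
val expanded = df.select("data.*")
expanded.printSchema()

// (name, type) pairs for the direct children of the struct column
val subfields: Array[(String, String)] = df.schema("data").dataType match {
  case s: StructType => s.fields.map(f => (f.name, f.dataType.toString))
  case _             => Array.empty[(String, String)]
}
subfields.foreach(println)
```

Note that "data.*" only expands one level; deeper structs stay nested unless they are expanded in turn.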

Answer 3

Answered by user2458922

Let's say you have two DataFrames, df1 and df2, and you want to compare their fields:

// Compare the fields of df1 and df2 that share the same name
df1.schema.foreach { field1 =>
  df2.schema.foreach { field2 =>
    if (field1.name == field2.name) {
      // For matching names, print whether the data types are equal
      println(s"${field1.name}: ${field1.dataType == field2.dataType}")
    }
  }
}
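
The nested loops above compare every field of one schema against every field of the other. As an alternative sketch (not from the original answer), one schema can be indexed by field name first so that each field is looked up once. The helper name compareSchemas and the sample schemas s1 and s2 are hypothetical, and only top-level fields are compared.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: for every top-level field name present in both schemas,
// report whether the data types match.
def compareSchemas(left: StructType, right: StructType): Seq[(String, Boolean)] = {
  // Index the right-hand schema by field name for direct lookup
  val rightTypes = right.fields.map(f => f.name -> f.dataType).toMap
  left.fields.toSeq.flatMap { f =>
    // Fields missing from the right-hand schema are skipped
    rightTypes.get(f.name).map(t => (f.name, f.dataType == t))
  }
}

// Illustrative schemas; with real DataFrames pass df1.schema and df2.schema
val s1 = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val s2 = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))

compareSchemas(s1, s2).foreach(println)
// (id,false)
// (name,true)
```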
