Scala: how to extract the column name and data type from a nested struct type in Spark
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Source: http://stackoverflow.com/questions/42129111/
Asked by mahipal
How can I extract the column name and data type from a nested struct type in Spark?
The schema I am getting looks like this:
(events,StructType(
   StructField(beaconType,StringType,true),
   StructField(beaconVersion,StringType,true),
   StructField(client,StringType,true),
   StructField(data,StructType(
      StructField(ad,StructType(
         StructField(adId,StringType,true)
      ),true)
   ),true)
))
I want to convert it into the format below:
Array[(String, String)] = Array(
  (client,StringType), 
  (beaconType,StringType), 
  (beaconVersion,StringType), 
  (phase,StringType)
)
Could you please help with this?
Answered by Tzach Zohar
The question is somewhat unclear, but if you're looking for a way to "flatten" a DataFrame schema (i.e. get an array of all non-struct fields), here's one:
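import org.apache.spark.sql.types._

// Recursively collect every non-struct (leaf) field of the schema
def flatten(schema: StructType): Array[StructField] = schema.fields.flatMap { f =>
  f.dataType match {
    case struct: StructType => flatten(struct) // descend into nested structs
    case _ => Array(f)                         // leaf field: keep it
  }
}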
For example:
val schema = StructType(Seq(StructField("events", 
  StructType(Seq(
    StructField("beaconVersion", IntegerType, true),
    StructField("client", StringType, true),
    StructField("data", StructType(Seq(
      StructField("ad", StructType(Seq(
        StructField("adId", StringType, true)
      )))
    )))
  )))
))
println(flatten(schema).toList)
// List(StructField(beaconVersion,IntegerType,true), StructField(client,StringType,true), StructField(adId,StringType,true))
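To get closer to the Array[(String, String)] shape the question asks for, the same recursion can return name/type pairs instead of StructField objects. The following is only a sketch: flattenWithPath and the nested example schema are names made up for illustration, the dotted-path prefix is an assumption about what is useful (drop it to keep only the leaf name), and it does not descend into arrays or maps of structs.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of the original answer): returns
// (dotted field path, data type name) for every non-struct field.
def flattenWithPath(schema: StructType, prefix: String = ""): Array[(String, String)] =
  schema.fields.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case struct: StructType => flattenWithPath(struct, path) // recurse into nested struct
      case other              => Array((path, other.toString)) // leaf: record name and type
    }
  }

// Small illustrative schema modelled on the one in the question
val nested = StructType(Seq(
  StructField("events", StructType(Seq(
    StructField("client", StringType, true),
    StructField("data", StructType(Seq(
      StructField("ad", StructType(Seq(
        StructField("adId", StringType, true)
      )))
    )))
  )))
))

flattenWithPath(nested).foreach(println)
```

Keeping the full path avoids collisions when two nested structs contain fields with the same name.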
Answered by Thomas Luechtefeld
If you have a DataFrame with a StructType column, i.e.:
df.printSchema() 
// root
// |-- data: struct (nullable = true)
// |    |-- embedded_data: string (nullable = true)
You can extract the subfield embedded_data of the StructType column data as follows:
df.select("data.embedded_data").printSchema()
// root
// |-- embedded_data: string (nullable = true)
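If the goal is the names and types of all subfields of one struct column, selecting "data.*" expands the struct into top-level columns whose schema can then be read directly. This is a minimal sketch: the local SparkSession, the app name, and the two-row sample DataFrame are assumptions made so the snippet is self-contained, and subfields is an illustrative name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumed local session, only so the example can run on its own
val spark = SparkSession.builder().master("local[*]").appName("nested-fields").getOrCreate()
import spark.implicits._

// Build a DataFrame with a struct column `data` holding one string subfield
val df = Seq("a", "b").toDF("embedded_data")
  .select(struct(col("embedded_data")).as("data"))

// Expand the struct and read (name, type) pairs from the resulting schema
val subfields = df.select("data.*").schema.fields.map(f => (f.name, f.dataType.toString))
subfields.foreach(println)
```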
Answered by user2458922
Let's say you have two DataFrames, df1 and df2, and you want to compare their fields:
df1.schema.foreach { field1 =>
  df2.schema.foreach { field2 =>
    // Compare the field names
    if (field1.name == field2.name) {
      // Compare the data types of the same-named fields
      println(field1.dataType == field2.dataType)
    }
  }
}
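The nested loop checks every pair of fields and prints bare booleans without saying which field they belong to. One possible variation, sketched here, indexes one schema by field name and returns a map from each shared field name to whether the types match. compareByName, s1, s2 and result are hypothetical names, and the two schemas are invented for the example; with real DataFrames you would pass df1.schema and df2.schema.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: for every field name present in both schemas,
// report whether the two data types are equal.
def compareByName(left: StructType, right: StructType): Map[String, Boolean] = {
  val rightTypes = right.fields.map(f => f.name -> f.dataType).toMap
  left.fields.collect {
    case f if rightTypes.contains(f.name) =>
      f.name -> (f.dataType == rightTypes(f.name))
  }.toMap
}

// Invented sample schemas: `id` differs in type, `extra` exists on one side only
val s1 = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val s2 = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("extra", StringType)
))

val result = compareByName(s1, s2)
result.foreach { case (name, same) => println(s"$name: same type = $same") }
```

Like the loop above, this only looks at top-level fields; nested structs are compared as whole types.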

