Scala: how to extract the column names and data types from a nested struct type in Spark

Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/42129111/

Time: 2020-10-22 09:04:28  Source: igfitidea

Tags: scala, apache-spark

Asked by mahipal

How can I extract the column names and data types from a nested struct type in Spark?

The schema looks like this:

(events,StructType(
   StructField(beaconType,StringType,true),
   StructField(beaconVersion,StringType,true),
   StructField(client,StringType,true),
   StructField(data,StructType(
      StructField(ad,StructType(
         StructField(adId,StringType,true)
      ))
   ))
))

I want to convert it into the format below:

Array[(String, String)] = Array(
  (client,StringType),
  (beaconType,StringType),
  (beaconVersion,StringType),
  (phase,StringType)
)

Could you please help with this?

Answered by Tzach Zohar

The question is somewhat unclear, but if you're looking for a way to "flatten" a DataFrame schema (i.e. get an array of all non-struct fields), here's one:

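import org.apache.spark.sql.types.{StructField, StructType}

// Recursively collect every leaf (non-struct) field of the schema.
def flatten(schema: StructType): Array[StructField] = schema.fields.flatMap { f =>
  f.dataType match {
    case struct: StructType => flatten(struct) // descend into nested structs
    case _ => Array(f)                         // leaf field: keep it as-is
  }
}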

For example:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(StructField("events",
  StructType(Seq(
    StructField("beaconVersion", IntegerType, true),
    StructField("client", StringType, true),
    StructField("data", StructType(Seq(
      StructField("ad", StructType(Seq(
        StructField("adId", StringType, true)
      )))
    )))
  )))
))

println(flatten(schema).toList)
// List(StructField(beaconVersion,IntegerType,true), StructField(client,StringType,true), StructField(adId,StringType,true))
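
The question asked for (name, type) pairs rather than StructField objects. A minimal sketch of that last step, assuming it is pasted into spark-shell or a similar REPL; `nameAndType` is a hypothetical helper name, not part of Spark, and `flatten` is repeated only so the block is self-contained:

```scala
import org.apache.spark.sql.types._

// Same recursive flatten as in the answer, repeated for self-containment.
def flatten(schema: StructType): Array[StructField] = schema.fields.flatMap { f =>
  f.dataType match {
    case struct: StructType => flatten(struct)
    case _ => Array(f)
  }
}

// Hypothetical helper: turn each leaf field into a (name, type-as-string) pair.
def nameAndType(schema: StructType): Array[(String, String)] =
  flatten(schema).map(f => (f.name, f.dataType.toString))

val schema = StructType(Seq(StructField("events",
  StructType(Seq(
    StructField("beaconVersion", IntegerType, true),
    StructField("client", StringType, true),
    StructField("data", StructType(Seq(
      StructField("ad", StructType(Seq(
        StructField("adId", StringType, true)
      )))
    )))
  )))
))

val pairs = nameAndType(schema)
pairs.foreach(println)
```

Only leaf names are kept here, so two nested fields with the same name in different structs would be indistinguishable; carrying a path prefix through the recursion would be one way to address that.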

Answered by Thomas Luechtefeld

If you have a DataFrame with a StructType column, e.g.:

df.printSchema() 
// root
// |-- data: struct (nullable = true)
// |    |-- embedded_data: string (nullable = true)

You can extract the subfield embedded_data of the StructType column data as follows:

df.select("data.embedded_data").printSchema()
// root
// |-- embedded_data: string (nullable = true)
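
If only the names and types of the subfields are needed, the schema can also be inspected directly, without selecting any data. A small sketch under the assumption that the schema matches the one printed above; it is built by hand here so the block runs without a DataFrame, and with a real DataFrame `df.schema` would take its place:

```scala
import org.apache.spark.sql.types._

// Hand-built stand-in for df.schema (assumed to mirror the printSchema output).
val schema = StructType(Seq(
  StructField("data", StructType(Seq(
    StructField("embedded_data", StringType, nullable = true)
  )), nullable = true)
))

// Look up the "data" column by name and list its subfields as (name, type) pairs.
val inner: List[(String, DataType)] = schema("data").dataType match {
  case s: StructType => s.fields.map(f => (f.name, f.dataType)).toList
  case _             => Nil // "data" is not a struct, so it has no subfields
}

inner.foreach(println)
```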

Answered by user2458922

Let's say you have two DataFrames, df1 and df2, and you want to compare their fields:

// For every pair of fields that share a name, print whether their data types match.
df1.schema.foreach { field1 =>
  df2.schema.foreach { field2 =>
    if (field1.name == field2.name) {             // compare the names
      println(field1.dataType == field2.dataType) // compare the data types
    }
  }
}
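
The same comparison can be sketched with a lookup by name instead of nested loops, which also records which field each result belongs to. `compareTypes` is a hypothetical helper name, and the two schemas below are made-up stand-ins for `df1.schema` and `df2.schema`; only top-level fields are compared:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: for each field name present in both schemas,
// report whether the two data types are equal.
def compareTypes(a: StructType, b: StructType): Map[String, Boolean] = {
  val bTypes = b.fields.map(f => f.name -> f.dataType).toMap
  a.fields.collect {
    case f if bTypes.contains(f.name) => f.name -> (f.dataType == bTypes(f.name))
  }.toMap
}

// Made-up schemas standing in for df1.schema and df2.schema.
val s1 = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("extra", DoubleType)
))
val s2 = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", IntegerType)
))

val result = compareTypes(s1, s2)
println(result)
```

Fields that exist in only one schema ("extra" here) are left out of the result; nested structs are compared as whole types rather than field by field.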