Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow address: http://stackoverflow.com/questions/28904856/

How to show the schema (including types) of a Parquet file from the command line or spark-shell?

Tags: scala, apache-spark, parquet

Asked by samthebest

I have worked out how to use the spark-shell to show the field names, but the output is ugly and does not include the types:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

println(sqlContext.parquetFile(path))

This prints:

ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None

Answered by BAR

You should be able to do this:

sqlContext.read.parquet(path).printSchema()

From the Spark docs:

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
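
If the names and types are needed as values rather than as printed text, the same information should be reachable through the DataFrame's `schema`, which is a `StructType`. A minimal sketch, assuming a Spark version that has `org.apache.spark.sql.types` (1.3 or later) on the classpath. The helper name `describeSchema` and the hand-built example schema are made up for illustration and are not part of Spark or of the answer above:

```scala
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical helper: one "name: type (nullable = ...)" line per top-level field.
def describeSchema(schema: StructType): Seq[String] =
  schema.fields.toSeq.map { f =>
    s"${f.name}: ${f.dataType.simpleString} (nullable = ${f.nullable})"
  }

// A hand-built schema stands in here for sqlContext.read.parquet(path).schema,
// so the snippet does not need an actual Parquet file.
val example = StructType(Seq(
  StructField("cust_id", StringType, nullable = true),
  StructField("foo", DoubleType, nullable = true)
))

describeSchema(example).foreach(println)
```

In the spark-shell this would be called as `describeSchema(sqlContext.read.parquet(path).schema)`, with `path` pointing at the Parquet file or directory.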

Answered by samthebest

OK, I think I have a workable way of doing it: just peek at the first row to infer the schema. (I'm not sure how elegant this is, though. What if the file happens to be empty? I'm sure there has to be a better solution.)

sqlContext.parquetFile(p).first()

At some point this prints:

{
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
 fileSchema: message schema {
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
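
On the empty-file worry: Parquet keeps its schema in the file footer, so Spark should not need any rows to report it. The sketch below tries this out by writing a zero-row DataFrame and reading its schema back. It assumes Spark 2.4 or later (where `SparkSession` exists and, as far as I know, an empty write still produces a file with a footer); older versions may behave differently. The column names and the temp directory are only illustrative:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// In spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-parquet-schema")
  .getOrCreate()

val schema = StructType(Seq(
  StructField("cust_id", StringType, nullable = true),
  StructField("foo", DoubleType, nullable = true)
))

// write.parquet requires a target path that does not exist yet.
val dir = Files.createTempDirectory("empty-parquet").resolve("data").toString

// Write a DataFrame with the schema above and zero rows.
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema).write.parquet(dir)

// Read it back: the schema comes from file metadata, not from any row.
val readBack = spark.read.parquet(dir)
val rowCount = readBack.count()
readBack.printSchema()
```

If this behaves as assumed, `first()` is not needed at all, and unlike `first()` asking for the schema does not fail on an empty file.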

Answered by sp00n3r

The result of parquetFile() is a SchemaRDD (Spark 1.2) or a DataFrame (Spark 1.3), both of which have the .printSchema() method.