Scala: creating a Hive table using Parquet file metadata

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33625617/

Creating hive table using parquet file metadata

scala, apache-spark, hive, parquet

Asked by WoodChopper

I wrote a DataFrame out as a Parquet file, and I would like to read it back with Hive, using the metadata from Parquet.

Output from the Parquet write:

_common_metadata  part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  _SUCCESS
_metadata         part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet

Hive table

CREATE  TABLE testhive
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/home/gz_files/result';



FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified

How can I infer the metadata from the Parquet files?

If I open _common_metadata, it contains the following:

PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata?{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}

Or, how do I parse the metadata file?

Answer by James Tobin

Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.

First, start a spark-shell (or compile it all into a JAR and run it with spark-submit, but the shell is so much easier):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame

// sqlContext is the (Hive-enabled) SQLContext the spark-shell provides
val df = sqlContext.parquetFile("/path/to/_common_metadata")

def creatingTableDDL(tableName: String, df: DataFrame): String = {
  val cols = df.dtypes
  var ddl1 = "CREATE EXTERNAL TABLE " + tableName + " ("
  // look at the column names and datatypes and join them into the column list
  val colCreate = (for (c <- cols) yield c._1 + " " + c._2.replace("Type", "")).mkString(", ")
  ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
  ddl1
}

val test_tableDDL = creatingTableDDL("test_table", df)

It will give you the datatypes that Hive will use for each column, as they are stored in Parquet. E.g.: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'

Answer by Victor Lam

I'd just like to expand on James Tobin's answer. There's a StructField class which provides Hive's data types without doing string replacements.

// Tested on Spark 1.6.0.

import org.apache.spark.sql.DataFrame

def dataFrameToDDL(dataFrame: DataFrame, tableName: String): String = {
    val columns = dataFrame.schema.map { field =>
        "  " + field.name + " " + field.dataType.simpleString.toUpperCase
    }

    s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}

This solves the IntegerType problem.

scala> val dataFrame = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("x", "y")
dataFrame: org.apache.spark.sql.DataFrame = [x: int, y: string]

scala> print(dataFrameToDDL(dataFrame, "t"))
CREATE TABLE t (
  x INT,
  y STRING
)

This should work with any DataFrame, not just with Parquet. (e.g., I'm using this with a JDBC DataFrame.)

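For the JDBC case, a minimal sketch of what that looks like (Spark 1.6 API; the URL, table and credentials are placeholders, and the JDBC driver must be on the classpath):

val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.some_table")
  .option("user", "user")
  .option("password", "password")
  .load()

print(dataFrameToDDL(jdbcDF, "some_table"))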

As an added bonus, if your target DDL supports nullable columns, you can extend the function by checking StructField.nullable.

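For instance, a minimal sketch of that extension, assuming your target DDL dialect accepts NOT NULL in column definitions:

def dataFrameToDDLWithNullability(dataFrame: DataFrame, tableName: String): String = {
    val columns = dataFrame.schema.map { field =>
        // emit NOT NULL only for fields Spark reports as non-nullable
        val constraint = if (field.nullable) "" else " NOT NULL"
        "  " + field.name + " " + field.dataType.simpleString.toUpperCase + constraint
    }
    s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}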

Answer by Jordi Atserias

A small improvement over Victor's answer (adding backticks around field.name), modified to bind the table to a local Parquet file (tested on Spark 1.6.1):

def dataFrameToDDL(dataFrame: DataFrame, tableName: String, absFilePath: String): String = {
  val columns = dataFrame.schema.map { field =>
    "  `" + field.name + "` " + field.dataType.simpleString.toUpperCase
  }
  s"CREATE EXTERNAL TABLE $tableName (\n${columns.mkString(",\n")}\n) STORED AS PARQUET LOCATION '" + absFilePath + "'"
}

Also notice that:

  • A HiveContext is needed, since SQLContext does not support creating external tables (see the sketch after this list).
  • The path to the Parquet folder must be an absolute path.
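Putting those notes together, a minimal sketch of wiring it up from a Spark 1.6 shell (the path and table name are illustrative):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc is the shell's SparkContext
val df = hiveContext.read.parquet("/absolute/path/to/parquet/dir")
val ddl = dataFrameToDDL(df, "my_table", "/absolute/path/to/parquet/dir")
hiveContext.sql(ddl)  // registers the external table in the Hive metastore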

Answer by Maheshbabu Vanamala

I would like to expand on James' answer.

The following code will work for all datatypes including ARRAY, MAP and STRUCT.

Tested on Spark 2.2.

// spark is the SparkSession provided by the Spark 2.2 shell
val df = spark.read.parquet("parquetFilePath")
val schema = df.schema
val columns = schema.fields
val tableName = "test_table"  // illustrative table name
var ddl1 = "CREATE EXTERNAL TABLE " + tableName + " ("
// dataType.sql renders complex types (ARRAY, MAP, STRUCT) as valid DDL
val cols = (for (column <- columns) yield column.name + " " + column.dataType.sql).mkString(",")
ddl1 = ddl1 + cols + " ) STORED AS PARQUET LOCATION '/tmp/hive_test1/'"
spark.sql(ddl1)
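For illustration, this is roughly what dataType.sql produces for nested columns (the column names are made up; the output in the comments is approximate):

case class Inner(a: Int, b: String)
val sample = Seq((1, Seq("x", "y"), Inner(2, "z"))).toDF("id", "tags", "info")
sample.schema.fields.foreach(f => println(f.name + " " + f.dataType.sql))
// id INT
// tags ARRAY<STRING>
// info STRUCT<`a`: INT, `b`: STRING>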

Answer by Tagar

I had the same question. It might be hard to implement from the practical side though, as Parquet supports schema evolution:

http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1

For example, you could add a new column to your table without having to touch data that is already in the table. Only new data files will have the new metadata (compatible with the previous version).

Schema merging has been switched off by default since Spark 1.5.0, because it is a "relatively expensive operation" (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), so inferring the most recent schema may not be as simple as it sounds. Quick-and-dirty approaches are quite possible though, e.g. by parsing the output of:

$ parquet-tools schema /home/gz_files/result/000000_0
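If you do want Spark to reconcile differing footers instead, schema merging can be re-enabled per read (standard Spark API; the path is the one from the question):

// more expensive, which is why it is off by default since 1.5
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/home/gz_files/result")
merged.printSchema()  // the reconciled schema across all part files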

Answer by Tagar

Actually, Impala supports

CREATE TABLE LIKE PARQUET

(no columns section at all):

http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-1-x/topics/impala_create_table.html

Your question is tagged "hive" and "spark", and I don't see this implemented in Hive, but if you use CDH, it may be what you were looking for.

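For reference, the Impala statement looks roughly like this (syntax per the Cloudera docs linked above; the paths reuse the question's example, so treat them as illustrative):

CREATE EXTERNAL TABLE test_table
  LIKE PARQUET '/home/gz_files/result/part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet'
  STORED AS PARQUET
  LOCATION '/home/gz_files/result';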