
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41740499/

Date: 2020-10-22 09:01:34 | Source: igfitidea

Parquet schema and Spark

Tags: java, scala, apache-spark, parquet, spark-csv

Asked by changepicture

I am trying to convert CSV files to Parquet, and I am using Spark to accomplish this.

SparkSession spark = SparkSession
    .builder()
    .appName(appName)
    .config("spark.master", master)
    .getOrCreate();

Dataset<Row> logFile = spark.read().csv("log_file.csv");
logFile.write().parquet("log_file.parquet");

Now the problem is that I don't have a schema defined, and the columns look like this (output displayed using printSchema() in Spark):

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 ....

The CSV has the column names on the first row, but I guess they're ignored. The problem is that only a few columns are strings; I also have ints and dates.

I am only using Spark, no Avro or anything else basically (I've never used Avro).

What are my options for defining a schema, and how? If I need to write the Parquet file in another way, that's no problem as long as it's a quick and easy solution.

(I am using Spark standalone for tests / I don't know Scala.)

Answered by Rajat Mishra

Try using .option("inferSchema", "true"), provided by the Spark-csv package. This will automatically infer the schema from the data.
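Applied to the question's Java code, that would look roughly like this. This is a sketch that reuses the `spark` session already built in the question's snippet; schema inference requires an extra pass over the data, so it is slower than supplying a schema explicitly:

```java
Dataset<Row> logFile = spark.read()
    .option("header", "true")       // use the first row as column names instead of _c0, _c1, ...
    .option("inferSchema", "true")  // infer int/double/timestamp types (extra pass over the data)
    .csv("log_file.csv");

logFile.printSchema();              // columns should now have typed, named fields
logFile.write().parquet("log_file.parquet");
```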

You can also define a custom schema for your data using a struct type, and use .schema(schema_name) to read the file on the basis of that custom schema.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)

// Each StructField is (name, type, nullable)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")
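Since the asker doesn't know Scala, here is a roughly equivalent end-to-end sketch in Java, assuming a Spark 2.x SparkSession and the same hypothetical column layout (year/make/model/comment/blank) as the Scala example above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("csv-to-parquet")
            .master("local[*]")
            .getOrCreate();

        // Declare column names and types explicitly instead of relying on
        // inference. Date columns can be typed as DataTypes.DateType, or read
        // as strings and cast later.
        StructType schema = DataTypes.createStructType(new StructField[]{
            DataTypes.createStructField("year", DataTypes.IntegerType, true),
            DataTypes.createStructField("make", DataTypes.StringType, true),
            DataTypes.createStructField("model", DataTypes.StringType, true),
            DataTypes.createStructField("comment", DataTypes.StringType, true),
            DataTypes.createStructField("blank", DataTypes.StringType, true)
        });

        Dataset<Row> df = spark.read()
            .option("header", "true") // skip the header row; names come from the schema
            .schema(schema)
            .csv("cars.csv");

        df.write().parquet("cars.parquet");
        spark.stop();
    }
}
```

With an explicit schema no inference pass is needed, and the resulting Parquet file carries the declared column names and types.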