
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41740499/

Date: 2020-10-22 09:01:34 | Source: igfitidea

Parquet schema and Spark

Tags: java, scala, apache-spark, parquet, spark-csv

Asked by changepicture

I am trying to convert CSV files to Parquet, and I am using Spark to accomplish this.

SparkSession spark = SparkSession
    .builder()
    .appName(appName)
    .config("spark.master", master)
    .getOrCreate();

Dataset<Row> logFile = spark.read().csv("log_file.csv");
logFile.write().parquet("log_file.parquet");

Now the problem is that I don't have a schema defined, and the columns look like this (output displayed using printSchema() in Spark):

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 ....

The CSV has the column names on the first row, but I guess they're ignored. The problem is that only a few columns are strings; I also have ints and dates.

I am only using Spark, no Avro or anything else basically (I've never used Avro).

What are my options for defining a schema, and how? If I need to write the Parquet file in another way, that's no problem as long as it's a quick and easy solution.

(I am using Spark standalone for tests / I don't know Scala.)

Answered by Rajat Mishra

Try using .option("inferSchema", "true"), provided by the Spark-csv package. This will automatically infer the schema from the data.
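Applied to the question's Java code, that would look roughly like this. This is a sketch that reuses the `spark` session already built in the question's snippet; schema inference requires an extra pass over the data, so it is slower than supplying a schema explicitly:

```java
Dataset<Row> logFile = spark.read()
    .option("header", "true")       // use the first row as column names instead of _c0, _c1, ...
    .option("inferSchema", "true")  // infer int/double/timestamp types (extra pass over the data)
    .csv("log_file.csv");

logFile.printSchema();              // columns should now have typed, named fields
logFile.write().parquet("log_file.parquet");
```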

You can also define a custom schema for your data using a struct type, and use .schema(schema_name) to read the file on the basis of that custom schema.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)

// Each StructField is (name, type, nullable)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")
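Since the asker doesn't know Scala, here is a roughly equivalent end-to-end sketch in Java, assuming a Spark 2.x SparkSession and the same hypothetical column layout (year/make/model/comment/blank) as the Scala example above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("csv-to-parquet")
            .master("local[*]")
            .getOrCreate();

        // Declare column names and types explicitly instead of relying on
        // inference. Date columns can be typed as DataTypes.DateType, or read
        // as strings and cast later.
        StructType schema = DataTypes.createStructType(new StructField[]{
            DataTypes.createStructField("year", DataTypes.IntegerType, true),
            DataTypes.createStructField("make", DataTypes.StringType, true),
            DataTypes.createStructField("model", DataTypes.StringType, true),
            DataTypes.createStructField("comment", DataTypes.StringType, true),
            DataTypes.createStructField("blank", DataTypes.StringType, true)
        });

        Dataset<Row> df = spark.read()
            .option("header", "true") // skip the header row; names come from the schema
            .schema(schema)
            .csv("cars.csv");

        df.write().parquet("cars.parquet");
        spark.stop();
    }
}
```

With an explicit schema no inference pass is needed, and the resulting Parquet file carries the declared column names and types.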