How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?
Disclaimer: this translation of a popular StackOverflow question is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must note the original URL and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/36007686/
Asked by Daniel Zolnai
I'm terribly new to Spark, Hive, big data, Scala, and all of that. I'm trying to write a simple function that takes an sqlContext, loads a csv file from S3, and returns a DataFrame. The problem is that this particular csv uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might use as a delimiter.
I know that the spark-csv package that I'm using has a delimiter option, but I don't know how to set it so that it will read \001 as one character and not something like an escaped 0, 0 and 1. Perhaps I should use hiveContext or something?
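Note on the escaping worry: in Scala, the unicode escape "\u0001" in a string literal already denotes the single U+0001 (^A) control character, not a backslash followed by digits. A quick check in the REPL:

val delim = "\u0001"
println(delim.length)          // 1 -- a single character
println(delim.charAt(0).toInt) // 1 -- the ^A control code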
Answered by Daniel Zolnai
If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted). Use it like this:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", "\u0001") // \u0001 == ^A; Scala reads this escape as one character
  .load("cars.csv")
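For a quick sanity check, here is a minimal sketch, assuming a Spark 1.x shell where sqlContext is predefined, the spark-csv package is on the classpath, and local mode (the /tmp path is purely illustrative). It writes a tiny ^A-delimited file and reads it back:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Write a tiny ^A-delimited sample file (hypothetical path, local mode only)
val sample = "make\u0001price\nVolvo\u000112000\nBMW\u000135000\n"
Files.write(Paths.get("/tmp/cars_ctrl_a.csv"), sample.getBytes(StandardCharsets.UTF_8))

val parsed = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\u0001")
  .load("/tmp/cars_ctrl_a.csv")

parsed.show() // expect two rows with columns make and price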
Answered by Mark Rajcok
With Spark 2.x and the built-in CSV API, use the sep option:
val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")
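Since the question mentions the dataset is huge, supplying an explicit schema avoids the extra pass over the data that inferSchema would trigger. A hedged sketch, assuming a SparkSession named spark and hypothetical column names:

import org.apache.spark.sql.types._

// Hypothetical schema: adjust field names and types to the actual data
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("value", StringType, nullable = true)
))

val df = spark.read
  .schema(schema)          // skip schema inference over a huge dataset
  .option("sep", "\u0001") // single ^A control character
  .csv("path_to_csv_files")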

