How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?
Disclaimer: this translation of a popular StackOverflow question is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must note the original URL and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/36007686/
Asked by Daniel Zolnai
I'm terribly new to Spark, Hive, big data, Scala, and all of that. I'm trying to write a simple function that takes an sqlContext, loads a csv file from S3, and returns a DataFrame. The problem is that this particular csv uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might use as a delimiter.
I know that the spark-csv package that I'm using has a delimiter option, but I don't know how to set it so that it will read \001 as one character and not something like an escaped 0, 0 and 1. Perhaps I should use hiveContext or something?
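Note on the escaping worry: in Scala, the unicode escape "\u0001" in a string literal already denotes the single U+0001 (^A) control character, not a backslash followed by digits. A quick check in the REPL:

val delim = "\u0001"
println(delim.length)          // 1 -- a single character
println(delim.charAt(0).toInt) // 1 -- the ^A control code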
Answered by Daniel Zolnai
If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted). Use it like this:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", "\u0001") // \u0001 == ^A; Scala reads this escape as one character
  .load("cars.csv")
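For a quick sanity check, here is a minimal sketch, assuming a Spark 1.x shell where sqlContext is predefined, the spark-csv package is on the classpath, and local mode (the /tmp path is purely illustrative). It writes a tiny ^A-delimited file and reads it back:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Write a tiny ^A-delimited sample file (hypothetical path, local mode only)
val sample = "make\u0001price\nVolvo\u000112000\nBMW\u000135000\n"
Files.write(Paths.get("/tmp/cars_ctrl_a.csv"), sample.getBytes(StandardCharsets.UTF_8))

val parsed = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\u0001")
  .load("/tmp/cars_ctrl_a.csv")

parsed.show() // expect two rows with columns make and price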
Answered by Mark Rajcok
With Spark 2.x and the built-in CSV API, use the sep option:
val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")
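Since the question mentions the dataset is huge, supplying an explicit schema avoids the extra pass over the data that inferSchema would trigger. A hedged sketch, assuming a SparkSession named spark and hypothetical column names:

import org.apache.spark.sql.types._

// Hypothetical schema: adjust field names and types to the actual data
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("value", StringType, nullable = true)
))

val df = spark.read
  .schema(schema)          // skip schema inference over a huge dataset
  .option("sep", "\u0001") // single ^A control character
  .csv("path_to_csv_files")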

