How do you write a dataframe/RDD to a custom-delimiter (Ctrl-A delimited) file in Spark Scala?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48077756/

Date: 2020-10-22 09:31:51  Source: igfitidea

How do you write a dataframe/RDD with a custom delimiter (ctrl-A delimited) file in spark scala?

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Amit

I am working on a POC in which I need to create a dataframe and then save it as a Ctrl-A delimited file. My query to create the intermediate result is below:

val grouped = results
  .groupBy("club_data", "student_id_add", "student_id")
  .agg(
    sum(results("amount").cast(IntegerType)).as("amount"),
    count("amount").as("cnt"))
  .filter((length(trim($"student_id")) > 1) && ($"student_id").isNotNull)

Saving the result in a text file:

grouped.select($"club_data", $"student_id_add", $"amount",$"cnt").rdd.saveAsTextFile("/amit/spark/output4/")

Output:

 [amit,DI^A356035,581,1]

It saves the data comma-separated, but I need it Ctrl-A separated. I tried option("delimiter", "\u0001"), but it seems that is not supported by the dataframe/RDD.

Is there any function which helps?

有什么功能可以帮助吗?

Answered by ktheitroadalo

If you have a dataframe, you can use Spark-CSV to write it as a CSV with a custom delimiter, as below.

df.write.mode(SaveMode.Overwrite).option("delimiter", "\u0001").csv("outputCSV")

With older versions of Spark:

df.write
    .format("com.databricks.spark.csv")
    .option("delimiter", "\u0001")
    .mode(SaveMode.Overwrite)
    .save("outputCSV")

You can read it back as below:

spark.read.option("delimiter", "\u0001").csv("outputCSV").show()
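Putting the write and read steps together, here is a minimal end-to-end sketch (my addition, not from the original answer: it assumes a local SparkSession named spark, and the column names and output path are illustrative):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal round trip: write a Ctrl-A delimited file, then read it back.
// Assumes a local SparkSession; sample data and path are made up.
val spark = SparkSession.builder().master("local[*]").appName("ctrlA").getOrCreate()
import spark.implicits._

val df = Seq(("amit", "DI", 581, 1))
  .toDF("club_data", "student_id_add", "amount", "cnt")

// Write each row joined by the \u0001 (Ctrl-A) character.
df.write.mode(SaveMode.Overwrite).option("delimiter", "\u0001").csv("outputCSV")

// Read it back; columns come back as _c0.._c3 unless a schema or header is supplied.
val back = spark.read.option("delimiter", "\u0001").csv("outputCSV")
back.show()
```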

If you have an RDD, you can use the mkString() function on the RDD and save it with saveAsTextFile():

rdd.map(r => r.mkString("\u0001")).saveAsTextFile("outputCSV")
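As a quick sanity check of the joining step (my addition; plain Scala, no Spark required, and the row values are made up):

```scala
// mkString("\u0001") joins a row's fields with the Ctrl-A control character.
val row = Seq("amit", "DI", 581, 1)
val line = row.mkString("\u0001")

// Splitting on the same character recovers the fields.
assert(line.split('\u0001').toSeq == Seq("amit", "DI", "581", "1"))
```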

Hope this helps!


Answered by Ishan Kumar

df.rdd.map(x=>x.mkString("^A")).saveAsTextFile("file:/home/iot/data/stackOver")
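A caveat worth noting (my addition, not part of the original answer): "^A" here is the literal two-character string caret + A, which merely looks like the way terminals display Ctrl-A. To emit the actual \u0001 control character, use the Unicode escape instead:

```scala
// "\u0001" is the single Ctrl-A control character; "^A" is two visible characters.
df.rdd.map(x => x.mkString("\u0001")).saveAsTextFile("file:/home/iot/data/stackOver")
```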

Answered by Arnon Rotem-Gal-Oz

Convert the rows to text before saving (note the .rdd: in Spark 2.x, Dataset has no saveAsTextFile method):

grouped.select($"club_data", $"student_id_add", $"amount", $"cnt").rdd.map(row => row.mkString("\u0001")).saveAsTextFile("/amit/spark/output4/")