saveAsTextFile method in Spark (Scala)

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/27718325/

Date: 2020-10-22 06:48:30  Source: igfitidea

saveAsTextFile method in spark

scala, apache-spark

Asked by kemiya

In my project I have three input files whose names are passed as args(0) to args(2), and an output file name passed as args(3). In the source code I use

val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))

I do nothing to the log except save it as a text file using

log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it still saves to three files: part-00000, part-00001 and part-00002. So is there any way to save the three input files into a single output file?

Answered by xhudik

Having multiple output files is standard behavior for multi-machine frameworks like Hadoop or Spark. The number of output files depends on the number of reducers.

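For example, a minimal sketch (assuming a hypothetical rdd variable and placeholder output paths) of how the partition count maps to part-XXXXX files:

// each partition becomes one part-XXXXX file in the output directory
rdd.saveAsTextFile("out-many")             // N partitions -> N part files
rdd.coalesce(1).saveAsTextFile("out-one")  // 1 partition  -> a single part-00000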

How to "solve" it in Hadoop: merge output files after reduce phase

How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple file?

You can also find good information here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html

So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his comment), your code works if you run it locally.

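One commonly suggested alternative is to write the output in parallel and merge the part files afterwards. A rough sketch, assuming a Hadoop 2.x client where FileUtil.copyMerge is still available, and using placeholder paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

log.saveAsTextFile("out-parts")  // save with the default partitioning (many part files)
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// merge out-parts/part-* into a single file; false = keep the source directory
FileUtil.copyMerge(fs, new Path("out-parts"), fs, new Path("out-merged"), false, hadoopConf, null)

This keeps the parallel write fast and only pays the single-file cost once, at merge time.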
What you might try is to read the files first and then save the output.

...
val sc = new SparkContext()
val str = new StringBuilder
for (i <- 0 until args.size - 1) {
  // collect() pulls each file's lines back to the driver (only feasible for small inputs)
  sc.textFile(args(i)).collect().foreach(line => str.append(line).append("\n"))
}
// and now you might save the content: wrap it in a single-partition RDD and write one file
sc.parallelize(Seq(str.toString), 1).saveAsTextFile("out")

Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files, but rather process the multiple output files instead.

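If you do keep multiple part files, note that Spark reads a whole output directory back as a single RDD, so downstream jobs usually do not care how many part files there are. A small sketch with a placeholder path:

// textFile accepts a directory (or a glob such as "out/part-*") and reads every part file in it
val merged = sc.textFile("out")
println(merged.count())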
Answered by Steve

As mentioned, your problem is somewhat unavoidable via the standard APIs, since the assumption is that you are dealing with large quantities of data. However, if your data is manageable, you could try the following:

import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets

// data is the RDD; collect() pulls all its lines to the driver and mkString joins them
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))

What I am doing here is converting the RDD into a String by performing a collect and then mkString. I would suggest not doing this in production. It works fine for local data analysis (working with ~5 GB of local data).
