scala: Writing files to the local file system using Spark in cluster mode

Question

Asked by tkrhgch

I know this is a weird way of using Spark, but I'm trying to save a dataframe to the local file system (not HDFS) using Spark even though I'm in cluster mode. I know I can use client mode, but I do want to run in cluster mode and don't care which node (out of 3) the application is going to run on as driver. The code below is pseudocode for what I'm trying to do.

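// assumed definition of Foo; it is not shown in the original post
case class Foo(firstName: String, lastName: String)
// create dataframe
val df = Seq(Foo("John", "Doe"), Foo("Jane", "Doe")).toDF()
// save it to the local file system using 'file://' because it defaults to hdfs://
// (an absolute local path needs three slashes: file:///...)
df.coalesce(1).rdd.saveAsTextFile("file:///path/to/file")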

And this is how I'm submitting the Spark application.

spark-submit --class sample.HBaseSparkRSample --master yarn-cluster hbase-spark-r-sample-assembly-1.0.jar

This works fine in local mode but doesn't in yarn-cluster mode.

For example, java.io.IOException: Mkdirs failed to create file occurs with the above code.

I've changed the df.coalesce(1) part to df.collect and attempted to save a file using plain Scala, but it ended up with a Permission denied.
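
The "collect, then write with plain Scala" attempt described above might look roughly like the sketch below. It is an illustration, not the asker's actual code: LocalWrite and writeLines are hypothetical names, and the Spark side appears only as a comment. A write like this runs on the driver and is still subject to the file permissions of whatever user the driver process runs as, so on its own it would not avoid a Permission denied.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Hypothetical helper: writes already-collected rows to a local file.
object LocalWrite {
  def writeLines(lines: Seq[String], target: Path): Path = {
    // Create missing parent directories; this step fails when the process
    // user lacks permission somewhere along the path.
    Option(target.getParent).foreach(p => Files.createDirectories(p))
    Files.write(target, lines.asJava, StandardCharsets.UTF_8)
  }
}

// In the Spark application (assumption, not from the original post):
//   val lines = df.collect().map(_.mkString(",")).toSeq
//   LocalWrite.writeLines(lines, java.nio.file.Paths.get("/path/to/file"))
```

Collecting pulls every row into the driver's memory, so this only suits small results such as the two-row dataframe in the question.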

I've also tried:

running spark-submit as the root user
chown to yarn:yarn, yarn:hadoop, spark:spark
chmod 777 on the related directories

but no luck.

I'm assuming this has something to do with clusters, drivers and executors, and the user who's trying to write to the local file system, but I'm pretty much stuck trying to solve this problem by myself.

I'm using:

Spark: 1.6.0-cdh5.8.2
Scala: 2.10.5
Hadoop: 2.6.0-cdh5.8.2

Any support is welcome. Thanks in advance.

Some articles I've tried:

"Spark saveAsTextFile() results in Mkdirs failed to create for half of the directory" -> Tried changing users but nothing changed
"Failed to save RDD as text file to local file system" -> chmod didn't help me

Edited (2016/11/25)

This is the exception I get.

java.io.IOException: Mkdirs failed to create file:/home/foo/work/rhbase/r/input/input.csv/_temporary/0/_temporary/attempt_201611242024_0000_m_000000_0 (exists=false, cwd=file:/yarn/nm/usercache/foo/appcache/application_1478068613528_0143/container_e87_1478068613528_0143_01_000001)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:920)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:813)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$$anonfun.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$$anonfun.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/11/24 20:24:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: Mkdirs failed to create file:/home/foo/work/rhbase/r/input/input.csv/_temporary/0/_temporary/attempt_201611242024_0000_m_000000_0 (exists=false, cwd=file:/yarn/nm/usercache/foo/appcache/application_1478068613528_0143/container_e87_1478068613528_0143_01_000001)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:920)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:813)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$$anonfun.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$$anonfun.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Answer 1

Answered by tkrhgch

I'm going to answer my own question because, in the end, none of the answers seemed to solve my problem. Nonetheless, thanks for all the answers and for pointing me to alternatives I can check.

I think @Ricardo was the closest in mentioning the user of the Spark application. I checked whoami with Process("whoami") and the user was yarn. The problem was probably that I tried to output to /home/foo/work/rhbase/r/input/input.csv, and although /home/foo/work/rhbase was owned by yarn:yarn, /home/foo was owned by foo:foo. I haven't checked in detail, but this may have been the cause of the permission problem.
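
The suspicion above (a parent directory owned by another user blocking the write) could be checked from inside the application with something like the following sketch. PathCheck and its methods are hypothetical names, and the check uses only java.nio.file, so it reports what the JVM's own process user can do, which is the relevant user here. It assumes that being unable to traverse an existing ancestor, or to write to the deepest existing one, is what stops the directories from being created.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper for locating the directory that blocks a local write.
object PathCheck {
  // Existing directories on the way to `target`, ordered from the root down.
  def existingDirs(target: Path): List[Path] =
    Iterator.iterate(target.toAbsolutePath.normalize)(_.getParent)
      .takeWhile(_ != null)
      .filter(p => Files.isDirectory(p))
      .toList
      .reverse

  // First directory the current process cannot traverse, or else the deepest
  // existing directory if new entries cannot be created in it.
  def firstBlocker(target: Path): Option[Path] = {
    val dirs = existingDirs(target)
    dirs.find(d => !Files.isExecutable(d))
      .orElse(dirs.lastOption.filter(d => !Files.isWritable(d)))
  }
}
```

Calling PathCheck.firstBlocker(java.nio.file.Paths.get("/home/foo/work/rhbase/r/input/input.csv")) from within the Spark application would run the check as the application's user rather than as the user who submitted the job.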

When I ran pwd in my Spark application with Process("pwd"), it output /yarn/path/to/somewhere. So I decided to output my file to /yarn/input.csv, and it was successful despite being in cluster mode.
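
The two checks mentioned in this answer can be wrapped up as in the sketch below. The Process("whoami") and Process("pwd") calls come from the answer itself; the RuntimeIdentity object and its method names are hypothetical, and the code assumes a Unix-like node where both commands are on the PATH.

```scala
import scala.sys.process._

// Hypothetical wrapper around the checks described in the answer.
object RuntimeIdentity {
  // OS user the JVM process runs as (the answer reports "yarn" here).
  def currentUser(): String = Process("whoami").!!.trim

  // Working directory of the process, e.g. the YARN container directory.
  def workingDir(): String = Process("pwd").!!.trim
}
```

Logging both values at the start of the job shows which user needs write access and which directories that user already owns.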

I can probably conclude that this was just a simple permission issue. Any further solutions would be welcome, but for now this is how I solved the problem.

Answer 2

Answered by Nirmal Ram

If you run the job in yarn-cluster mode, the driver can run on any of the machines managed by YARN, so if saveAsTextFile is given a local file path, the output is stored on whichever machine the driver is running on.

Try running the job in yarn-client mode so that the driver runs on the client machine.
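
Because the node is not known in advance in cluster mode, one way to find out where a local file ended up is to log the host name from the code that performs the write. The sketch below is a hedged illustration: WhereAmI is a hypothetical name, and it falls back to "unknown" if the host name cannot be resolved. Note that the stack trace in the question passes through Executor$TaskRunner, so code inside a task such as saveAsTextFile reports the executor's host, while code outside any task reports the driver's host.

```scala
import java.net.InetAddress
import scala.util.Try

// Hypothetical helper: name of the machine this code is executing on.
object WhereAmI {
  def hostName(): String =
    Try(InetAddress.getLocalHost.getHostName).getOrElse("unknown")
}
```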

Answer 3

Answered by SanthoshPrasad

Use the foreachPartition method, then for each partition get a FileSystem object and write the records to it one by one. Below is sample code; here I am writing to HDFS, but you can use the local file system instead.

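// Assumes finalOutPathLocation (String), finalOutPath (org.apache.hadoop.fs.Path),
// hadoopConf (Configuration) and LOGGER are defined by the surrounding code.
Dataset<String> ds=....

ds.toJavaRDD().foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> iterator) throws Exception {

        final FileSystem hdfsFileSystem = FileSystem.get(URI.create(finalOutPathLocation), hadoopConf);

        // Append if the file already exists, otherwise create it.
        final FSDataOutputStream fsDataOutPutStream = hdfsFileSystem.exists(finalOutPath)
                ? hdfsFileSystem.append(finalOutPath) : hdfsFileSystem.create(finalOutPath);

        long processedRecCtr = 0;
        long failedRecsCtr = 0;

        try {
            while (iterator.hasNext()) {
                try {
                    fsDataOutPutStream.writeUTF(iterator.next());
                    processedRecCtr++;
                } catch (Exception e) {
                    failedRecsCtr++;
                }
                // Flush after every 3000 successfully written records.
                if (processedRecCtr > 0 && processedRecCtr % 3000 == 0) {
                    LOGGER.info("Flushing Records");
                    fsDataOutPutStream.flush();
                }
            }
        } finally {
            // Close the stream even if iteration fails part-way.
            fsDataOutPutStream.close();
        }
    }
});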

Answer 4

Answered by akaHuman

Please refer to the Spark documentation to understand the use of the --master option in spark-submit.

--master local is supposed to be used when running locally.
--master yarn --deploy-mode cluster is supposed to be used when actually running on a YARN cluster.

Refer to this and this.

Answer 5

Answered by Ricardo

Check whether you are trying to run the job or write the file as a user other than the Spark service user. In that situation you can solve the permission issue by presetting the directory ACLs. Example:

setfacl -d -m group:spark:rwx /path/to/

(change "spark" to the user group that is trying to write the file)
