Scala Spark: Repartition strategy after reading a text file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/28127119/
Spark: Repartition strategy after reading text file
Asked by Stephane
I have launched my cluster this way:
/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster --num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar
The first thing I do is read a big text file, and count it:
val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())
When doing this, I see that only one of my nodes is actually reading the file and executing the count (because I only see one task). Is that expected? Should I repartition my RDD afterwards, or when I use map reduce functions, will Spark do it for me?
Answered by Nick Chammas
It looks like you're working with a gzipped file.
Quoting from my answer here:
I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.
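A quick way to confirm this (a minimal sketch, reusing the path from the question) is to check the partition count right after loading:

// A single .gz file is not splittable, so Spark reads it with one task,
// which leaves the resulting RDD with exactly one partition.
val gzipped = sc.textFile("/path/to/file.txt.gz")
println(gzipped.partitions.length)  // prints 1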
You need to explicitly repartition the RDD after loading it so that more tasks can run on it in parallel.
For example:
val file = sc.textFile("/path/to/file.txt.gz").repartition(sc.defaultParallelism * 3)
println(file.count())
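A side note on the multiplier: with the spark-submit command from the question (3 executors with 4 cores each), sc.defaultParallelism will typically be 12, so the snippet above yields roughly 36 partitions. repartition does a full shuffle, but that one-time cost is what lets the subsequent count (and any later stages) run across all of the cluster's cores instead of in a single task.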
Regarding the comments on your question, the reason setting minPartitions doesn't help here is that a gzipped file is not splittable, so Spark will always use 1 task to read the file.
If you set minPartitions when reading a regular text file, or a file compressed with a splittable compression format like bzip2, you'll see that Spark will actually deploy that number of tasks in parallel (up to the number of cores available in your cluster) to read the file.
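As a rough sketch of that contrast (assuming an uncompressed copy of the same file exists alongside the gzipped one):

// On a splittable input (plain text here), the minPartitions hint is honoured,
// so a sufficiently large file is read by many tasks in parallel.
val plain = sc.textFile("/path/to/file.txt", minPartitions = 12)
println(plain.partitions.length)  // typically >= 12 for a large file

// The same hint on a gzipped file changes nothing: still a single partition.
val gz = sc.textFile("/path/to/file.txt.gz", minPartitions = 12)
println(gz.partitions.length)     // 1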

