Scala Spark: Repartition strategy after reading a text file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/28127119/
Spark: Repartition strategy after reading text file
Asked by Stephane
I have launched my cluster this way:
/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster --num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar
The first thing I do is read a big text file, and count it:
val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())
When doing this, I see that only one of my nodes is actually reading the file and executing the count (because I only see one task). Is that expected? Should I repartition my RDD afterwards, or when I use map reduce functions, will Spark do it for me?
Answered by Nick Chammas
It looks like you're working with a gzipped file.
Quoting from my answer here:
I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.
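A quick way to confirm this (a minimal sketch, reusing the path from the question) is to check the partition count right after loading:

// A single .gz file is not splittable, so Spark reads it with one task,
// which leaves the resulting RDD with exactly one partition.
val gzipped = sc.textFile("/path/to/file.txt.gz")
println(gzipped.partitions.length)  // prints 1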
You need to explicitly repartition the RDD after loading it so that more tasks can run on it in parallel.
For example:
val file = sc.textFile("/path/to/file.txt.gz").repartition(sc.defaultParallelism * 3)
println(file.count())
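A side note on the multiplier: with the spark-submit command from the question (3 executors with 4 cores each), sc.defaultParallelism will typically be 12, so the snippet above yields roughly 36 partitions. repartition does a full shuffle, but that one-time cost is what lets the subsequent count (and any later stages) run across all of the cluster's cores instead of in a single task.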
Regarding the comments on your question, the reason setting minPartitions doesn't help here is that a gzipped file is not splittable, so Spark will always use 1 task to read the file.
If you set minPartitions when reading a regular text file, or a file compressed with a splittable compression format like bzip2, you'll see that Spark will actually deploy that number of tasks in parallel (up to the number of cores available in your cluster) to read the file.
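As a rough sketch of that contrast (assuming an uncompressed copy of the same file exists alongside the gzipped one):

// On a splittable input (plain text here), the minPartitions hint is honoured,
// so a sufficiently large file is read by many tasks in parallel.
val plain = sc.textFile("/path/to/file.txt", minPartitions = 12)
println(plain.partitions.length)  // typically >= 12 for a large file

// The same hint on a gzipped file changes nothing: still a single partition.
val gz = sc.textFile("/path/to/file.txt.gz", minPartitions = 12)
println(gz.partitions.length)     // 1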

