Spark job throwing "java.lang.OutOfMemoryError: GC overhead limit exceeded"
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/30853129/
Spark job throwing "java.lang.OutOfMemoryError: GC overhead limit exceeded"
Asked by diplomaticguru
I have a Spark job that throws "java.lang.OutOfMemoryError: GC overhead limit exceeded".
The job is trying to process a 4.5 GB file.
I've tried the following Spark configuration:
--num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G
I tried increasing the number of cores and executors, which sometimes works, but it takes over 20 minutes to process the file.
Is there something I could do to improve the performance, or to stop the Java heap issue?
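For reference, the submit flags quoted above map onto Spark configuration properties. Below is a minimal Scala sketch of the programmatic equivalent, with a placeholder application name; driver memory is best left on the spark-submit command line, since the driver JVM is already running when this code executes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Programmatic equivalent of the submit flags quoted in the question.
val conf = new SparkConf()
  .setAppName("large-file-job")           // placeholder application name
  .set("spark.executor.instances", "6")   // --num-executors 6 (on YARN)
  .set("spark.executor.memory", "6g")     // --executor-memory 6G
  .set("spark.executor.cores", "6")       // --executor-cores 6
  // --driver-memory 3G has no effective programmatic equivalent here,
  // because the driver JVM has already started; keep it on spark-submit.

val sc = new SparkContext(conf)
```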
Answered by Vijay Innamuri
The only solution is to fine-tune the configuration.
From my experience, I can offer the following points on OOM (a configuration sketch follows the list):
- Cache an RDD only if you are going to use it more than once. Even if you do need to cache, first analyze the data and the application with respect to the available resources.
- If your cluster has enough memory, increase spark.executor.memory towards its maximum.
- Increase the number of partitions to increase parallelism.
- Increase the memory dedicated to caching, spark.storage.memoryFraction. If a lot of shuffle memory is involved, try to avoid that or split the allocation carefully.
- Spark's Persist(MEMORY_AND_DISK) caching is available at the cost of additional processing (serializing, writing and reading back the data); CPU usage is usually quite high in this case.
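To make these points concrete, here is a minimal Scala sketch. The application name, input path, partition count and the spark.storage.memoryFraction value are assumptions for illustration (the latter belongs to the Spark 1.x memory model this answer refers to), not values taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Resource-related knobs from the answer (Spark 1.x style settings).
val conf = new SparkConf()
  .setAppName("oom-tuning-sketch")              // placeholder application name
  .set("spark.executor.memory", "6g")           // raise towards what the cluster can spare
  .set("spark.storage.memoryFraction", "0.4")   // memory fraction dedicated to caching

val sc = new SparkContext(conf)

// More partitions mean smaller tasks and less memory pressure per task.
val lines = sc.textFile("hdfs:///data/input.txt", 200)   // placeholder path, 200 partitions
// Alternatively, repartition an existing RDD to raise parallelism.
val repartitioned = lines.repartition(200)

// Cache only if the RDD is reused; MEMORY_AND_DISK spills to disk instead of failing,
// at the cost of serialization and disk I/O.
repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
```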
Answered by SanS
You can try increasing the driver-memory. If you don't have enough memory overall, you may be able to take some from executor-memory.
Check the Spark UI (available on port 4040) to see what the scheduler delay is. If the scheduler delay is high, quite often the driver is shipping a large amount of data to the executors, which needs to be fixed.
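A frequent cause of the driver shipping large amounts of data is a big driver-side object captured by task closures. Below is a minimal sketch of the broadcast-variable fix, with placeholder data and input path; it is an illustration, not code from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch")) // placeholder app name

// Stand-in for a large lookup table that lives on the driver.
val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)

val keys = sc.textFile("hdfs:///data/keys.txt")   // placeholder input path

// Problematic: `lookup` would be serialized into every task closure.
// val counts = keys.map(k => lookup.getOrElse(k, 0))

// Better: broadcast once, so each executor receives the table a single time.
val lookupB = sc.broadcast(lookup)
val counts = keys.map(k => lookupB.value.getOrElse(k, 0))
```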