Spark job throwing "java.lang.OutOfMemoryError: GC overhead limit exceeded"
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/30853129/
Spark job throwing "java.lang.OutOfMemoryError: GC overhead limit exceeded"
Asked by diplomaticguru
I have a Spark job that throws "java.lang.OutOfMemoryError: GC overhead limit exceeded".
The job is trying to process a 4.5 GB file.
I've tried the following Spark configuration:
--num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G
I tried increasing the number of cores and executors, which sometimes works, but it takes over 20 minutes to process the file.
Is there something I could do to improve the performance, or to stop the Java heap issue?
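For reference, the submit flags quoted above map onto Spark configuration properties. Below is a minimal Scala sketch of the programmatic equivalent, with a placeholder application name; driver memory is best left on the spark-submit command line, since the driver JVM is already running when this code executes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Programmatic equivalent of the submit flags quoted in the question.
val conf = new SparkConf()
  .setAppName("large-file-job")           // placeholder application name
  .set("spark.executor.instances", "6")   // --num-executors 6 (on YARN)
  .set("spark.executor.memory", "6g")     // --executor-memory 6G
  .set("spark.executor.cores", "6")       // --executor-cores 6
  // --driver-memory 3G has no effective programmatic equivalent here,
  // because the driver JVM has already started; keep it on spark-submit.

val sc = new SparkContext(conf)
```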
Answered by Vijay Innamuri
The only solution is to fine-tune the configuration.
From my experience, I can offer the following points on OOM (a configuration sketch follows the list):
- Cache an RDD only if you are going to use it more than once. Even if you do need to cache, first analyze the data and the application with respect to the available resources.
- If your cluster has enough memory, increase spark.executor.memory towards its maximum.
- Increase the number of partitions to increase parallelism.
- Increase the memory dedicated to caching, spark.storage.memoryFraction. If a lot of shuffle memory is involved, try to avoid that or split the allocation carefully.
- Spark's Persist(MEMORY_AND_DISK) caching is available at the cost of additional processing (serializing, writing and reading back the data); CPU usage is usually quite high in this case.
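To make these points concrete, here is a minimal Scala sketch. The application name, input path, partition count and the spark.storage.memoryFraction value are assumptions for illustration (the latter belongs to the Spark 1.x memory model this answer refers to), not values taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Resource-related knobs from the answer (Spark 1.x style settings).
val conf = new SparkConf()
  .setAppName("oom-tuning-sketch")              // placeholder application name
  .set("spark.executor.memory", "6g")           // raise towards what the cluster can spare
  .set("spark.storage.memoryFraction", "0.4")   // memory fraction dedicated to caching

val sc = new SparkContext(conf)

// More partitions mean smaller tasks and less memory pressure per task.
val lines = sc.textFile("hdfs:///data/input.txt", 200)   // placeholder path, 200 partitions
// Alternatively, repartition an existing RDD to raise parallelism.
val repartitioned = lines.repartition(200)

// Cache only if the RDD is reused; MEMORY_AND_DISK spills to disk instead of failing,
// at the cost of serialization and disk I/O.
repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
```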
Answered by SanS
You can try increasing the driver-memory. If you don't have enough memory overall, you may be able to take some from executor-memory.
Check the Spark UI (available on port 4040) to see what the scheduler delay is. If the scheduler delay is high, quite often the driver is shipping a large amount of data to the executors, which needs to be fixed.
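A frequent cause of the driver shipping large amounts of data is a big driver-side object captured by task closures. Below is a minimal sketch of the broadcast-variable fix, with placeholder data and input path; it is an illustration, not code from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch")) // placeholder app name

// Stand-in for a large lookup table that lives on the driver.
val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)

val keys = sc.textFile("hdfs:///data/keys.txt")   // placeholder input path

// Problematic: `lookup` would be serialized into every task closure.
// val counts = keys.map(k => lookup.getOrElse(k, 0))

// Better: broadcast once, so each executor receives the table a single time.
val lookupB = sc.broadcast(lookup)
val counts = keys.map(k => lookupB.value.getOrElse(k, 0))
```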