scala - Spark Error: executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) at StackOverflow.
Original question: http://stackoverflow.com/questions/47907561/
Spark Error: executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Asked by Vishal
I am working with the following Spark config:
maxCores = 5
driverMemory=2g
executorMemory=17g
executorInstances=100
Issue: Out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. I even tried setting the executor count to 250, and still only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it.
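For context, here is a minimal Scala sketch of the setup being described. The table name is hypothetical, and maxCores is assumed to mean cores per executor; note that driver memory normally has to be passed to spark-submit (--driver-memory), since the driver JVM is already running when this code executes.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the reported settings (assumptions noted in the lead-in).
val spark = SparkSession.builder()
  .appName("hive-table-count")
  .config("spark.executor.instances", "100")
  .config("spark.executor.memory", "17g")
  .config("spark.executor.cores", "5") // assuming maxCores = cores per executor
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("db.partitioned_table") // hypothetical table name
println(s"count = ${df.count()}")
```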
Please help me understand what is causing the executors to be killed.
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called
Not sure why YARN is killing my executors.
Answered by maffe
I faced a similar issue where investigating the NodeManager logs led me to the root cause. You can access them via the web interface at
nodeManagerAddress:PORT/logs
The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).
My investigation workflow:
- Collect the logs (yarn logs ... command; see the sketch after this list)
- Identify the node and container emitting the error (in these logs)
- Search the NodeManager logs by the timestamp of the error for the root cause
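A rough Scala sketch of the first two steps, assuming the yarn CLI is on the PATH (the application id is a placeholder):

```scala
import scala.sys.process._

// Fetch the aggregated YARN logs for the application and surface the
// SIGTERM lines, which name the host and container worth investigating.
val appId = "application_XXXXXXXXXXXXX_XXXX" // placeholder
val logs  = Seq("yarn", "logs", "-applicationId", appId).!!
logs.split("\n").filter(_.contains("RECEIVED SIGNAL")).foreach(println)
```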
By the way, you can access the aggregated collection (XML) of all configurations affecting a node at the same port with:
nodeManagerAddress:PORT/conf
Answered by JumpMan
I believe this issue has more to do with memory and the dynamic allocation timeouts at the executor/container level. Make sure you can change the config params at the executor/container level.
One way to resolve this issue is by changing this config value, either in your spark-shell or in your Spark job:
spark.dynamicAllocation.executorIdleTimeout
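For instance, here is a hedged sketch of raising that timeout when building the session; the 300s value is only an illustration (the stock default is 60s), and on spark-shell the same key can be passed via --conf.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: keep idle executors alive longer so dynamic allocation does
// not reclaim them between stages. 300s is an arbitrary example value.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "300s")
  .getOrCreate()
```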
This thread has more detailed information on how to resolve the issue, which worked for me: https://jira.apache.org/jira/browse/SPARK-21733

