Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/47907561/

Spark Error : executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Tags: scala, apache-spark

Asked by Vishal

I am working with the following Spark config:

maxCores = 5
driverMemory = 2g
executorMemory = 17g
executorInstances = 100
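
For reference, a minimal sketch of how these settings might map onto standard Spark configuration keys when building a session. The app name is a placeholder, the maxCores-to-spark.executor.cores mapping is an assumption, and driver memory normally has to be set before the driver JVM starts (e.g. via spark-submit) rather than in code:

    // A hedged sketch, assuming the settings above correspond to the
    // standard Spark configuration keys noted in the comments.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PartitionCountJob")              // placeholder name
      .config("spark.executor.cores", "5")       // maxCores (assumed mapping)
      .config("spark.driver.memory", "2g")       // driverMemory
      .config("spark.executor.memory", "17g")    // executorMemory
      .config("spark.executor.instances", "100") // executorInstances
      .getOrCreate()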

Issue: out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. Even after setting the executor count to 250, only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it (see the sketch below).

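A hedged sketch of that workload, using the session from the snippet above (or the spark value predefined in spark-shell); the table name is a placeholder, not from the original post:

    // Load a partitioned Hive table and count its rows;
    // "my_db.my_partitioned_table" is a placeholder name.
    val df = spark.table("my_db.my_partitioned_table")
    println(df.count())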

Please help me understand what is causing the executors to be killed:
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called

Not sure why YARN is killing my executors.

Answered by maffe

I faced a similar issue, and investigating the NodeManager logs led me to the root cause. You can access them via the web interface at:

nodeManagerAddress:PORT/logs

The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).

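For reference, the corresponding entry in yarn-site.xml looks roughly like this (the host value here is illustrative):

    <!-- yarn-site.xml: address of the NodeManager web UI -->
    <property>
      <name>yarn.nodemanager.webapp.address</name>
      <value>0.0.0.0:8042</value>
    </property>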

My investigation workflow:

  1. Collect the logs (yarn logs ... command; see the example after this list)
  2. Identify the node and container (in these logs) emitting the error
  3. Search the NodeManager logs by the timestamp of the error for the root cause
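
A hedged example of step 1; the application ID below is a placeholder, and yarn logs requires log aggregation to be enabled:

    # Fetch the aggregated container logs for one application
    # (the application ID is a placeholder).
    yarn logs -applicationId application_1513760000000_0001 > app.log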

By the way, you can access the aggregated collection (XML) of all configuration affecting a node at the same port via:

 nodeManagerAddress:PORT/conf

Answered by JumpMan

I believe this issue has more to do with memory and the dynamic-allocation timeouts at the executor/container level. Make sure you can change the config params at the executor/container level.

One way to resolve this issue is to change this config value, either in your spark-shell session or in your Spark job:

spark.dynamicAllocation.executorIdleTimeout
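
A minimal sketch of raising it when building the session; the 300s value is illustrative, not from the original answer (the default is 60s):

    // Keep idle executors alive longer before dynamic allocation
    // reclaims them; 300s is an illustrative value (default: 60s).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.dynamicAllocation.executorIdleTimeout", "300s")
      .getOrCreate()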

This ticket has more detailed information on how to resolve the issue, and it worked for me: https://jira.apache.org/jira/browse/SPARK-21733