scala - Intermittent Timeout Exception using Spark

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/27039954/
Asked by dirceusemighini
I have a Spark cluster with 10 nodes, and I'm getting this exception after using the Spark context for the first time:
14/11/20 11:15:13 ERROR UserGroupInformation: PriviledgedActionException as:iuberdata (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
... 4 more
This guy had a similar problem, but I've already tried his solution and it didn't work.
The same exception also happens here, but the problem isn't the same, as I'm using Spark version 1.1.0 on both the master and the slaves and in the client.
I've tried increasing the timeout to 120s, but it still doesn't solve the problem.
I'm deploying the environment through scripts, and I'm using context.addJar to include my code in the classpath. The problem is intermittent, and I have no idea how to track down why it is happening. Has anybody who faced this issue when configuring a Spark cluster figured out how to solve it?
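For reference, a minimal sketch of this kind of driver setup, assuming Spark 1.1's Scala API; the master URL, jar path, and timeout values are illustrative placeholders, not the actual deployment scripts:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TimeoutRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("timeout-repro")
      .setMaster("spark://master-host:7077")               // placeholder master URL
      .set("spark.akka.timeout", "120")                     // Akka communication timeout, in seconds
      .set("spark.core.connection.ack.wait.timeout", "120") // connection ack timeout, in seconds

    val sc = new SparkContext(conf)
    sc.addJar("/path/to/my-application.jar")                // ship the application jar, as with context.addJar above

    // The first real use of the context is where the TimeoutException shows up.
    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}
```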
Accepted answer by dirceusemighini
The firewall was misconfigured and, on some instances, it didn't allow the slaves to connect to the cluster. This caused the timeout, as the slaves couldn't reach the master. If you are facing this timeout, check your firewall configuration.
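One way to check this from a slave, before touching Spark at all, is a plain TCP connect to the master's ports; a minimal Scala sketch (the host name and ports are placeholders):

```scala
import java.net.{InetSocketAddress, Socket}

object PortCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def reachable(host: String, port: Int, timeoutMs: Int = 5000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: Exception => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // 7077 is the default standalone master port, 8080 the master web UI.
    // Executor and driver ports are often ephemeral, which is why the firewall
    // usually has to allow all traffic between the cluster nodes.
    Seq(7077, 8080).foreach { port =>
      println(s"master-host:$port reachable = ${reachable("master-host", port)}")
    }
  }
}
```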
Answer by Saket
We had a similar problem which was quite hard to debug and isolate. Long story short - Spark uses Akka, which is very picky about FQDN hostnames resolving to IP addresses. Even if you specify the IP address in all places, it is not enough. The answer here helped us isolate the problem.
A useful test is to run netcat -l <port> on the master and nc -vz <host> <port> on the worker to test connectivity. Run the test with both the IP address and the FQDN. You can get the name Spark is using from the WARN message in the log snippet below. For us it was host032s4.staging.companynameremoved.info. The IP address test passed for us, while the FQDN test failed because our DNS was not set up correctly.
INFO 2015-07-24 10:33:45 Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:35455]
INFO 2015-07-24 10:33:45 Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:35455]
INFO 2015-07-24 10:33:45 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 35455.
WARN 2015-07-24 10:33:45 Remoting: Tried to associate with unreachable remote address [akka.tcp://[email protected]:50855]. Address is now gated for 60000 ms, all messages to this address will be delivered to dead letters.
ERROR 2015-07-24 10:34:15 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:skumar cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
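The resolution check can also be scripted; a small Scala sketch that prints what a given FQDN resolves to on the machine it runs on (the hostname below is only the example taken from the WARN line above, use the name from your own logs):

```scala
import java.net.{InetAddress, UnknownHostException}

object DnsCheck {
  def main(args: Array[String]): Unit = {
    // Replace with the name Spark reports in your own WARN message.
    val fqdn = "host032s4.staging.companynameremoved.info"
    try {
      InetAddress.getAllByName(fqdn).foreach { addr =>
        println(s"$fqdn -> ${addr.getHostAddress}")
      }
    } catch {
      case e: UnknownHostException =>
        println(s"DNS lookup failed for $fqdn: $e")
    }

    // What this machine believes its own name and address are.
    val local = InetAddress.getLocalHost
    println(s"local: ${local.getCanonicalHostName} / ${local.getHostAddress}")
  }
}
```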
Another thing we had to do was to specify the spark.driver.host and spark.driver.port properties in the spark-submit script. This was because we had machines with two IP addresses, and the FQDN resolved to the wrong IP address.
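The same two properties can also be pinned programmatically on the SparkConf instead of (or in addition to) the submit script; a minimal sketch with placeholder values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverHostExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("driver-host-example")
      // Bind the driver to the address/port the workers can actually reach,
      // rather than whatever the (possibly wrong) FQDN resolution picks.
      .set("spark.driver.host", "10.0.0.5") // placeholder: reachable IP of the driver machine
      .set("spark.driver.port", "51000")    // placeholder: a port the firewall allows
    // The master URL is assumed to come from spark-submit; the equivalent command-line
    // form would be --conf spark.driver.host=... --conf spark.driver.port=...

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
```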
Make sure your network and DNS entries are correct!!
Answer by Greg Dubicki
I had a similar problem and managed to get around it by using the cluster deploy mode when submitting the application to Spark.
(Even allowing all incoming traffic to both my master and the single slave didn't let me use the client deploy mode. Before changing them, I had the default security group (AWS firewall) settings set up by the Spark EC2 scripts from Spark 1.2.0.)

