Java: YARN MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22579943/

Yarn MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0

Tags: java, hadoop, mapreduce, yarn

Asked by TonyMull

I have set up a 2-node cluster of Hadoop 2.3.0. It is working fine and I can successfully run the distributedshell-2.2.0.jar example. But when I try to run any MapReduce job I get an error. I have set up mapred-site.xml and the other configs for running MapReduce jobs according to http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide, but I am getting the following error:


14/03/22 20:31:17 INFO mapreduce.Job: Job job_1395502230567_0001 failed with state FAILED due to: Application application_1395502230567_0001 failed 2 times due to AM Container for appattempt_1395502230567_0001_000002 exited 
with  exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
    org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)


    Container exited with a non-zero exit code 1
    .Failing this attempt.. Failing the application.
    14/03/22 20:31:17 INFO mapreduce.Job: Counters: 0
    Job ended: Sat Mar 22 20:31:17 PKT 2014
    The job took 6 seconds.

And if I look at stderr (the job log), there is only one line: "Could not find or load main class 614"


Now I have googled it, and usually this issue comes up when you have different Java versions or the classpath in yarn-site.xml is not set properly. My yarn-site.xml has this:


  <property>
    <name>yarn.application.classpath</name>
    <value>/opt/yarn/hadoop-2.3.0/etc/hadoop,/opt/yarn/hadoop-2.3.0/*,/opt/yarn/hadoop-2.3.0/lib/*,/opt/yarn/hadoop-2.3.0/*,/opt/yarn/hadoop-2.3.0/lib/*,/opt/yarn/hadoop-2.3.0/*,/opt/yarn/hadoop-2.3.0/lib/*,/opt/yarn/hadoop-2.3.0/*,/opt/yarn/hadoop-2.3.0/lib/*</value>
  </property>

So, any other ideas about what the issue could be here?


I am running my MapReduce job like this:


$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar randomwriter out

Answered by TonyMull

I fixed the issue; it was due to incorrect paths. Giving the full directory paths for mapred, hdfs, yarn, and common in the classpath solves the problem.

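For reference, a sketch of what such a "full path" classpath could look like for a tarball install under /opt/yarn/hadoop-2.3.0 (the share/hadoop/... entries are an assumption based on the standard 2.3.0 layout; keep only the directories that actually exist on your nodes):

  <!-- sketch only: assumes a standard tarball layout under /opt/yarn/hadoop-2.3.0 -->
  <property>
    <name>yarn.application.classpath</name>
    <value>/opt/yarn/hadoop-2.3.0/etc/hadoop,/opt/yarn/hadoop-2.3.0/share/hadoop/common/*,/opt/yarn/hadoop-2.3.0/share/hadoop/common/lib/*,/opt/yarn/hadoop-2.3.0/share/hadoop/hdfs/*,/opt/yarn/hadoop-2.3.0/share/hadoop/hdfs/lib/*,/opt/yarn/hadoop-2.3.0/share/hadoop/yarn/*,/opt/yarn/hadoop-2.3.0/share/hadoop/yarn/lib/*,/opt/yarn/hadoop-2.3.0/share/hadoop/mapreduce/*,/opt/yarn/hadoop-2.3.0/share/hadoop/mapreduce/lib/*</value>
  </property>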

Thanks, Tony


Answered by Doug Judd

I encountered the same problem when trying to install Hortonworks HDP 2.1 manually. I managed to capture the container launcher script which contained the following:


#!/bin/bash

export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/data/1/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001,/data/2/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001,/data/3/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001,/data/4/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001"
export JAVA_HOME="/usr/java/latest"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"
export HADOOP_TOKEN_FILE_LOCATION="/data/2/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001/container_1406927878786_0001_01_000001/container_tokens"
export NM_HOST="test02.admin.hypertable.com"
export APPLICATION_WEB_PROXY_BASE="/proxy/application_1406927878786_0001"
export JVM_PID="$$"
export USER="doug"
export HADOOP_HDFS_HOME="/usr/lib/hadoop-hdfs"
export PWD="/data/2/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001/container_1406927878786_0001_01_000001"
export CONTAINER_ID="container_1406927878786_0001_01_000001"
export HOME="/home/"
export NM_PORT="62404"
export LOGNAME="doug"
export APP_SUBMIT_TIME_ENV="1406928095871"
export MAX_APP_ATTEMPTS="2"
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/data/1/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001,/data/2/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001,/data/3/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001,/data/4/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001"
ln -sf "/data/1/hadoop/yarn/local/usercache/doug/filecache/10/libthrift-0.9.2.jar" "libthrift-0.9.2.jar"
ln -sf "/data/4/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001/filecache/13/job.xml" "job.xml"
mkdir -p jobSubmitDir
ln -sf "/data/3/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001/filecache/12/job.split" "jobSubmitDir/job.split"
mkdir -p jobSubmitDir
ln -sf "/data/2/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001/filecache/11/job.splitmetainfo" "jobSubmitDir/job.splitmetainfo"
ln -sf "/data/1/hadoop/yarn/local/usercache/doug/appcache/application_1406927878786_0001/filecache/10/job.jar" "job.jar"
ln -sf "/data/2/hadoop/yarn/local/usercache/doug/filecache/11/hypertable-0.9.8.0-apache2.jar" "hypertable-0.9.8.0-apache2.jar"
exec /bin/bash -c "$JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/data/4/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA  -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/data/4/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001/stdout 2>/data/4/hadoop/yarn/logs/application_1406927878786_0001/container_1406927878786_0001_01_000001/stderr "

The line that sets CLASSPATH was the culprit. To resolve the problem I had to set the variables HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_YARN_HOME, and HADOOP_MAPRED_HOME in hadoop-env.sh to point to the appropriate directories under /usr/lib. In each of those directories I also had to set up the share/hadoop/... subdirectory hierarchy where the jars could be found.

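For illustration, a minimal sketch of those hadoop-env.sh exports, assuming the /usr/lib/hadoop-* package locations used in this answer (they will differ for other layouts):

# assumed package locations from this answer; adjust to your install
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce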

Answered by akshat thakar

Please check the yarn.application.classpath property below. Ensure all required jars are present.


yarn.application.classpath: /etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*

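In yarn-site.xml form that would look roughly like this (a sketch; trim the value to the directories that actually exist on your nodes):

  <property>
    <name>yarn.application.classpath</name>
    <value>/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*</value>
  </property>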

Answered by Harit Singh

Please check the logs first (they will be in the user directory under Hadoop's logs directory).


Also check the permissions of all directories you mentioned in the yarn-site, hdfs-site, and core-site XML files, because in most cases this error is caused by wrong permissions.

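As a sketch, assuming the default tarball layout where NodeManager container logs land under $HADOOP_PREFIX/logs/userlogs, you could inspect the failed attempt and the configured directories like this (the /path/to/... entries are placeholders; substitute the directories from your own XML files):

# container stderr for the failed application (path assumed from the default userlogs layout)
cat $HADOOP_PREFIX/logs/userlogs/application_1395502230567_0001/container_*/stderr

# ownership and permissions of the directories configured in core-site.xml, hdfs-site.xml, and yarn-site.xml
ls -ld /path/to/hadoop.tmp.dir /path/to/dfs.namenode.name.dir /path/to/yarn.nodemanager.local-dirs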

Answered by iceberg

Maybe you can run the HistoryServer with the following command under $HADOOP_HOME/bin:


./mr-jobhistory-daemon.sh start historyserver

And then you can check the Hadoop error logs from this URL (History Log):


http://<Resource Manager Host name adress>:8088/cluster

And most probably you will get a "class not found" exception.


Answered by li long'en

I also encountered this issue on Ambari 2.0 + HDP 2.3 + HUE 3.9. My fix experience is:

1. Make sure the Spark client exists on all Hadoop YARN nodes.
2. Export SPARK_HOME on all YARN nodes (Spark clients) and on the Hue host, as sketched below.

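For example, something like this on each YARN node and on the Hue host (the path is an assumption; point it at wherever the Spark client is actually installed):

# assumed HDP-style location; adjust to your Spark client install
export SPARK_HOME=/usr/hdp/current/spark-client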

Answered by Nimmagadda

The permissions should be 6050, owner root, group hadoop:


---Sr-s--- 1 root hadoop /usr/lib/hadoop-yarn/bin/container-executor

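A sketch of the commands that apply that ownership and mode, assuming the binary lives at the path shown above:

# setuid/setgid binary owned by root, group hadoop
chown root:hadoop /usr/lib/hadoop-yarn/bin/container-executor
chmod 6050 /usr/lib/hadoop-yarn/bin/container-executor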

Answered by Nader Khalid

You will need to delay log removal by setting yarn.nodemanager.delete.debug-delay-sec to 600.


This will allow you to browse the stderr, stdout, and syslog under /hadoop/yarn/log in the relevant container directory.


Most likely, you will find the error in syslog. And, most likely, it will be a ClassNotFoundException for the class tez.history.logging.service.class=org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService.


If that is the case, then refer to the following ticket:


https://issues.apache.org/jira/browse/AMBARI-15041


Answered by Igorock

Check the swap size on your system with free -m. If it shows Swap: 0 0 0, allocate swap memory following these instructions.


Answered by Laura

In my case the problem was due to insufficient memory. I inserted the snippet below into yarn-site.xml, as adino suggested in his comment above:


  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>600</value>
  </property>

After that I could see an error in the stderr log file. I don't remember the exact wording (the log file got deleted after a while), but it was along the lines of an "out of memory" error.


I edited my virtual machine to add another swap partition of 3 gigabytes (probably total overkill). I did this with GParted.


Afterwards I had to register the new swap partition by typing


mkswap /dev/sda6   # /dev/sda6 is the partition name
swapon /dev/sda6

I found the UUID of the new swap partition by typing "blkid" and copying the UUID.


I registered the swap partition in /etc/fstab:


sudo vi /etc/fstab

I added a new line for the new swap partition. I copied the whole line from the previous swap partition and just changed the UUID.


UUID=2d29cddd-e721-4a7b-95c0-7ce52734d8a3 none  swap    sw      0       0

After this, the error disappeared. I'm sure there are more elegant ways to solve this, but it worked for me. I'm pretty new to dealing with Linux.
