Python 使用 hadoop 流和 mrjob 运行作业:PipeMapRed.waitOutputThreads(): subprocess failed with code 1

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/17037300/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 00:16:15  来源:igfitidea点击:

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

pythonhadoopmapreducehadoop-streamingmrjob

提问by Kiran Karanth

Hey I'm fairly new to the world of Big Data. I came across this tutorial on http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/

嘿,我对大数据世界还很陌生。我在http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/上看到了这个教程

It describes in detail of how to run MapReduce job using mrjob both locally and on Elastic Map Reduce.

它详细描述了如何在本地和 Elastic Map Reduce 上使用 mrjob 运行 MapReduce 作业。

Well I'm trying to run this on my own Hadoop cluser. I ran the job using the following command.

好吧,我正在尝试在我自己的 Hadoop 集群上运行它。我使用以下命令运行了该作业。

python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic

And this is what I get:

这就是我得到的:

HADOOP: Running job: job_1369345811890_0245
HADOOP: Job job_1369345811890_0245 running in uber mode : false
HADOOP:  map 0% reduce 0%
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_0, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP:         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP:         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP:         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP:         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:157)
HADOOP:         at java.security.AccessController.doPrivileged(Native Method)
HADOOP:         at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP:         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_0, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP:         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP:         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP:         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP:         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:157)
HADOOP:         at java.security.AccessController.doPrivileged(Native Method)
HADOOP:         at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP:         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_1, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP:         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP:         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP:         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP:         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:157)
HADOOP:         at java.security.AccessController.doPrivileged(Native Method)
HADOOP:         at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP:         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Container killed by the ApplicationMaster.
HADOOP:
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_1, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP:         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP:         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP:         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP:         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:157)
HADOOP:         at java.security.AccessController.doPrivileged(Native Method)
HADOOP:         at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP:         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_2, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP:         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP:         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP:         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP:         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:157)
HADOOP:         at java.security.AccessController.doPrivileged(Native Method)
HADOOP:         at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP:         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_2, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP:         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP:         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP:         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP:         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP:         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:157)
HADOOP:         at java.security.AccessController.doPrivileged(Native Method)
HADOOP:         at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP:         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP:         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP:  map 100% reduce 0%
HADOOP: Job job_1369345811890_0245 failed with state FAILED due to: Task failed task_1369345811890_0245_m_000001
HADOOP: Job failed as tasks failed. failedMaps:1 failedReduces:0
HADOOP:
HADOOP: Counters: 6
HADOOP:         Job Counters
HADOOP:                 Failed map tasks=7
HADOOP:                 Launched map tasks=8
HADOOP:                 Other local map tasks=6
HADOOP:                 Data-local map tasks=2
HADOOP:                 Total time spent by all maps in occupied slots (ms)=32379
HADOOP:                 Total time spent by all reduces in occupied slots (ms)=0
HADOOP: Job not Successful!
HADOOP: Streaming Command Failed!
STDOUT: packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.1.jar] /tmp/streamjob3272348678857116023.jar tmpDir=null
Traceback (most recent call last):
  File "density.py", line 34, in <module>
    MRDensity.run()
  File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/job.py", line 344, in run
    mr_job.run_job()
  File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/job.py", line 381, in run_job
    runner.run()
  File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/runner.py", line 316, in run
    self._run()
  File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/hadoop.py", line 175, in _run
    self._run_job_in_hadoop()
  File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/hadoop.py", line 325, in _run_job_in_hadoop
    raise CalledProcessError(step_proc.returncode, streaming_args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'jar', '/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.1.jar', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/input', '-output', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/output', '-cacheFile', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/files/density.py#density.py', '-cacheArchive', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/files/mrjob.tar.gz#mrjob.tar.gz', '-mapper', 'python density.py --step-num=0 --mapper --protocol json --output-protocol json --input-protocol raw_value', '-jobconf', 'mapred.reduce.tasks=0']' returned non-zero exit status 1

Note: As suggested in some other forums I've included

注意:正如我在其他一些论坛中所建议的那样

#! /usr/bin/python

at the beginning of both my python files density.py and track.py. It seems to have worked for most people but I still continue getting the above exceprions.

在我的 python 文件 density.py 和 track.py 的开头。它似乎对大多数人都有效,但我仍然继续得到上述异常。

Edit: I included the definition of one of the functions being used in the original density.py which was definied in another file track.py in density.py itself. The job ran succesfully. But it would really be helpful if someone knows why this is happening.

编辑:我包括了在原始 density.py 中使用的一个函数的定义,该函数是在另一个文件 track.py 中定义的。作业成功运行。但是,如果有人知道为什么会发生这种情况,那真的会很有帮助。

采纳答案by Andy Botelho

Error code 1 is a generic error for Hadoop Streaming. You can get this error code for two main reasons:

错误代码 1 是 Hadoop Streaming 的一般错误。您可以获得此错误代码的主要原因有两个:

  • Your Mapper and Reducer scripts are not executable (include the #!/usr/bin/python at the beginning of the script).

  • Your Python program is simply written wrong - you could have a syntax error or logical bug.

  • 您的 Mapper 和 Reducer 脚本不可执行(包括脚本开头的 #!/usr/bin/python)。

  • 你的 Python 程序写错了——你可能有语法错误或逻辑错误。

Unfortunately, error code 1 does not give you any details to see exactly what is wrong with your Python program.

不幸的是,错误代码 1 没有为您提供任何详细信息,以了解您的 Python 程序到底出了什么问题。

I was stuck with error code 1 for a while myself, and the way I figured it out was to simply run my Mapper script as a standalone python program: python mapper.py

我自己被错误代码 1 困住了一段时间,我想出来的方法是简单地将我的 Mapper 脚本作为独立的 Python 程序运行: python mapper.py

After doing this, I got a regular Python error that told me I was simply giving a function the wrong type of argument. I fixed my syntax error, and everything worked after that. So if possible, I'd run your Mapper or Reducer script as a standalone Python program to see if that gives you any insight on the reasoning for your error.

这样做之后,我收到了一个常规的 Python 错误,告诉我我只是给了一个函数错误类型的参数。我修复了语法错误,之后一切正常。因此,如果可能,我会将您的 Mapper 或 Reducer 脚本作为独立的 Python 程序运行,以查看是否可以让您深入了解错误的原因。

回答by JumpMan

I got the same error, sub-process failed with code 1

我得到了同样的错误, sub-process failed with code 1

[cloudera@quickstart ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py
  1. This is primarily because of a hadoop unable to access your input files, or may be you have something in your input which is more than required, or something missing. So, be very very careful with the input directory and files you have in them. I would say, only place exactly required input files in the input directory for the assignment and remove rest of them.

  2. Also make sure your mapper and reducer files are executable. chmod +x mapper.pyand chmod +x reducer.py

  3. Run the mapper of reducer python file using catusing only mapper: cat join2_gen*.txt | ./mapper.py | sortusing reducer: cat join2_gen*.txt | ./mapper.py | sort | ./reducer.pyThe reason for running them using cat is because If your input files have any error you can remove them before you run on Hadoop cluster. Sometimes map/reduce jobs cannot find the python errors!!

  1. 这主要是因为 hadoop 无法访问您的输入文件,或者您的输入中可能有超出要求的内容,或者缺少某些内容。因此,对输入目录和其中的文件要非常小心。我会说,只将完全需要的输入文件放在分配的输入目录中,然后删除其余的文件。

  2. 还要确保您的映射器和化简器文件是可执行的。 chmod +x mapper.pychmod +x reducer.py

  3. cat只使用mapper运行reducer python文件的mapper: cat join2_gen*.txt | ./mapper.py | sortusing reducer: cat join2_gen*.txt | ./mapper.py | sort | ./reducer.py使用cat运行它们的原因是因为如果你的输入文件有任何错误,你可以在运行Hadoop集群之前删除它们。有时 map/reduce 作业找不到 python 错误!!

回答by khsaf

I faced the same problem when running , my mapper and reducer scripts were not executable.

我在运行时遇到了同样的问题,我的 mapper 和 reducer 脚本无法执行。

Adding #! /usr/bin/pythonat the top of my files fixed the issue.

#! /usr/bin/python在我的文件顶部添加解决了这个问题。

回答by linayou

Another reason, such as you have an error in your shell script to run the mapper.pyand reducer.py. Here is my suggestions:

另一个原因,例如您在 shell 脚本中运行mapper.pyreducer.py. 这是我的建议:

Firstly you should try to run you mapper.pyand reducer.pyin the local environment.

首先,你应该尝试运行您mapper.pyreducer.py当地环境。

Next you could try to track your mapreduce job on your url printed in the stdout log, like this:16:01:56 INFO mapreduce.Job: The url to track the job: http://xxxxxx:8088/proxy/application_xxx/" which has detailed error information. Hope this help!

接下来,您可以尝试在 stdout 日志中打印的 url 上跟踪您的 mapreduce 作业,如下所示:16:01:56 INFO mapreduce.Job:跟踪作业的 url:http://xxxxxx:8088/proxy/application_xxx/" 里面有详细的错误信息,希望能帮到你!