Hadoop Streaming Job failed error in python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4460522/



Tags: python, hadoop, mapreduce

Asked by db42

From this guide, I have successfully run the sample exercise. But on running my mapreduce job, I am getting the following error:

ERROR streaming.StreamJob: Job not Successful!
10/12/16 17:13:38 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Error from the log file:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Mapper.py


import sys

i=0

for line in sys.stdin:
    i+=1
    count={}
    for word in line.strip().split():
        count[word]=count.get(word,0)+1
    for word,weight in count.items():
        print '%s\t%s:%s' % (word,str(i),str(weight))

Reducer.py


import sys

keymap={}
o_tweet="2323"
id_list=[]
for line in sys.stdin:
    tweet,tw=line.strip().split()
    #print tweet,o_tweet,tweet_id,id_list
    tweet_id,w=tw.split(':')
    w=int(w)
    if tweet.__eq__(o_tweet):
        for i,wt in id_list:
            print '%s:%s\t%s' % (tweet_id,i,str(w+wt))
        id_list.append((tweet_id,w))
    else:
        id_list=[(tweet_id,w)]
        o_tweet=tweet

[edit] command to run the job:


hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output

Input is any random sequence of sentences.


Thanks,


Accepted answer by Joe Stein

Your -mapper and -reducer should just be the script name.


hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output

When launched this way, your scripts are shipped into the job's working directory within HDFS, and the task attempt executes with that directory as ".". (FYI: if you ever want to add another -file, such as a lookup table, you can open it in Python as if it were in the same directory as your scripts while your script is running in the M/R job.)
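
For example, here is a minimal sketch (mine, not part of the original answer; the stopwords.txt name is hypothetical, shipped with an extra -file flag) of a mapper reading such a lookup table with a plain relative open:

import sys

# Files shipped with -file land in the task's working directory,
# so a plain relative path resolves inside the job.
stopwords = set(line.strip() for line in open('stopwords.txt'))

for line in sys.stdin:
    for word in line.strip().split():
        if word not in stopwords:
            print '%s\t1' % word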

Also make sure you have run chmod a+x mapper.py and chmod a+x reducer.py.

Answered by Marvin W

Try adding

 #!/usr/bin/env python

to the top of your script.

Or, invoke the interpreter explicitly when launching the job:

-mapper 'python m.py' -reducer 'python r.py'

Answered by Dolan Antenucci

I ran into this error recently, and my problem turned out to be something as obvious (in hindsight) as these other solutions:


I simply had a bug in my Python code. (In my case, I was using Python v2.7 string formatting whereas the AWS EMR cluster I had was using Python v2.6).


To find the actual Python error, go to the Job Tracker web UI (in the case of AWS EMR, port 9100 for AMI 2.x and port 9026 for AMI 3.x); find the failed mapper; open its logs; and read the stderr output.

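One way to make that stderr output useful (a sketch of mine, not from the original answer): have the script print its own traceback before exiting, so the real Python error appears in the failed task's stderr log.

import sys
import traceback

def main():
    for line in sys.stdin:
        # ... actual mapper logic goes here ...
        print line.strip()

try:
    main()
except Exception:
    # The traceback lands in the task attempt's stderr log,
    # readable from the Job Tracker web UI.
    traceback.print_exc(file=sys.stderr)
    sys.exit(1)  # a non-zero exit marks the subprocess as failed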

Answered by user6454733

Make sure your input directory only contains the correct files.

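For example, you can list it first to spot any stray files (assuming the my-input directory from the question):

bin/hadoop fs -ls my-input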

Answered by yunus

I had the same problem. I tried Marvin W's solution and also installed Spark. Make sure that you have installed Spark itself, not just pyspark (the dependency), and follow the framework installation tutorial.

Answered by Gopal Kumar

You need to explicitly instruct Hadoop that the mapper and reducer are to be run as Python scripts, since streaming supports several options. You can use either single quotes or double quotes.

-mapper "python mapper.py" -reducer "python reducer.py" 

or


-mapper 'python mapper.py' -reducer 'python reducer.py'

The full command goes like this:


hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py
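
Before submitting, you can also simulate the streaming pipeline locally with a plain Unix pipe (the input path below is a placeholder); this catches most script errors without involving Hadoop at all:

cat /path/to/local-input.txt | python mapper.py | sort | python reducer.py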