Oozie shell script action

Note: This page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/22391274/

Tags: bash, hadoop, hive, oozie

Asked by thedragonwarrior

I am exploring the capabilities of Oozie for managing Hadoop workflows. I am trying to set up a shell action which invokes some hive commands. My shell script hive.sh looks like:

#!/bin/bash
hive -f hivescript

where the hive script (which has been tested independently) creates some tables and so on. My question is where to keep the hivescript, and then how to reference it from the shell script.

I've tried two ways: first using a local path, like hive -f /local/path/to/file, and second using a relative path as above, hive -f hivescript, in which case I keep my hivescript in the oozie app path directory (alongside hive.sh and workflow.xml) and send it to the distributed cache via workflow.xml.

With both methods I get the error message "Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]" on the oozie web console. Additionally, I've tried using hdfs paths in shell scripts and this does not work as far as I know.

My job.properties file:

nameNode=hdfs://sandbox:8020
jobTracker=hdfs://sandbox:50300
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozieProjectRoot=${nameNode}/user/sandbox/poc1
appPath=${oozieProjectRoot}/testwf
oozie.wf.application.path=${appPath}

And workflow.xml:

<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <exec>${appPath}/hive.sh</exec>
        <file>${appPath}/hive.sh</file>
        <file>${appPath}/hive_pill</file>
    </shell>
    <ok to="end"/>
    <error to="end"/>
</action>

<end name="end"/>
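
For context, I stage the application directory in HDFS and launch the job with the standard Oozie CLI (a sketch, assuming the Oozie server listens on its default port 11000 on my sandbox):

# Stage the app directory (workflow.xml, hive.sh, hivescript) in HDFS,
# then submit and start the workflow with the properties file above.
hdfs dfs -put -f testwf /user/sandbox/poc1/
oozie job -oozie http://sandbox:11000/oozie -config job.properties -run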

My objective is to use oozie to call a hive script through a shell script. Please give your suggestions.

Answered by Ryan Bedard

One thing that has always been tricky about Oozie workflows is the execution of bash scripts. Hadoop is built to be massively parallel, so the architecture behaves very differently than you might expect.

When an oozie workflow executes a shell action, it will receive resources from your job tracker or YARN on any of the nodes in your cluster. This means that using a local path for your file will not work, since local storage exists only on your edge node. If the job happens to spawn on your edge node it will work, but any other time it will fail, and the distribution is random.
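
You can see this behavior for yourself with a throwaway shell action. A quick sketch (the script name is just an example):

#!/bin/bash
# where-am-i.sh - debug action: print which cluster node picked up
# the shell action and what landed in its working directory
hostname
pwd
ls -l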

To get around this, I found it best to have the files I needed (including the sh scripts) in hdfs in either a lib space or the same location as my workflow.
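
Staging those files is a one-time step. A sketch, assuming the lib and workflow paths used in the example below:

# Copy the reusable launcher to a shared lib directory in HDFS,
# and the workflow-specific hive script next to workflow.xml.
hdfs dfs -mkdir -p /user/lib
hdfs dfs -put -f hive.sh /user/lib/hive.sh
hdfs dfs -put -f ETL_file1.hql /user/directory/ETL_file1.hql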

Here is a good way to approach what you are trying to achieve.

<shell xmlns="uri:oozie:shell-action:0.1">
    <exec>hive.sh</exec>
    <file>/user/lib/hive.sh#hive.sh</file>
    <file>ETL_file1.hql#hivescript</file>
</shell>

One thing you will notice is that the exec is just hive.sh, since we assume the file will be moved to the base directory where the shell action is executed.

To make sure that last note holds, you must include the file's hdfs path; this forces oozie to distribute that file with the action. In your case, the hive script launcher should only be coded once and simply fed different files. Since we have a one-to-many relationship, the hive.sh should be kept in a lib and not distributed with every workflow.
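
A minimal sketch of such a reusable launcher, assuming each workflow ships its own .hql file under the symlink name hivescript (via the <file> element shown above):

#!/bin/bash
# hive.sh - generic launcher kept once in /user/lib; each workflow
# distributes its own .hql file under the symlink name "hivescript"
set -euo pipefail
if [[ ! -e hivescript ]]; then
    echo "hivescript symlink not found in $(pwd)" >&2
    exit 1
fi
hive -f hivescript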

Lastly you see the line:

<file>ETL_file1.hql#hivescript</file>

This line does two things. Before the #, we have the location of the file. Here it is just the file name, since we distribute our distinct hive files alongside our workflows:

user/directory/workflow.xml
user/directory/ETL_file1.hql

and the node running the sh will have this distributed to it automagically. Lastly, the part after the # is the name we assign to it inside the sh script. This gives you the ability to reuse the same script over and over and simply feed it different files.

A note on HDFS directories:

If the file is nested inside the same directory as the workflow, then you only need to specify child paths:

user/directory/workflow.xml
user/directory/hive/ETL_file1.hql

Would yield:

<file>hive/ETL_file1.hql#hivescript</file>

But if the path is outside of the workflow directory, you will need the full path:

user/directory/workflow.xml
user/lib/hive.sh

would yield:

<file>/user/lib/hive.sh#hive.sh</file>

I hope this helps everyone.

Answered by user2230605

From http://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html#Shell_Action_Schema_Version_0.2:

If you keep both your shell script and your hive script in a folder inside the workflow directory, then you can execute them.

See the command in the sample:

<exec>${EXEC}</exec>
<argument>A</argument>
<argument>B</argument>
<file>${EXEC}#${EXEC}</file> <!-- Copy the executable to the compute node's current working directory -->

You can write whatever commands you want in the file.
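
For example, a minimal sketch of the script referenced by ${EXEC} (the contents are just an illustration); the <argument> elements arrive as positional parameters:

#!/bin/bash
# Sketch: the two <argument> elements above come in as $1 and $2.
echo "got arguments: $1 $2"    # prints: got arguments: A B
hive -e "show tables"          # any shell or hive commands can go here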

You can also use the hive action directly:

http://oozie.apache.org/docs/3.3.0/DG_HiveActionExtension.html

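For completeness, a minimal sketch of such a hive action, reusing the variables from the job.properties shown earlier (the action name and schema version are placeholders):

<action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>ETL_file1.hql</script>
    </hive>
    <ok to="end"/>
    <error to="end"/>
</action>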