How to bootstrap installation of Python modules on Amazon EMR?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, cite the original URL, and attribute it to the original authors (not me): StackOverflow.

Original URL: http://stackoverflow.com/questions/31525012/
Asked by Evan Zamir
I want to do something really basic, simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
Accepted answer by noli
The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script.
Here's an example I'm using in production:
s3://mybucket/bootstrap/install_python_modules.sh
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip install -U \
awscli \
boto \
ciso8601 \
ujson \
workalendar
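# psycopg2 comes from yum rather than pip so its compiled PostgreSQL
# bindings arrive prebuilt (no postgresql-devel/gcc needed on the node):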
sudo yum install -y python-psycopg2
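To wire this up end to end from the AWS CLI instead of the console (a hedged sketch; the bucket path, cluster sizing, and release label below are placeholders), first upload the script to S3:

# 'mybucket' is a placeholder bucket name
aws s3 cp install_python_modules.sh s3://mybucket/bootstrap/install_python_modules.sh

then point the cluster's bootstrap action at it when launching:

# Sketch only: adjust instance sizing and release label to your needs
aws emr create-cluster \
    --name "SparkCluster" \
    --release-label emr-4.0.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://mybucket/bootstrap/install_python_modules.sh,Name=InstallPythonModules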
Answered by Craig F
In short, there are two ways to install packages with pip, depending on the platform. First, you install whatever you need and then you can run your Spark step. The easiest is to use emr-4.0.0 and 'command-runner.jar':
from boto.emr.step import JarStep

# Step 1: install the required Python package via command-runner.jar
pip_step = JarStep(name="Command Runner",
                   jar="command-runner.jar",
                   action_on_failure="CONTINUE",
                   step_args=['sudo', 'pip', 'install', 'arrow'])

# Step 2: run the Spark job that depends on the package
spark_step = JarStep(name="Spark with Command Runner",
                     jar="command-runner.jar",
                     action_on_failure="CONTINUE",
                     step_args=["spark-submit",
                                "/usr/lib/spark/examples/src/main/python/pi.py"])

# conn is an existing boto EMR connection; jobflow_id identifies the cluster
step_list = conn.add_jobflow_steps(jobflow_id, [pip_step, spark_step])
On AMI versions 2.x and 3.x, you use script-runner.jar in a similar fashion, except that you have to specify the full S3 URI for script-runner.jar.
EDIT: Sorry, I didn't see that you wanted to do this through the console. You can add the same steps in the console as well. The first step would be a Custom JAR with the same args as above. The second step is a Spark step. Hope this helps!
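For reference, a hedged CLI sketch of the same pair of steps (the cluster id is a placeholder):

# Sketch: add both steps to a running cluster; j-XXXXXXXXXXXXX is a placeholder
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
    'Type=CUSTOM_JAR,Name=InstallArrow,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[sudo,pip,install,arrow]' \
    'Type=CUSTOM_JAR,Name=SparkPi,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[spark-submit,/usr/lib/spark/examples/src/main/python/pi.py]'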
Answered by Jonathan Taws
Depending on whether you are using Python 2 (the default in EMR) or Python 3, the pip install command differs. As recommended in noli's answer, you should create a shell script, upload it to a bucket in S3, and use it as a Bootstrap action.
For Python 2 (in Jupyter: used as the default for the pyspark kernel):
#!/bin/bash -xe
sudo pip install your_package
For Python 3 (in Jupyter: used as the default for the Python 3 and pyspark3 kernels):
#!/bin/bash -xe
sudo pip-3.4 install your_package
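If you would rather not hard-code the interpreter version, a hedged variant of the bootstrap script (your_package remains a placeholder) can probe for whichever pip binaries the image actually ships:

#!/bin/bash -xe
# Sketch: install the package for every pip variant present on the node.
# pip-3.4 matches the Python 3 build on older EMR releases; adjust as needed.
for PIP in pip pip-3.4; do
    if command -v "$PIP" > /dev/null 2>&1; then
        sudo "$PIP" install your_package
    fi
done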