How to bootstrap installation of Python modules on Amazon EMR?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, cite the original URL, and attribute it to the original authors (not me): StackOverflow.

Original URL: http://stackoverflow.com/questions/31525012/
Asked by Evan Zamir
I want to do something really basic, simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
Accepted answer by noli
The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script.
Here's an example I'm using in production:
s3://mybucket/bootstrap/install_python_modules.sh
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip install -U \
awscli \
boto \
ciso8601 \
ujson \
workalendar
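# psycopg2 comes from yum rather than pip so its compiled PostgreSQL
# bindings arrive prebuilt (no postgresql-devel/gcc needed on the node):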
sudo yum install -y python-psycopg2
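To wire this up end to end from the AWS CLI instead of the console (a hedged sketch; the bucket path, cluster sizing, and release label below are placeholders), first upload the script to S3:

# 'mybucket' is a placeholder bucket name
aws s3 cp install_python_modules.sh s3://mybucket/bootstrap/install_python_modules.sh

then point the cluster's bootstrap action at it when launching:

# Sketch only: adjust instance sizing and release label to your needs
aws emr create-cluster \
    --name "SparkCluster" \
    --release-label emr-4.0.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://mybucket/bootstrap/install_python_modules.sh,Name=InstallPythonModules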
Answered by Craig F
In short, there are two ways to install packages with pip, depending on the platform. First, you install whatever you need and then you can run your Spark step. The easiest is to use emr-4.0.0 and 'command-runner.jar':
from boto.emr.step import JarStep

# Step 1: install the required Python package via command-runner.jar
pip_step = JarStep(name="Command Runner",
                   jar="command-runner.jar",
                   action_on_failure="CONTINUE",
                   step_args=['sudo', 'pip', 'install', 'arrow'])

# Step 2: run the Spark job that depends on the package
spark_step = JarStep(name="Spark with Command Runner",
                     jar="command-runner.jar",
                     action_on_failure="CONTINUE",
                     step_args=["spark-submit",
                                "/usr/lib/spark/examples/src/main/python/pi.py"])

# conn is an existing boto EMR connection; jobflow_id identifies the cluster
step_list = conn.add_jobflow_steps(jobflow_id, [pip_step, spark_step])
On AMI versions 2.x and 3.x, you use script-runner.jar in a similar fashion, except that you have to specify the full S3 URI for script-runner.jar.
EDIT: Sorry, I didn't see that you wanted to do this through the console. You can add the same steps in the console as well. The first step would be a Custom JAR with the same args as above. The second step is a Spark step. Hope this helps!
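For reference, a hedged CLI sketch of the same pair of steps (the cluster id is a placeholder):

# Sketch: add both steps to a running cluster; j-XXXXXXXXXXXXX is a placeholder
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
    'Type=CUSTOM_JAR,Name=InstallArrow,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[sudo,pip,install,arrow]' \
    'Type=CUSTOM_JAR,Name=SparkPi,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[spark-submit,/usr/lib/spark/examples/src/main/python/pi.py]'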
Answered by Jonathan Taws
Depending on whether you are using Python 2 (the default in EMR) or Python 3, the pip install command differs. As recommended in noli's answer, you should create a shell script, upload it to a bucket in S3, and use it as a Bootstrap action.
For Python 2 (in Jupyter: used as the default for the pyspark kernel):
#!/bin/bash -xe
sudo pip install your_package
For Python 3 (in Jupyter: used as the default for the Python 3 and pyspark3 kernels):
#!/bin/bash -xe
sudo pip-3.4 install your_package
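If you would rather not hard-code the interpreter version, a hedged variant of the bootstrap script (your_package remains a placeholder) can probe for whichever pip binaries the image actually ships:

#!/bin/bash -xe
# Sketch: install the package for every pip variant present on the node.
# pip-3.4 matches the Python 3 build on older EMR releases; adjust as needed.
for PIP in pip pip-3.4; do
    if command -v "$PIP" > /dev/null 2>&1; then
        sudo "$PIP" install your_package
    fi
done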