I can't seem to get --py-files on Spark to work
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/36461054/
Asked by Andrej Palicka
I'm having a problem with using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all the dependencies, since the nodes on the cluster do not have any common mountpoint or filesystem besides HDFS. Therefore I am stuck with using spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with the --py-files=dependencies.zip option (as suggested in Easiest way to install Python dependencies on Spark executor nodes?). However, the nodes on the cluster still do not seem to see the modules inside, and they throw an ImportError such as this when importing numpy:
File "/path/anonymized/module.py", line 6, in <module>
import numpy
File "/tmp/pip-build-4fjFLQ/numpy/numpy/__init__.py", line 180, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/add_newdocs.py", line 13, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/__init__.py", line 8, in <module>
#
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/type_check.py", line 11, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/__init__.py", line 14, in <module>
ImportError: cannot import name multiarray
When I switch to the virtualenv and use the local pyspark shell, everything works fine, so the dependencies are all there. Does anyone know what might cause this problem and how to fix it?
Thanks!
Answered by ramhiser
First off, I'll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:
pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .
Above, the cd dependencies command is crucial to ensure that the modules are in the top level of the zip file. Thanks to Dan Corin's post for the heads-up.
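As a quick check (a sketch, not part of the original answer), the archive layout can be inspected from Python; package directories such as numpy should appear as top-level entries rather than being nested under a dependencies/ prefix:
import zipfile

# Print the unique first path component of every entry in the archive.
# Expect package names like 'numpy' or 'pandas' here, not 'dependencies'.
with zipfile.ZipFile("dependencies.zip") as zf:
    top_level = sorted({name.split("/")[0] for name in zf.namelist()})
    print(top_level)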
Next, submit the job via:
spark-submit --py-files dependencies.zip spark_job.py
The --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH (a source of confusion for me). To add the dependencies to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, spark_job.py:
sc.addPyFile("dependencies.zip")
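For context, here is a minimal sketch of what spark_job.py could look like with that line in place. The app name and the mymodule import are hypothetical; the key point is that sc.addPyFile runs before anything inside the zip is imported. As the caveat below notes, compiled packages such as numpy can still be fragile when shipped this way.
from pyspark import SparkContext

sc = SparkContext(appName="dependencies-demo")  # hypothetical app name

# Ship the zipped dependencies and put them on the PYTHONPATH of the
# driver and the executors before importing anything packaged inside.
sc.addPyFile("dependencies.zip")

import mymodule  # hypothetical pure-Python module inside dependencies.zip

result = sc.parallelize(range(10)).map(mymodule.transform).collect()
print(result)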
A caveat from this Cloudera post:
An assumption that anyone doing distributed computing with commodity hardware must assume is that the underlying hardware is potentially heterogeneous. A Python egg built on a client machine will be specific to the client's CPU architecture because of the required C compilation. Distributing an egg for a complex, compiled package like NumPy, SciPy, or pandas is a brittle solution that is likely to fail on most clusters, at least eventually.
Although the solution above does not build an egg, the same guideline applies.
Answered by avrsanjay
First, you need to pass your files through --py-files or --files:
- When you pass your zip/files with the above flags, your resources are transferred to a temporary directory created on HDFS just for the lifetime of that application.
Now in your code, add those zip/files by using the following command:
sc.addPyFile("your zip/file")
- What the above does is load the files into the execution environment, like the JVM.
Now import your zip/file in your code with an alias like the following to start referencing it:
import zip/file as your-alias
Note: you do not need to use the file extension, such as .py, when importing (see the sketch below).
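A minimal sketch of those steps, assuming a single hypothetical helper file named helpers.py (defining a function double) has been passed via --py-files at submit time; the names are illustrative only:
from pyspark import SparkContext

sc = SparkContext(appName="addpyfile-demo")  # hypothetical app name

# Load the shipped file into the execution environment.
sc.addPyFile("helpers.py")

# Import it without the .py extension, optionally under an alias.
import helpers as hp

print(sc.parallelize([1, 2, 3]).map(hp.double).collect())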
Hope this is useful.
Answered by Nathan Buesgens
Spark will also silently fail to load a zip archive that is created with the Python zipfile module. Zip archives must be created using a zip utility.
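If the archive has to be produced from Python anyway, one workaround (a sketch, assuming the zip command-line utility is installed and the dependencies/ folder from the first answer exists) is to shell out to the zip utility instead of calling zipfile:
import subprocess

# Run 'zip -r ../dependencies.zip .' from inside dependencies/ so that the
# packaged modules land at the top level of the archive.
subprocess.check_call(["zip", "-r", "../dependencies.zip", "."], cwd="dependencies")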
Answered by Graham
Try using --archives to ship your Anaconda directory to each server, and use --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= to tell Spark where the Python executable is located inside your Anaconda directory.
Our full config is this:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/anaconda-dependencies/bin/python
--archives <S3-path>/anaconda-dependencies.zip#ANACONDA
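For reference, here is a rough programmatic equivalent of that command-line configuration (a sketch for a YARN deployment: spark.yarn.dist.archives is the configuration counterpart of the --archives flag, and the S3 placeholder is kept from the answer above):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.yarn.dist.archives",
             "<S3-path>/anaconda-dependencies.zip#ANACONDA")
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON",
             "./ANACONDA/anaconda-dependencies/bin/python"))

sc = SparkContext(conf=conf)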
Answered by kpie
You can locate all the .py files you need and add their locations to sys.path relative to the script; see here for the explanation:
import os, sys, inspect
# realpath() will make your script run, even if you symlink it :)
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile( inspect.currentframe() ))[0]))
if cmd_folder not in sys.path:
sys.path.insert(0, cmd_folder)
# use this if you want to include modules from a subfolder
cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(os.path.split(inspect.getfile( inspect.currentframe() ))[0],"subfolder")))
if cmd_subfolder not in sys.path:
sys.path.insert(0, cmd_subfolder)
# Info:
# cmd_folder = os.path.dirname(os.path.abspath(__file__)) # DO NOT USE __file__ !!!
# __file__ fails if script is called in different ways on Windows
# __file__ fails if someone does os.chdir() before
# sys.argv[0] also fails because it doesn't always contain the path
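Continuing the snippet above (a sketch, not from the original answer): a zip of pure-Python dependencies sitting next to the script can be added to sys.path the same way, since Python can import pure-Python modules directly from a zip archive; compiled packages such as numpy still cannot be loaded like this.
# Reuses cmd_folder computed above; dependencies.zip is an assumed file name.
dep_zip = os.path.join(cmd_folder, "dependencies.zip")
if os.path.exists(dep_zip) and dep_zip not in sys.path:
    sys.path.insert(0, dep_zip)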